Working with RDF

The foundation of the semantic web technology stack as conceived by the W3C is the Resource Description Framework (RDF).

clj-plaza tries to make the use of RDF in Clojure easy and idiomatic.

RDF is several things at the same time:

  • a data model based on a graph of linked URIs
  • a vocabulary for data description
  • a formal system defining the validity of the described data and inference rules

The RDF data model is based on two main building blocks: URIs and RDF literals. URIs can be used to
identify any resource to be described. Resources can have properties, also identified by URIs,
linking resources or associating values to a resource. The primitive values of the properties are
described using RDF literals.

The graph of relations among resources and literal values can be serialized as a set of triples with
subject, predicate and object, where the URI in a triple object can be the subject of a different triple.

For instance, an RDF graph with a single edge, linking the page http://www.example.org/index.html to its creator, can be serialized as:

<http://www.example.org/index.html>
      <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/85740> .
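
The subject, predicate and object structure of this serialization can be sketched in plain Clojure. This is only an illustration that assumes a triple is held as a three-element vector of URI strings; triple->statement and page-triple are hypothetical names and are not part of clj-plaza.

```clojure
;; Hypothetical helper (not part of clj-plaza): render one [s p o] vector
;; of URI strings as a single N-Triples-style statement.
(defn triple->statement [[s p o]]
  (str "<" s "> <" p "> <" o "> ."))

;; The single edge of the example graph: page, creator property, staff member.
(def page-triple
  ["http://www.example.org/index.html"
   "http://purl.org/dc/elements/1.1/creator"
   "http://www.example.org/staffid/85740"])

(println (triple->statement page-triple))
```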

Describing sets of triples

In clj-plaza, you describe a triple as a vector with three components, and an RDF graph as a vector
of triples. The previous graph can be encoded with clj-plaza using the following code:

(use 'plaza.rdf.core)
(make-triples [["http://www.example.org/index.html"
                  "http://purl.org/dc/elements/1.1/creator"
                  "http://www.example.org/staffid/85740"]])

This way of describing data is quite verbose, so it is a good idea to register the namespaces we are
going to use and make the description of data less tedious. clj-plaza offers some
functions for dealing with namespaces:

(register-rdf-ns :ex "http://www.example.org/")
(register-rdf-ns :dc "http://purl.org/dc/elements/1.1/")
(make-triples [[[:ex :index.html] [:dc :creator] [:ex "staffid/85740"]]])
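
To make the effect of a registered prefix concrete, the following plain Clojure sketch expands a [prefix local-name] pair into a full URI. It is a conceptual illustration only: ns-registry and expand-term are hypothetical names, and the sketch makes no claim about how clj-plaza stores or resolves namespaces internally.

```clojure
;; Hypothetical prefix registry (an assumption for this sketch).
(def ns-registry
  {:ex "http://www.example.org/"
   :dc "http://purl.org/dc/elements/1.1/"})

;; Expand a [prefix local] pair into a full URI string.
;; Anything that is not a vector is returned unchanged.
(defn expand-term [term]
  (if (vector? term)
    (let [[prefix local] term]
      ;; `name` works for both keywords and strings
      (str (get ns-registry prefix) (name local)))
    term))

(println (map expand-term [[:ex :index.html] [:dc :creator] [:ex "staffid/85740"]]))
```

Note that the prefix has to end where the local name begins: a prefix that already contains index.html would produce a duplicated path segment.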

We can also set up a default namespace so all the unqualified strings will be resolved using this
default namespace:

(alter-root-rdf-ns "http://www.example.org/")
(make-triples [[:index.html [:dc :creator] "staffid/85740"]])

Modifying the default namespace this way can cause problems in a multithreaded environment, so it
should only be used when setting up the environment at the start of the application.

clj-plaza includes a different mechanism for temporarily changing the default namespace on a
per-thread basis, the with-rdf-ns macro:

(with-rdf-ns "http://www.example.org/"
   (make-triples [[:index.html [:dc :creator] "staffid/85740"]]))
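
The difference between a global change and a scoped, per-thread change can be illustrated with a Clojure dynamic var. This is a sketch of the general technique, under the assumption that a scoped macro such as with-rdf-ns behaves like binding; *default-ns* and resolve-local are hypothetical names, not clj-plaza definitions.

```clojure
;; Hypothetical dynamic var holding the default namespace.
(def ^:dynamic *default-ns* "http://plaza.org/ontologies/")

;; Resolve an unqualified local name against the current default namespace.
(defn resolve-local [local]
  (str *default-ns* (name local)))

;; Inside `binding` the new value is visible only to the current thread
;; and only for the dynamic extent of the body.
(def inside
  (binding [*default-ns* "http://www.example.org/"]
    (resolve-local :index.html)))

;; Outside the body the root value is back in effect.
(def outside (resolve-local :index.html))

(println inside)
(println outside)
```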

Sometimes you need to build a graph where many different properties relate to the same
resource. For instance, a graph where the resource a is linked to b, c and d through the properties
p, r and s can be described using a collection of property-object pairs associated with the same subject:

(make-triples [[:a [:p :b
                      :r :c
                      :s :d]]])
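
The compact form above stands for one triple per property-object pair. A minimal plain Clojure sketch of that expansion follows; expand-subject is a hypothetical helper used only to show the equivalence, not a clj-plaza function.

```clojure
;; Expand [subject [p1 o1 p2 o2 ...]] into one [s p o] triple per pair.
(defn expand-subject [[s pairs]]
  (vec (for [[p o] (partition 2 pairs)]
         [s p o])))

(println (expand-subject [:a [:p :b
                              :r :c
                              :s :d]]))
```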

Literals and typed literals

When describing a graph, some properties will be relations between resources, identified by URIs,
and other properties will link resources to data describing them. These latter values
are inserted into the RDF graph as literal nodes. Literal nodes can have an associated type, as described in the
RDF specification.

We can use the functions rdf-literal and rdf-typed-literal to add some
values to a graph:

(def *book-uri* "http://www.amazon.com/Art-Metaobject-Protocol-Gregor-Kiczales/dp/0262610744")
(def *book-title* "The Art of the Metaobject Protocol")
(def *publication* (let [c (java.util.Calendar/getInstance)]
                            ;; Calendar months are zero-based, so December is 11
                            (.set c 1991 java.util.Calendar/DECEMBER 1) c))
(make-triples
  [[ *book-uri* [ [:dc :title] (rdf-literal *book-title*)
                  [:dc :date]  (rdf-typed-literal *publication*
                                                    :datetime) ]]])

To make this representation more compact, we can use the l and d shortcuts
for the rdf-literal and rdf-typed-literal functions. It is also not
necessary to pass the datatype as an argument; it will be inferred automatically.

With these modifications, it is possible to specify the same graph as:

(def *book-uri* "http://www.amazon.com/Art-Metaobject-Protocol-Gregor-Kiczales/dp/0262610744")
(def *book-title* "The Art of the Metaobject Protocol")
(def *publication* (let [c (java.util.Calendar/getInstance)]
                            ;; Calendar months are zero-based, so December is 11
                            (.set c 1991 java.util.Calendar/DECEMBER 1) c))
(make-triples
  [[ *book-uri* [ [:dc :title] (l *book-title*)
                  [:dc :date]  (d *publication*) ]]])

Other possibilities include passing a language tag for the strings in literal values and
using the date function to build date literals directly:

(def *book-uri* "http://www.amazon.com/Art-Metaobject-Protocol-Gregor-Kiczales/dp/0262610744")
(def *book-title* "The Art of the Metaobject Protocol")

(make-triples
  [[ *book-uri* [ [:dc :title] (l *book-title* :en)
                  [:dc :date]  (date 1991 12 1) ]]])

Models

Collections of triples can be stored in an intermediate shared memory area called a model, to which
model operations can be applied. All
model operations can be invoked safely from different threads.

A default model is bound to *rdf-model*. By default, every model operation
manipulates this shared model.

New models can be created using the defmodel macro. You can also pass forms as
arguments to defmodel; model operations in these forms will be applied to the model
being defined:

(def *ml* (defmodel
             (model-add-triples (make-triples [[:a :b (d 2)]]))
             (model-add-triples (make-triples [[:e :f (l "test")]]))))

Triples stored inside a model can always be recovered using the model-to-triples
function. These triples can be manipulated without affecting the model from which they were extracted.

(= 2 (count (model-to-triples *ml*)))

When several models are defined, the model where operations will be applied can be chosen using the
with-model macro:

(with-model *ml*
             (model-remove-triples (make-triples [[:a :b (d 2)]])))

As a convenience, any model operation also accepts the plain vector representation of a set of
triples instead of a triple set built with make-triples:

(with-model *ml*
             (model-remove-triples [[:e :f (l "test")]]))
(= 0 (count (model-to-triples *ml*)))
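
The behaviour described in this section (a shared, thread-safe container whose extracted triples are independent snapshots) can be mimicked in plain Clojure with an atom. This is only an analogy under that assumption; new-model, add-triples!, remove-triples! and model->triples are hypothetical names, and real models are backed by Jena or Sesame rather than by an atom.

```clojure
;; Hypothetical stand-in for a model: an atom holding a set of triples.
(defn new-model [] (atom #{}))

;; swap! applies each update atomically, so concurrent callers are safe.
(defn add-triples! [model triples]
  (swap! model into triples))

(defn remove-triples! [model triples]
  (swap! model (fn [current] (reduce disj current triples))))

;; Return an immutable snapshot; later changes to the model do not affect it.
(defn model->triples [model]
  (vec @model))

(def m (new-model))
(add-triples! m [[:a :b 2] [:e :f "test"]])
(def snapshot (model->triples m))
(remove-triples! m [[:a :b 2]])

(println (count snapshot) (count (model->triples m)))
```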

Input/Output

Triples inside a model can be serialized to different formats using the
model-to-format function.

Currently supported formats are:

  • RDF/XML
  • N3
  • Turtle

The following example code:

(def *m* (with-rdf-ns "http://test.com/"
               (defmodel (model-add-triples [[:a :b (d 1)]]))))
(model-to-format *m* :xml)

will generate the following output:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://test.com/" >
  <rdf:Description rdf:about="http://test.com/a">
    <j.0:b rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</j.0:b>
  </rdf:Description>
</rdf:RDF>

In the same way, the model can be serialized to other formats, such as N3:

(model-to-format *m* :n3)

which generates:

<http://test.com/a>
      <http://test.com/b> "1"^^<http://www.w3.org/2001/XMLSchema#int> .

Creating models from input streams is also possible using the document-to-model
function:

(document-to-model (java.io.FileInputStream. "/path/to/model.xml") :xml)

Parsing RDFa documents

RDFa is a way of embedding an RDF
graph in an XHTML document. This format allows data, formatting information and metadata to be encoded
in a single document.

The function document-to-model can be used to retrieve data from any web document,
given its URL.

For instance, the RDF graph annotated in the HTML document containing this great presentation
by John Breslin about the social semantic web can be retrieved and parsed with the following code:

(document-to-model "http://www.slideshare.net/Cloud/the-social-semantic-web-presentation" :html)

If we now try to output the graph in a different format:

(model-to-format *m* :n3)

The following document is generated:

@prefix dc:      <http://purl.org/dc/terms/> .
@prefix hx:      <http://purl.org/NET/hinclude> .
@prefix media:   <http://search.yahoo.com/searchmonkey/media/> .
@prefix og:      <http://opengraphprotocol.org/schema/> .
@prefix fb:      <http://developers.facebook.com/schema/> .

<http://www.slideshare.net/Cloud/the-social-semantic-web-presentation>
      fb:app_id "2490221586"@en ;
      og:image "http://cdn.slidesharecdn.com/200801027agalwayietcompsoc-1227870978372704-8-thumbnail-2?1227923401"@en ;
      og:site_name "SlideShare"@en ;
      og:title "The Social Semantic Web"@en ;
      og:url  "http://www.slideshare.net/Cloud/the-social-semantic-web-presentation"@en ;
      dc:creator "John Breslin"@en ;
      dc:description "IET Ireland Network and NUI Galway CompSoc Talk / DERI, NUI Galway / 27th November 2008"@en ;
      media:height "355"@en ;
      media:presentation <http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=200801027agalwayietcompsoc-1227870978372704-8&stripped_title=the-social-semantic-web-presentation> ;
      media:thumbnail <http://cdn.slidesharecdn.com/200801027agalwayietcompsoc-1227870978372704-8-thumbnail?1227923401> ;
      media:title "The Social Semantic Web"@en ;
      media:width "425"@en ;
      <http://www.w3.org/1999/xhtml/vocab#alternate>
              <http://www.slideshare.net/rss/latest> ;
      <http://www.w3.org/1999/xhtml/vocab#icon>
              <http://www.slideshare.net/favicon.ico> ;
      <http://www.w3.org/1999/xhtml/vocab#stylesheet>
              <http://public.slidesharecdn.com/v3/styles/combined.css?1273154379> .

An InputStream is also a valid argument instead of a URI string. In the same way, for
formats other than RDFa, it is also possible to provide a URI pointing to the document to
be parsed.

Triple sets processing

Once a set of triples is built, or retrieved from a document, it is just a vector of triples.
This means that operations like map or filter can be applied to the
collection of triples as to any other Clojure collection. Each triple is also just a vector with
three components: subject (first component), predicate (second component) and object (third
component).

Nevertheless, each component is different from the representation provided to build the triple
collection. clj-plaza provides a number of functions to manipulate these
objects.

For example, the functions subject, predicate and object,
and their shortcuts s, p and o, can be used to extract the
components of a triple.

To retrieve all the subjects of a set of triples, we could write the
following code:

(def *ts* (make-triples [[:a :b (l "test")]
                            [:c [:d (d 1)
                                 :e (d 3.6)]]]))
(map #(s %1) *ts*)

The output of this code is not [:a :c :c] but a list of resource objects.

We could use the function resource-uri to obtain the URI of the resources:

(def *ts* (make-triples [[:a :b (l "test")]
                            [:c [:d (d 1)
                                 :e (d 3.6)]]]))
(map #(resource-uri (s %1)) *ts*)

The output of this code is:

("http://plaza.org/ontologies/a" "http://plaza.org/ontologies/c" "http://plaza.org/ontologies/c")

In the same way we could use the function resource-qname-local to extract the local
part of the qualified URI of the resource:

(map #(resource-qname-local (s %1)) *ts*)

The output of this version of the code is:

("a" "c" "c")

Other examples of functions for retrieving data from the components of a triple set are
resource-qname-prefix, literal-value, literal-language,
literal-lexical-form or is-resource.

Using predicates

Another common problem is filtering and selecting parts of a triple set. clj-plaza
offers a small set of predicate functions that make it easy to query a set of triples. All these
functions are included in the namespace plaza.rdf.predicates.

For example, to retrieve all the triples in a triple set that have a literal as the
object, we could write the following code:

(use 'plaza.rdf.predicates)
(filter (triple-check (object? (is-literal?))) *ts*)

All the predicates are introduced by the function triple-check or the shortcut
tc.

Predicates can be combined using and? and or?, which makes queries quite
expressive:

(filter (triple-check
                  (and?  (subject? (qname-local? "a"))
                          (object? (and?  (is-literal?)
                                            (regex? #"test[a-z]*")))))
          *ts*)
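
The idea behind composable predicates can be sketched with ordinary higher-order functions over plain vectors. The names subject-is?, object-matches? and all-of? are hypothetical and only mirror the shape of the query above; the real predicates in plaza.rdf.predicates operate on resource and literal objects rather than on keywords and strings.

```clojure
;; Each helper returns a function that checks one triple.
(defn subject-is? [x]
  (fn [[s _ _]] (= s x)))

(defn object-matches? [re]
  (fn [[_ _ o]] (and (string? o) (some? (re-find re o)))))

;; Combine checks: the triple must satisfy every one of them.
(defn all-of? [& checks]
  (fn [triple] (every? #(% triple) checks)))

(def sample [[:a :b "test"]
             [:a :c 1]
             [:c :d "testing"]])

(def matches
  (filter (all-of? (subject-is? :a)
                   (object-matches? #"test[a-z]*"))
          sample))

(println matches)
```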
 

Two special predicates are fn-apply? and fn-triple-apply?. They receive as an argument a
function that returns a boolean value and is called with a single argument: the current
value or the triple being filtered, respectively. These predicates can be used to make custom checks on the triple
set.

For example, we could write a query for retrieving all the triples with an object containing a
literal of type integer and value less than 5:

(filter (triple-check (object-and? (datatype? :int)
                                       (fn-apply? #(< %1 5))))
          *ts*)

Patterns, queries, triple sets and models

clj-plaza predicates are only a lightweight way of querying an RDF graph encoded as
a set of triples. The standard way of querying RDF graphs is the SPARQL query language.

The support for SPARQL included in the library can be found in the plaza.rdf.sparql
namespace.

The first concept introduced in this namespace is that of a pattern. A pattern can be thought of as an
RDF graph where some nodes have been replaced by variables. A pattern can later be transformed
into a query object, in the same way that a clj-plaza triple set can be transformed
into a model object.

Patterns are just plain Clojure vectors where each element is also a vector with subject, predicate
and object.

Furthermore, a triple set can be transformed into a pattern by abstracting some of its values, and a pattern
can be transformed back into a triple set by binding all the variables in the pattern.

A pattern can also be applied to a triple set or model to obtain a collection of triple sets where
the pattern variables have been bound to values from the triple set.

Building patterns

A pattern can be built using the make-pattern function in the
plaza.rdf.sparql namespace:

(make-pattern [[:?s rdf:type :Post]])

The keyword :?s introduces a variable in the query. Any keyword starting with a
question mark is interpreted as a pattern variable. Variables such as ?a, ?b and ?c are so common that
they are predefined as special symbols by the library:

(make-pattern [[?s rdf:type :Post]])

The variables in a pattern can be retrieved using the function
pattern-collect-vars. Variables inside a pattern can be transformed into values using
the pattern-bind function, providing a map with variable substitutions:

(pattern-bind [[?a rdf:type :Post]] {?a "post-a"})
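
Conceptually, binding a pattern is a substitution of variables by values. The following plain Clojure sketch shows that substitution on vectors of keywords; bind-pattern is a hypothetical helper, and :rdf/type is just a placeholder keyword standing in for the rdf:type property.

```clojure
;; Replace every term that appears as a key in `bindings` by its value;
;; terms without a binding are left untouched.
(defn bind-pattern [pattern bindings]
  (mapv (fn [triple]
          (mapv #(get bindings % %) triple))
        pattern))

(println (bind-pattern [[:?a :rdf/type :Post]] {:?a "post-a"}))
```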

A pattern can be transformed into a collection of triple sets, where the variables of the pattern
have been bound to the values of triples in the original collection matching the pattern triples:

(pattern-apply [[:ba rdf:type :Post]
                  [:bb rdf:type :Post]]
                 [[?a rdf:type :Post]])

will yield a list with two triple sets: in the first, the variable ?a will be replaced by the
resource with local qname “ba”, and in the second, the variable will be replaced by the resource with
local qname “bb”.
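
The matching step behind such a pattern application can be sketched in plain Clojure for a single-triple pattern. This is a simplified illustration, not the library's algorithm: variable?, match-triple and match-pattern are hypothetical names, terms are plain keywords, and the result is a sequence of binding maps rather than triple sets.

```clojure
;; A variable is a keyword whose name starts with a question mark.
(defn variable? [x]
  (and (keyword? x) (= \? (first (name x)))))

;; Match one pattern triple against one data triple.
;; Returns a map of variable bindings, or nil when they do not match.
(defn match-triple [pattern triple]
  (reduce (fn [bindings [p v]]
            (cond
              (nil? bindings) nil
              (variable? p)   (if (contains? bindings p)
                                (when (= (get bindings p) v) bindings)
                                (assoc bindings p v))
              (= p v)         bindings
              :else           nil))
          {}
          (map vector pattern triple)))

;; Collect the bindings of every triple that matches the pattern.
(defn match-pattern [pattern triples]
  (keep #(match-triple pattern %) triples))

(def posts [[:ba :rdf/type :Post]
            [:bb :rdf/type :Post]
            [:bc :rdf/type :Comment]])

(println (match-pattern [:?a :rdf/type :Post] posts))
```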

In the same way, a triple set can be transformed into a pattern abstracting some of the values in
the triple set using the function triples-abstraction.

triples-abstraction receives three arguments: the triple set, a predicate selecting the triples
whose values are going to be replaced by variables, and a map from the components of each
matched triple to the variables replacing them:

(triples-abstraction *ts* (subject? (uri? "http://plaza.org/ontologies/a")) {:subject :?x})

Defining queries

A pattern can be transformed into many SPARQL queries using the defquery macro.

This macro accepts a set of query building functions defining all the components of the query:

  • kind of query
  • variables to be returned
  • pattern
  • offset, limit, reduced, distinct clauses

The following sample of code defines a complex SPARQL query:

(defquery
      (query-set-vars [?s])
      (query-set-pattern (make-pattern [[?s ?p ?o]]))
      (query-set-type :select)
      (query-set-limit 2)
      (query-set-distinct)
      (query-set-offset 5)
      (query-set-reduced))

We can later use the function query-to-string to obtain the string representation of
the query:

"SELECT DISTINCT REDUCED ?s WHERE {?s ?p ?o .} OFFSET 5 LIMIT 2"
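
How query components combine into a query string can be sketched in plain Clojure. This is a deliberately reduced illustration that handles only SELECT, a basic pattern, OFFSET and LIMIT; term->sparql and select-query are hypothetical helpers, every keyword is assumed to be a variable, and the real query-to-string output may differ in spacing and clause support.

```clojure
(require '[clojure.string :as str])

;; Keywords are treated as variables (":?s" becomes "?s");
;; any other term is assumed to be a URI string.
(defn term->sparql [t]
  (if (keyword? t)
    (name t)
    (str "<" t ">")))

;; Build a SELECT query string from a map of components.
(defn select-query [{:keys [vars pattern offset limit]}]
  (str "SELECT " (str/join " " (map term->sparql vars))
       " WHERE { "
       (str/join " "
                 (map (fn [triple]
                        (str (str/join " " (map term->sparql triple)) " ."))
                      pattern))
       " }"
       (when offset (str " OFFSET " offset))
       (when limit (str " LIMIT " limit))))

(println (select-query {:vars [:?s]
                        :pattern [[:?s :?p :?o]]
                        :offset 5
                        :limit 2}))
```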

Some triples in a pattern can be marked as optional. These triples will only be
returned as a result by a SPARQL query or a pattern application if their variables can be bound.

Optional parts of a pattern can be defined using the optional function or the
opt shortcut:

(def *pattern* (make-pattern [[?x rdf:type :http://test.com/Test]
                                (optional [?y ?z (d 2)])]))

(defquery
              (query-set-vars [?y])
              (query-set-type :select)
              (query-set-pattern *pattern*))

If we print this query we will obtain the following string:

"SELECT ?y
WHERE { ?x rdf:type <http://test.com/Test> .
       OPTIONAL { ?y ?z '2'^^<http://www.w3.org/2001/XMLSchema#int> .} }"

Adding filters

The SPARQL standard defines a set of filter functions
that can be added to a query to filter its solutions.

The make-filter function builds these filters, which can later be applied to a query
using the query-set-filters function, which accepts a list of filters:

(def *pattern* (make-pattern [[?x rdf:type :http://test.com/Test]
                                (optional [?y ?z (d 2)])]))

(defquery
              (query-set-vars [?y])
              (query-set-type :select)
              (query-set-pattern *pattern*)
              (query-set-filters [(make-filter :> :?y (d 2))]))

make-filter accepts a keyword identifying the filter function, followed by the
arguments to that filter function.

Some examples of the available filter functions are +, -, *,
div, str, lang, bound and
isLiteral.

Using queries

Queries can be used to match variables in models. The function used is
model-query. This function receives a model and a query and returns a list of maps binding
the variables in the query to the matching values:

(def *q* (defquery
                (query-set-type :select)
                (query-set-vars [:?x])
                (query-set-pattern
                 (make-pattern [[:?x "a" (d 2)]]))))

(def *ts* (make-triples [[:m :a (d 2)]
                         [:n :b (d 2)]
                         [:o :a (d 2)]
                         [:p :a (d 3)]]))

(def *m* (defmodel
           (model-add-triples *ts*)))

(model-query *m* *q*)

The previous code sample will return the following list of binding maps as the result:

({:?x #<ResourceImpl http://plaza.org/ontologies/o>}
 {:?x #<ResourceImpl http://plaza.org/ontologies/m>})

Alternatively, the function model-query-triples can be used to retrieve the triple sets
where the binding maps have been applied.

Backend support

One of the main problems when working with RDF and other semantic technologies
on the JVM is the lack of a shared standard API for the W3C semantic recommendations.
Plaza's protocols, types and other abstractions are defined in terms of two of the
most used implementations of these standards: Jena and OpenRDF Sesame.

Plaza also supports different triple stores where RDF graphs can be stored and
retrieved.

Jena

Jena is a widely used Java library for
working with RDF graphs, SPARQL and triple stores. It was developed by HP and is
licensed under an open source license.

The Jena implementation can be found at plaza.rdf.implementations.jena.

Sesame

Sesame is a
well known RDF framework and triple store distributed by Aduna under an open
source license.

The Sesame implementation can be found at plaza.rdf.implementations.sesame.

The Jena and Sesame jar libraries are included as Plaza dependencies, so they should
already be available in your classpath if you are using Leiningen or Maven to
manage your project.

Before Plaza can be used, the library must be initialized using either Jena or
Sesame as the backend library. This can be achieved using the
init-jena-framework or init-sesame-framework functions. Switching
from one backend to the other does not affect the rest of the code written using Plaza.

Triple stores

By default, Jena and Sesame models are stored in memory, so they can only be
persisted by serializing them to a file in some RDF format.

Nevertheless, Plaza supports different kinds of triple stores that can be used to
create models backed by persistent storage.

Currently the following triple stores are supported by
Plaza:

  • 4Store
  • Bigdata
  • Mulgara
  • AllegroGraph 4

4Store

4Store is a triple store designed by
Garlik. Setup instructions for this store can be found at the 4Store site for the cluster and
single node configurations.

Plaza support for 4Store can be found at
plaza.rdf.implementations.stores.4store.

In order to use 4Store models, Clark & Parsia’s 4Store library
must be installed and available in the classpath.

These are the required jars:

  • cp-common-fourstore-0.3.1.jar
  • cp-common-openrdf-0.2.1.jar
  • cp-common-utils-1.0.1.jar
  • openrdf-sesame-2.3.0-onejar.jar

All these jars are generated when compiling the Clark & Parsia library.

To use the model from Clojure code, the build-model function should
be used, passing an additional parameter specifying the URL of the
4Store triple store:

(use 'plaza.rdf.core)
(use 'plaza.rdf.implementations.sesame)
(use 'plaza.rdf.implementations.stores.4store)

(init-sesame-framework)

(def *m* (build-model :4store :url "http://localhost:8000/"))

Bigdata

Bigdata is an implementation of
a triple store developed by SYSTAP and distributed as open source under the GPL2
license.

Detailed documentation about the setup and configuration of the store can be
found at the Bigdata wiki.

To use Bigdata with Plaza, compile the latest version of Bigdata, place the Bigdata
jar and its dependencies in the classpath, and then use
the code in plaza.rdf.implementations.stores.bigdata to create a
new model. A properties file located in the classpath with the right configuration
for Bigdata and a file where the repository is going to be created must be
provided.

(use 'plaza.rdf.core)
(use 'plaza.rdf.implementations.sesame)
(use 'plaza.rdf.implementations.stores.bigdata)

(init-sesame-framework)

(def *m* (build-model :bigdata :properties "rdfonly.properties" :file "testbigdata.jnl"))

Mulgara

Mulgara (previously Kowari) is a triple store
developed as an open source project and available under an open
source license. Download and installation information can be found at Mulgara’s wiki.

To use Mulgara the code in the namespace
plaza.rdf.implementations.stores.mulgara should be imported and the
:rmi property with the URI for the Mulgara server must be passed as
an argument to the build-model function.

Mulgara’s jar library must be located in the classpath of the application.

(use 'plaza.rdf.core)
(use 'plaza.rdf.implementations.jena)
(use 'plaza.rdf.implementations.stores.mulgara)

(init-jena-framework)

(def *m* (build-model :mulgara :rmi "rmi://localhost/server1"))

AllegroGraph 4

AllegroGraph 4 is a
commercial RDF triple store available from Franz Inc.

AllegroGraph stores can be used with Plaza using the code at
plaza.rdf.implementations.stores.agraph.

The jar library provided by Franz Inc. must be included in the classpath, along with
the JSON library provided by json.org.

To build a new model, :server-url, :agraph-user,
:agraph-password, :catalog and
:repository must be provided.

(use 'plaza.rdf.core)
(use 'plaza.rdf.implementations.jena)
(use 'plaza.rdf.implementations.stores.agraph)

(init-jena-framework)

(def *m* (build-model :agraph :server-url "http://192.168.1.37:10035"
                        :agraph-user "test" :agraph-password "test"
                        :catalog "java-catalog" :repository "test2"))