Accessing and using the RDF data in Stanbol

Instant Apache Stanbol


July 2013

Learn how to deploy Stanbol to extend content management with semantic services

Getting ready

To start with, we need a running Stanbol instance and Node.js. Additionally, we need the rdfstore-js library and the request module that the client uses to talk to Stanbol. Both can be installed from the command line:

> npm install rdfstore request

How to do it...

  1. We create a file rdf-client.js with the following code:

    var rdfstore = require('rdfstore');
    var request = require('request');
    var fs = require('fs');

    rdfstore.create(function(store) {
      // Send every file to the Stanbol enhancer, load the returned Turtle
      // into the store, and invoke callback once all files are loaded.
      function load(files, callback) {
        var filesToLoad = files.length;
        // forEach gives each asynchronous callback its own `file` value
        files.forEach(function(file) {
          fs.createReadStream(file).pipe(
            request.post({
                url: 'http://localhost:8080/enhancer?uri=file:///' + file,
                headers: {accept: "text/turtle"}
              },
              function(error, response, body) {
                if (!error && response.statusCode == 200) {
                  store.load(
                    "text/turtle",
                    body,
                    function(success, results) {
                      console.log('loaded: ' + results +
                        " triples from file " + file);
                      if (--filesToLoad === 0) {
                        callback();
                      }
                    }
                  );
                } else if (error) {
                  // No response object is available if the request itself failed
                  console.log('Request failed: ' + error);
                } else {
                  console.log('Got status code: ' +
                    response.statusCode);
                }
              }));
        });
      }

      load(['testdata.txt', 'testdata2.txt'], function() {
        // Query the labels of all referenced entities and their source documents
        store.execute(
          "PREFIX enhancer:<http://fise.iks-project.eu/ontology/> \
          PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> \
          SELECT ?label ?source { \
          ?a enhancer:extracted-from ?source. \
          ?a enhancer:entity-reference ?e. \
          ?e rdfs:label ?label.\
          FILTER (lang(?label) = \"en\") \
          }",
          function(success, results) {
            if (success) {
              console.log("*******************");
              for (var i = 0; i < results.length; i++) {
                console.log(results[i].label.value +
                  " in " + results[i].source.value);
              }
            }
          });
      });
    });

  2. Create the data files:

    Our client loads two files. The first is a simple testdata.txt file with the following content:

    "The Stanbol enhancer can detect famous cities such as Paris and people such as
    Bob Marley."

    And a second testdata2.txt file with the following content:

    "Bob Marley never had a concert in Vatican City."

  3. We execute the code from the command line using Node.js:

    > node rdf-client.js

    The output is:

    loaded: 159 triples from file testdata2.txt
    loaded: 140 triples from file testdata2.txt
    *******************
    Vatican City in file:///testdata2.txt
    Bob Marley in file:///testdata2.txt
    Bob Marley in file:///testdata.txt
    Paris, Texas in file:///testdata.txt
    Paris in file:///testdata.txt

  4. This time we see the labels of the entities and the file in which they appear.
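
If we would rather list the entities per document, the query results can be regrouped with a few lines of code. The following is a minimal sketch, assuming result rows of the shape used in rdf-client.js (`row.label.value` and `row.source.value`); the helper name `groupBySource` is made up for this illustration, and the sample rows simply repeat the output shown above.

```javascript
// Group result rows by the document in which the entity was found.
// Assumes each row looks like { label: { value: ... }, source: { value: ... } }.
function groupBySource(results) {
  var bySource = {};
  results.forEach(function (row) {
    var source = row.source.value;
    if (!bySource[source]) {
      bySource[source] = [];
    }
    bySource[source].push(row.label.value);
  });
  return bySource;
}

// Sample rows copied from the output of rdf-client.js shown above.
var sampleResults = [
  { label: { value: 'Vatican City' }, source: { value: 'file:///testdata2.txt' } },
  { label: { value: 'Bob Marley' },   source: { value: 'file:///testdata2.txt' } },
  { label: { value: 'Bob Marley' },   source: { value: 'file:///testdata.txt' } },
  { label: { value: 'Paris, Texas' }, source: { value: 'file:///testdata.txt' } },
  { label: { value: 'Paris' },        source: { value: 'file:///testdata.txt' } }
];

var grouped = groupBySource(sampleResults);
Object.keys(grouped).forEach(function (source) {
  console.log(source + ': ' + grouped[source].join('; '));
});
```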

How it works…

Unlike the usual clients, this client no longer parses the returned data as JavaScript Object Notation (JSON) but processes it as RDF. An RDF document is a directed graph. The following screenshot shows some RDF rendered as a graph by the W3C RDF Validator:

[Screenshot: an RDF graph rendered by the W3C RDF Validator]

We can create such an image by selecting RDF/XML as the output format on localhost:8080/enhancer, running the engines on some text, and copying and pasting the generated XML to www.w3.org/RDF/Validator/, where we can request that triples and a graph be generated from it. Triples are the other way to look at RDF. An RDF graph (or document) is a set of triples of the form subject-predicate-object, where the subject and the object are nodes (vertices) and the predicate is the arc (edge) connecting them. Every triple is a statement describing a property of its subject:

<urn:enhancement-f488d7ce-a1b7-faa6-0582-0826854eab5e> <http://fise.iks-project.eu/ontology/entity-reference> <http://dbpedia.org/resource/Bob_Marley> .
<http://dbpedia.org/resource/Bob_Marley> <http://www.w3.org/2000/01/rdf-schema#label> "Bob Marley"@en .

These two triples say that an enhancement references Bob Marley and that the English label for Bob Marley is "Bob Marley". All the arcs and most of the nodes are labeled with an Internationalized Resource Identifier (IRI). IRIs are a superset of the good old URLs that also allows non-Latin characters.
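
To make the triple structure concrete, the following minimal sketch models the two statements above as plain JavaScript objects and serializes them back into one line per triple. The helper names (`iri`, `literal`, `termToString`, `toNTriples`) are made up for this illustration and are not part of rdfstore-js or Stanbol; the escaping is deliberately simplified.

```javascript
// Illustrative term constructors (hypothetical helpers, not a library API).
function iri(value) {
  return { type: 'iri', value: value };
}
function literal(value, lang) {
  return { type: 'literal', value: value, lang: lang };
}

// Serialize a single term: IRIs in angle brackets, literals in quotes
// with an optional language tag. Only backslashes and double quotes are
// escaped here; newlines and other control characters are not handled.
function termToString(term) {
  if (term.type === 'iri') {
    return '<' + term.value + '>';
  }
  var escaped = term.value.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return '"' + escaped + '"' + (term.lang ? '@' + term.lang : '');
}

// One line per triple: subject, predicate, object, terminated by " ."
function toNTriples(triples) {
  return triples.map(function (t) {
    return [t.subject, t.predicate, t.object].map(termToString).join(' ') + ' .';
  }).join('\n');
}

var triples = [
  { subject: iri('urn:enhancement-f488d7ce-a1b7-faa6-0582-0826854eab5e'),
    predicate: iri('http://fise.iks-project.eu/ontology/entity-reference'),
    object: iri('http://dbpedia.org/resource/Bob_Marley') },
  { subject: iri('http://dbpedia.org/resource/Bob_Marley'),
    predicate: iri('http://www.w3.org/2000/01/rdf-schema#label'),
    object: literal('Bob Marley', 'en') }
];

console.log(toNTriples(triples));
```

Note that the object of the first triple is the subject of the second one. This shared node is what turns a flat list of statements into a graph.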

RDF can be serialized in many different formats. The two triples in the preceding listing use the N-Triples syntax. RDF/XML expresses (serializes) RDF graphs as XML documents. Originally, RDF/XML was referred to as the canonical serialization for RDF. Unfortunately, this caused some people to believe that RDF is somehow related to XML and thus inherits its flaws. Turtle is a serialization format designed specifically for RDF rather than encoding RDF into an existing format. Turtle allows the explicit listing of triples as in N-Triples, but it also supports various ways of expressing graphs in a more concise and readable fashion. JSON-LD expresses RDF graphs in JSON. At the time of writing, this specification was still a work in progress (see json-ld.org/) and different implementations were incompatible; thus, for this example, we switched the Accept header to text/turtle.

Another change in the code performing the request is that we added a uri query parameter to the requested URL:

'http://localhost:8080/enhancer?uri=file:///' + file,

This defines the IRI used as the name of the uploaded content in the result graph. If this parameter is not specified, the enhancer generates an IRI based on a hash of the content. The output would then contain less helpful lines such as the following:

Paris in urn:content-item-sha1-3b16820497aae806f289419d541c770bbf87a796
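
The two naming options can be sketched as follows. This is an illustration only: the `urn:content-item-sha1-` prefix is taken from the output line above, but the assumption that the suffix is the hex-encoded SHA-1 digest of the raw content is ours and has not been checked against Stanbol. The function names are hypothetical, and percent-encoding the uri value is an addition that rdf-client.js does not perform (it matters once file names contain spaces or characters such as `&`).

```javascript
var crypto = require('crypto');

// Assumption: hash-based name = prefix + hex SHA-1 digest of the content.
function contentItemIri(content) {
  var digest = crypto.createHash('sha1').update(content).digest('hex');
  return 'urn:content-item-sha1-' + digest;
}

// Build the enhancer URL with an explicit name for the uploaded content.
// The base URL is the local Stanbol instance used throughout this recipe.
function enhancerUrl(file) {
  return 'http://localhost:8080/enhancer?uri=' +
    encodeURIComponent('file:///' + file);
}

console.log(contentItemIri('Bob Marley never had a concert in Vatican City.'));
console.log(enhancerUrl('testdata2.txt'));
```

Whatever the exact hashing scheme is, a name derived from the content says nothing about where the content came from, which is why passing an explicit uri makes the query results easier to read.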

Roughly the first half of our code takes care of sending the files to Stanbol and storing the returned RDF. We define a function load that asynchronously enhances a bunch of files and invokes a callback function when all files have successfully been loaded.
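
The "count down and call back" pattern behind load can be shown in isolation. The following is a minimal sketch in which a timer stands in for the request to Stanbol; `forEachAsync` and `fakeEnhance` are hypothetical names introduced for this illustration, and the numbers reported are not real triple counts.

```javascript
// Run an asynchronous task for every item and call done() once all
// tasks have reported back, regardless of the order in which they finish.
function forEachAsync(items, task, done) {
  var remaining = items.length;
  var results = {};
  if (remaining === 0) {
    return done(results);
  }
  // forEach gives every callback its own `item`; with a plain
  // `for` loop and `var`, all callbacks would share one variable.
  items.forEach(function (item) {
    task(item, function (result) {
      results[item] = result;
      if (--remaining === 0) {
        done(results);
      }
    });
  });
}

// Stand-in for "post the file to the enhancer and load the result":
// reports the length of the file name after a short delay.
function fakeEnhance(file, callback) {
  var delay = file === 'testdata.txt' ? 20 : 5;
  setTimeout(function () {
    callback(file.length);
  }, delay);
}

forEachAsync(['testdata.txt', 'testdata2.txt'], fakeEnhance, function (results) {
  console.log('all files processed:', results);
});
```

Because the counter is only decremented inside the completion callbacks, the final callback fires exactly once, after the slowest task has finished.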

The second half of the code is the function that's executed once all files have been processed. At this point, we have all the triples loaded in the store. We could now programmatically access the triples one by one, but it's easier to just query for the data we're interested in. SPARQL is a query language a bit similar to SQL but designed to query triple stores rather than relational databases. In our program, we have the following query (slightly simplified here):

PREFIX enhancer:<http://fise.iks-project.eu/ontology/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?source {
?a enhancer:extracted-from ?source.
?a enhancer:entity-reference ?e.
?e rdfs:label ?label. }

The most important part is the section between curly brackets. This is a graph pattern: it looks like a graph, but with variables in place of some of the values. On execution, the SPARQL engine checks for parts of the RDF graph matching this pattern and returns a table with a column for each selected variable and a row for every matching combination of values. In our case, we iterate through the result and output the label of the entity and the document in which the entity was referenced.
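
How a graph pattern produces rows can be illustrated with a toy matcher. This is a minimal sketch of the idea, not how rdfstore-js or any real SPARQL engine is implemented: triples are plain arrays, prefixed names are treated as opaque strings, the enhancement identifiers are invented, and FILTER is not supported.

```javascript
// A tiny graph: each triple is [subject, predicate, object].
var graph = [
  ['urn:enhancement-1', 'enhancer:extracted-from', 'file:///testdata.txt'],
  ['urn:enhancement-1', 'enhancer:entity-reference', 'dbpedia:Paris'],
  ['urn:enhancement-2', 'enhancer:extracted-from', 'file:///testdata2.txt'],
  ['urn:enhancement-2', 'enhancer:entity-reference', 'dbpedia:Vatican_City'],
  ['dbpedia:Paris', 'rdfs:label', 'Paris'],
  ['dbpedia:Vatican_City', 'rdfs:label', 'Vatican City']
];

// Terms starting with '?' are variables.
function isVariable(term) {
  return term.charAt(0) === '?';
}

// Try to extend `binding` so that `pattern` matches `triple`.
// Returns the extended binding, or null if the triple does not fit.
function matchTriple(pattern, triple, binding) {
  var extended = Object.assign({}, binding);
  for (var i = 0; i < 3; i++) {
    var term = pattern[i];
    if (isVariable(term)) {
      var bound = Object.prototype.hasOwnProperty.call(extended, term);
      if (bound && extended[term] !== triple[i]) {
        return null; // variable already bound to a different value
      }
      extended[term] = triple[i];
    } else if (term !== triple[i]) {
      return null; // constant does not match
    }
  }
  return extended;
}

// Evaluate the triple patterns one after the other, keeping every
// combination of variable values that satisfies all of them.
function matchPattern(patterns, triples) {
  var bindings = [{}];
  patterns.forEach(function (pattern) {
    var next = [];
    bindings.forEach(function (binding) {
      triples.forEach(function (triple) {
        var extended = matchTriple(pattern, triple, binding);
        if (extended) {
          next.push(extended);
        }
      });
    });
    bindings = next;
  });
  return bindings;
}

var rows = matchPattern([
  ['?a', 'enhancer:extracted-from', '?source'],
  ['?a', 'enhancer:entity-reference', '?e'],
  ['?e', 'rdfs:label', '?label']
], graph);

rows.forEach(function (row) {
  console.log(row['?label'] + ' in ' + row['?source']);
});
```

Each element of rows is one row of the result table. Selecting ?label and ?source, as the query in rdf-client.js does, corresponds to reading only those two columns.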

There's more...

The advantage of RDF is that many tools can deal with the data. They range from command-line tools such as rapper (librdf.org/raptor/rapper.html) for converting data, to server applications that can store large amounts of RDF data and let us build applications on top of it.

Summary

This recipe explained the advantage of working with RDF (model-based) over the conventional JSON (syntax-based) approach. We created a client, rdf-client.js, which sent two files, testdata.txt and testdata2.txt, to Stanbol and was executed from the command line using Node.js. We saw how RDF can be rendered as a graph by the W3C validator and written down as triples. Finally, the triples were queried using SPARQL to extract the required information.
