RDF Containers

In this notebook I will

  1. Explain RDF Containers
  2. Include test cases for gastrodon functions that support containers
  3. Demonstrate the use of inference rules to simplify queries against RDF Containers

RDF Containers are one of two mechanisms to represent ordered collections in RDF:

  • Containers represent the order of members using predicate names that contain the sequence numbers of the members; they work much like an ArrayList in Java.
  • Collections represent the order of members using a linked list similar to lists in LISP or like the LinkedList in Java.
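
To make the difference concrete, here is a minimal sketch in plain Python with no RDF library: triples are ordinary tuples, and the node names `ex:list` and `_:n0`, `_:n1`, ... are made up for illustration. It stores the same three items both ways and reads them back.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
items = ["Beginning", "Middle", "End"]

# Container style: one triple per member; the position lives in the predicate name.
container = {("ex:list", f"{RDF}_{i}", item) for i, item in enumerate(items, start=1)}

# Collection style: a chain of cells linked by rdf:first / rdf:rest, ending in rdf:nil.
collection = set()
cells = [f"_:n{i}" for i in range(len(items))]
for i, item in enumerate(items):
    rest = cells[i + 1] if i + 1 < len(items) else f"{RDF}nil"
    collection.add((cells[i], f"{RDF}first", item))
    collection.add((cells[i], f"{RDF}rest", rest))

def read_container(triples, node):
    """Look up rdf:_1, rdf:_2, ... directly until one is missing."""
    by_pred = {p: o for s, p, o in triples if s == node}
    out, n = [], 1
    while f"{RDF}_{n}" in by_pred:
        out.append(by_pred[f"{RDF}_{n}"])
        n += 1
    return out

def read_collection(triples, head):
    """Follow rdf:rest links from the head cell until rdf:nil."""
    lookup = {(s, p): o for s, p, o in triples}
    out = []
    while head != f"{RDF}nil":
        out.append(lookup[(head, f"{RDF}first")])
        head = lookup[(head, f"{RDF}rest")]
    return out
```

The container needs one statement per member, while the collection needs two per member (one for the value, one for the link to the next cell).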

If you're developing a new application, you may need to choose one or the other for a particular use. In this article, however, I'm working with a dataset that already uses one or the other, and the goal is to write queries against it.

(Note: the words "container" and "collection" are used loosely in things you will read. For instance, people might refer to a Python list as a "container" or as a "collection". A Java List is a subtype of Collection, whereas the equivalent vector in C++ is defined in the containers library. People writing natural language are always going to be ambiguous, and it's a challenge to be simultaneously correct and comprehensible.)

Setup

As always, I start by importing symbols and configuring pandas:

In [1]:
%load_ext autotime
import sys
sys.path.append("../..")
from gastrodon import *
from rdflib import *
import pandas as pd
pd.options.display.width=120
pd.options.display.max_colwidth=100

Representing Sets

Just to make things clear, I'll first show what happens if we simply link one topic to another with a predicate, without using either a Container or a Collection. First, I create a model that represents the five boroughs of New York City:

In [2]:
boros=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .

:New_York_City
   :boro :Manhattan,:Queens,:Brooklyn,:Bronx,:Staten_Island .
""")
time: 9 ms

Note that the comma is a shorthand notation that lets me write a number of statements that share the same subject and predicate. A Graph implements __iter__, so I can get all of the facts in it like so:

In [3]:
list(boros.graph)
Out[3]:
[(rdflib.term.URIRef('http://example.com/New_York_City'),
  rdflib.term.URIRef('http://example.com/boro'),
  rdflib.term.URIRef('http://example.com/Queens')),
 (rdflib.term.URIRef('http://example.com/New_York_City'),
  rdflib.term.URIRef('http://example.com/boro'),
  rdflib.term.URIRef('http://example.com/Staten_Island')),
 (rdflib.term.URIRef('http://example.com/New_York_City'),
  rdflib.term.URIRef('http://example.com/boro'),
  rdflib.term.URIRef('http://example.com/Manhattan')),
 (rdflib.term.URIRef('http://example.com/New_York_City'),
  rdflib.term.URIRef('http://example.com/boro'),
  rdflib.term.URIRef('http://example.com/Brooklyn')),
 (rdflib.term.URIRef('http://example.com/New_York_City'),
  rdflib.term.URIRef('http://example.com/boro'),
  rdflib.term.URIRef('http://example.com/Bronx'))]
time: 12 ms

Just as there are five boroughs, there are five facts.

In [4]:
len(boros.graph)
Out[4]:
5
time: 12 ms

Now I query the LocalEndpoint, which renders SPARQL query results as pandas DataFrames:

In [5]:
boros.select("""
    SELECT ?boro { :New_York_City :boro ?boro}
""")
Out[5]:
boro
0 :Queens
1 :Staten_Island
2 :Manhattan
3 :Brooklyn
4 :Bronx
time: 1.84 s

Note that the order in which the facts come back from the SPARQL query is arbitrary, because RDF doesn't remember the order in which statements were made. This is the right behavior in this case, because the boroughs do not come in any particular order. We can still order them alphabetically, by population, or by some other metric, so long as we have the data in the graph and write the right SPARQL query:

In [6]:
boros.select("""
    SELECT ?boro 
        { :New_York_City :boro ?boro}
    ORDER BY ?boro
""")
Out[6]:
boro
0 :Bronx
1 :Brooklyn
2 :Manhattan
3 :Queens
4 :Staten_Island
time: 25.5 ms

Another characteristic of a set is that a given topic can only be listed once. For instance, if we repeat the same fact over and over again, RDF will only capture it once:

In [7]:
boros=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .

:New_York_City
   :boro :Manhattan,:Manhattan,:Manhattan .
""")

len(boros.graph)
Out[7]:
1
time: 8.01 ms
In [8]:
boros.select("""
    SELECT ?boro { :New_York_City :boro ?boro}
""")
Out[8]:
boro
0 :Manhattan
time: 38 ms
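
The same behavior can be imitated with a plain Python set of tuples. This is only an analogy for how a graph deduplicates statements, not how rdflib stores them internally:

```python
NYC = "http://example.com/New_York_City"
BORO = "http://example.com/boro"
MANHATTAN = "http://example.com/Manhattan"

graph = set()
for _ in range(3):
    # The same (subject, predicate, object) statement is asserted three times.
    graph.add((NYC, BORO, MANHATTAN))

print(len(graph))  # 1
```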

A simple sequence example

Next I create a list of three items. This RDF graph has four statements in it, one for the rdf:Seq type, and three for the members of the list.

In [9]:
sequence=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
[] a rdf:Seq ;
    rdf:_1 "Beginning" ;
    rdf:_2 "Middle" ;
    rdf:_3 "End" .
""").graph

len(sequence)
Out[9]:
4
time: 14 ms

rdflib comes with definitions for classes and predicates in common namespaces such as <http://www.w3.org/1999/02/22-rdf-syntax-ns#>; since traditional RDF tools won't give an error if you misspell a resource URI, you should use these to avoid mistakes.

(In most cases, if you misspell a predicate or resource name, your queries will get no results)

In [10]:
RDF.Seq,RDF.type
Out[10]:
(rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'))
time: 15 ms

I used a blank node to 'name' the list, so I need a reference to the list to work with.

The rdflib Graph object (the sequence below) uses the Python slice operator in an unusual way; the fields of the slice mean:

graph[subject:predicate:object]

so if I specify the predicate and object and leave the subject out, I get the matching subjects. The one function picks out the first and only element of the resulting list.

In [11]:
lhs=one(sequence[:RDF.type:RDF.Seq])
lhs
Out[11]:
rdflib.term.BNode('ub3bL2C1')
time: 14 ms
In [12]:
endpoint=LocalEndpoint(sequence)
time: 17 ms
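
The slice lookup used above to find the sequence node is a form of triple-pattern matching. As a rough illustration in plain Python, the hypothetical helper `match` below (not part of rdflib or gastrodon) treats `None` as the field that was left out of the slice; the node name `_:seq` is also made up.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
triples = [
    ("_:seq", RDF + "type", RDF + "Seq"),
    ("_:seq", RDF + "_1", "Beginning"),
    ("_:seq", RDF + "_2", "Middle"),
    ("_:seq", RDF + "_3", "End"),
]

def match(triples, s=None, p=None, o=None):
    """Yield the triples whose fields equal every argument that is not None."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple

# Subject left out, predicate and object given: find the nodes typed as rdf:Seq.
subjects = [s for s, _, _ in match(triples, p=RDF + "type", o=RDF + "Seq")]
print(subjects)  # ['_:seq']
```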

Sometimes you might want to turn an RDF Container into a Python list so you can work on it with Python. You can do this with the decollect function.

In [13]:
endpoint.decollect(lhs)
Out[13]:
['Beginning', 'Middle', 'End']
time: 135 ms

Once you've converted a list to Python, you can take the length with the len function

In [14]:
len(endpoint.decollect(lhs))
Out[14]:
3
time: 101 ms

What if we want to write a SPARQL query to get the length? There isn't a SPARQL function to get the length of a list, but we can write our own. One thing I might try is counting the statements for which the container is the subject.

In [15]:
endpoint.select('''
    SELECT (COUNT(*) AS ?cnt) {
        ?s ?p ?o .
    }
''',bindings=dict(s=lhs))
Out[15]:
cnt
0 4
time: 22 ms

Close, but no cigar. I got four instead of three because the query also counted the statement

[] a rdf:Seq

which has nothing to do with the list members. Thus, I need to write a query that skips this statement.

One way to do that is to use the MINUS operator to remove any rdf:type (that is, a) statements.

In [16]:
endpoint.select('''
    SELECT (COUNT(*) AS ?cnt) {
        ?s ?p ?o .
        MINUS {?s a ?o}
    }
''')
Out[16]:
cnt
0 3
time: 31 ms

The above query works in this case, and works no matter how many types are associated with the container. However, nothing stops people from adding other statements where the container is the subject, and in that case we'd get a count that's too high. The following query is better, because it selects exactly for predicates of the form rdf:_....

It's ugly however, and still not 100% compliant with the standard because only predicates of the form rdf:_{number} (where the number does not start with zero) are container membership predicates. On the other hand, if somebody writes RDF like that they are asking for trouble...

In [17]:
endpoint.select('''
    SELECT (COUNT(*) AS ?cnt) {
        ?s ?p ?o .
        FILTER(STRSTARTS(STR(?p),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
    } 
''')
Out[17]:
cnt
0 3
time: 61 ms
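
Outside of SPARQL, the stricter rule (a positive integer with no leading zero after rdf:_) is easy to express as a regular expression. The helper name `is_membership_property` is an illustration only and is not a gastrodon or rdflib function:

```python
import re

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# rdf:_n, where n is a decimal integer greater than zero with no leading zeros
MEMBERSHIP = re.compile(re.escape(RDF_NS) + r"_[1-9][0-9]*")

def is_membership_property(uri):
    """Return True only for container membership properties such as rdf:_1 or rdf:_10."""
    return MEMBERSHIP.fullmatch(uri) is not None

print(is_membership_property(RDF_NS + "_3"))    # True
print(is_membership_property(RDF_NS + "_03"))   # False: leading zero
print(is_membership_property(RDF_NS + "type"))  # False: not a membership property
```

A prefix test such as STRSTARTS would accept rdf:_03, which is the gap between the query above and the standard.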

A more complex case

In [18]:
duo=inline(r"""
@prefix : <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:simple a rdf:Seq ;
    rdf:_1 "uno" ;
    rdf:_2 "dos" ;
    rdf:_3 3 ;
    rdf:_4 <http://dbpedia.org/resource/4> .

:complex a rdf:Seq ;
    rdf:_1 [
        a rdf:Seq ;
            rdf:_1 33 ;
            rdf:_2 91 ;
            rdf:_3 15 
        ] ;
    rdf:_2 [
        a rdf:Seq ;
            rdf:_1 541 ;
            rdf:_2 3 
        ].    

""")
time: 7.99 ms

RDF and SPARQL let you look at lists in a different way from most languages. For instance, the following query finds all of the lists in the model and counts how many members each has. Two of the lists have URI names; the other two are the containers inside :complex, which are represented as blank nodes.

In [19]:
duo.select("""
    SELECT ?s (COUNT(*) AS ?cnt) {
        ?s ?p ?o .
        FILTER(STRSTARTS(STR(?p),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
    } GROUP BY ?s
""")
Out[19]:
s cnt
0 :simple 4
1 ub4bL11C12 3
2 :complex 2
3 ub4bL17C12 2
time: 93 ms

Another kind of query you can write looks for all the containers that contain a certain value, for instance, the number 3.

In [20]:
duo.select("""
    SELECT ?s {
        ?s ?p 3 .
        FILTER(STRSTARTS(STR(?p),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
    }
""")
Out[20]:
s
0 :simple
1 ub4bL17C12
time: 43 ms

It starts getting ugly though, if you want to write a query that involves more than one list, say, lists that are nested. For instance, to list the items in :complex (not in order), the query is

In [21]:
duo.select("""
    SELECT ?member {
        :complex ?p1 ?innerList .
        ?innerList ?p2 ?member .
        FILTER(STRSTARTS(STR(?p1),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
        FILTER(STRSTARTS(STR(?p2),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
    }
""")
Out[21]:
member
0 15
1 33
2 91
3 3
4 541
time: 96 ms

Decollecting a single list from the model is simple; the decollect method automatically converts RDF terms into native Python data types (strings and integers)

In [22]:
duo.decollect(URIRef("http://example.com/simple"))
Out[22]:
['uno', 'dos', 3, 'http://dbpedia.org/resource/4']
time: 80 ms

Counting to Eleven

This next example tests the decollect code; in particular, it checks that we sort the container membership properties in numerical order (as opposed to alphabetical order, which would go [1,10,11,2,...]).

In [23]:
sequence_11=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .

:s11 a rdf:Seq ;
    rdf:_1 "one" ;
    rdf:_2 "two" ;
    rdf:_3 "three" ;
    rdf:_4 "four" ;
    rdf:_5 "five" ;
    rdf:_6 "six" ;
    rdf:_7 "seven" ;
    rdf:_8 "eight" ;
    rdf:_9 "nine" ;
    rdf:_10 "ten" ;
    rdf:_11 "eleven" .
""")
time: 5 ms
In [24]:
goes_to_eleven=sequence_11.decollect(URIRef("http://example.com/s11"))
goes_to_eleven
Out[24]:
['one',
 'two',
 'three',
 'four',
 'five',
 'six',
 'seven',
 'eight',
 'nine',
 'ten',
 'eleven']
time: 139 ms
In [25]:
assert goes_to_eleven[0]=="one"
assert goes_to_eleven[1]=="two"
assert goes_to_eleven[10]=="eleven"
assert len(goes_to_eleven)==11
time: 999 µs

Just to see how you could get it wrong, the following query gives the wrong answer because RDF resources sort in alphabetical order:

In [26]:
sequence_11.select("""
   SELECT ?member {
      :s11 ?index ?member
      FILTER(STRSTARTS(STR(?index),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
   } ORDER BY(?index)
""")
Out[26]:
member
0 one
1 ten
2 eleven
3 two
4 three
5 four
6 five
7 six
8 seven
9 eight
10 nine
time: 71 ms

Don't be that guy!

To get the right order, convert the indexes to numbers, like so:

In [27]:
sequence_11.select("""
   SELECT ?member {
      :s11 ?index ?member
      FILTER(STRSTARTS(STR(?index),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
      BIND(xsd:integer(SUBSTR(STR(?index),45)) AS ?number)
   } ORDER BY(?number)
""")
Out[27]:
member
0 one
1 two
2 three
3 four
4 five
5 six
6 seven
7 eight
8 nine
9 ten
10 eleven
time: 90 ms
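
The two orderings can be reproduced in plain Python by sorting the membership property URIs as strings and then by their numeric suffix. The helper `index_of` is a hypothetical name used only for this sketch:

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
uris = [f"{RDF_NS}_{n}" for n in range(1, 12)]

def index_of(uri):
    """Return the integer after 'rdf:_' (assumes uri is a membership property)."""
    # The namespace is 43 characters long, so the digits start at 0-based
    # offset 44, which is position 45 in SPARQL's 1-based SUBSTR.
    return int(uri[len(RDF_NS) + 1:])

alphabetical = [index_of(u) for u in sorted(uris)]
numerical = [index_of(u) for u in sorted(uris, key=index_of)]

print(alphabetical)  # [1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9]
print(numerical)     # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```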

Bags

RDF has two additional container types, rdf:Bag and rdf:Alt. In both cases we have enough information to return an ordered list, but rdf:Bag is a statement that the order of the elements does not matter, while rdf:Alt is a statement that the application should choose one of the values. (An example of that could be labels in different languages)

decollect treats an rdf:Alt the same way it treats an rdf:Seq, and it would treat a container whose type is missing like a list too. (I'll leave it to you to decide which alternative to choose.) decollect has special treatment for bags, however. As an example, I make a bag that contains the words from a sentence:

In [28]:
laurie=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .

:from_the_air a rdf:Bag ;
    rdf:_1 "this" ;
    rdf:_2 "is" ;
    rdf:_3 "the" ;
    rdf:_4 "time" ;
    rdf:_5 "and" ;
    rdf:_6 "this" ;
    rdf:_7 "is" ;
    rdf:_8 "the" ;
    rdf:_9 "record" ;
    rdf:_10 "of" ;
    rdf:_11 "the" ;
    rdf:_12 "time" .
""")
time: 4 ms

Python's standard library has a collection type called Counter (in the collections module) which is intended to represent bags, so decollect converts a bag to a Counter, giving us a nice example of a "bag of words".

In [29]:
laurie.decollect(URIRef("http://example.com/from_the_air"))
Out[29]:
Counter({'and': 1,
         'is': 2,
         'of': 1,
         'record': 1,
         'the': 3,
         'this': 2,
         'time': 2})
time: 99 ms
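
For comparison, the same Counter can be built directly in Python from the words of the sentence, without going through RDF. The sketch also shows that the order of the members makes no difference to a bag:

```python
from collections import Counter

words = "this is the time and this is the record of the time".split()
bag = Counter(words)

print(bag["the"])         # 3
print(sum(bag.values()))  # 12, one per rdf:_n statement in the bag

# Shuffling the members does not change the bag.
assert Counter(reversed(words)) == bag
```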

Array Index Offsets

Some people argue over the Red Sox and the Yankees. Some people argue whether you should break an egg on the big end or the little end. Microprocessor designers make different choices about the order of bytes in larger words.

One thing programming languages have long disagreed about is how to define array indexes.

In FORTRAN, the first element of an array has an index of 1. In C, the first element of an array has an index of 0. Neither one is right or wrong, but these are two different conventions that you will encounter.

As seen in the examples above, the native indexes of RDF Containers start at one. On the other hand, Python lists start at zero:

In [30]:
x=["first","second","third"]
x[0]
Out[30]:
'first'
time: 3 ms

Nothing prevents a library in a language like Python from indexing lists any way it wants, but pandas DataFrames behave like Python lists, both in how they are displayed

In [31]:
x=pd.DataFrame([25,"or",6,2,4])
x
Out[31]:
0
0 25
1 or
2 6
3 2
4 4
time: 22 ms

and in how they are accessed:

In [32]:
x.at[0,0]
Out[32]:
25
time: 17 ms

It's an easy mistake to make: you want to access (say) the third item of a list and forget whether you are dealing with a Python list, where one would write

x[2]

as opposed to a SPARQL query in which one would write

?list rdf:_3 ?member .

There are times when one might have a Python variable with a numeric value and want to use it to index an RDF Container, and that is why gastrodon has a member function:

In [33]:
member(0)
Out[33]:
rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#_1')
time: 15 ms

It can be used as follows:

In [34]:
idx=2   # the third word!
sequence_11.select("""
   SELECT ?word { :s11 ?index ?word . }
""",bindings=dict(index=member(idx)))
Out[34]:
word
0 three
time: 26 ms

Thus if you are writing Python in the Python world, member lets you use Python indexing.
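
The offset conversion itself is small. The sketch below is not gastrodon's implementation: `member_sketch` and `python_index` are hypothetical names, and they work with plain strings rather than rdflib URIRef objects. It only mirrors the zero-based to one-based mapping shown in the output above.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def member_sketch(index):
    """Map a zero-based Python index to the rdf:_n membership property URI."""
    if index < 0:
        raise ValueError("negative indexes have no container equivalent")
    return f"{RDF_NS}_{index + 1}"

def python_index(uri):
    """Inverse mapping: rdf:_n back to a zero-based Python index."""
    return int(uri[len(RDF_NS) + 1:]) - 1

print(member_sketch(0))                 # http://www.w3.org/1999/02/22-rdf-syntax-ns#_1
print(python_index(member_sketch(10)))  # 10
```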

Conclusion

This article is a quick introduction to RDF Containers, one of two ways of representing ordered collections in RDF. I covered both the RDF Containers themselves, and also the facilities in gastrodon to work with them to do tasks such as:

  • turn an rdf:Seq or rdf:Alt into an ordinary Python list
  • turn an rdf:Bag into a Python Counter
  • write SPARQL queries that
    • count the members of a list
    • iterate over the members of a list
    • work with multiple-leveled lists, and
    • select individual members

This notebook is part of the unit tests for gastrodon and is part of a series.
