In this notebook I will
RDF Containers are one of two mechanisms to represent ordered collections in RDF; the other is RDF Collections.
If you're developing a new application, you may need to choose one or the other for a particular use. From the viewpoint of this article, however, I'm working with a dataset that already uses one or the other, and the goal is to write queries against it.
(Note: the words "container" and "collection" are used loosely in things you will read. For instance, people might refer to the Python list as being a "container" or being a "collection". A Java List is a subinterface of Collection, whereas the equivalent vector in C++ is defined in the containers library. People writing natural languages are always going to be ambiguous, and it's a challenge to be simultaneously correct and comprehensible.)
As always, I start by importing symbols and configuring pandas:
%load_ext autotime
import sys
sys.path.append("../..")
from gastrodon import *
from rdflib import *
import pandas as pd
pd.options.display.width=120
pd.options.display.max_colwidth=100
Just to make things clear, I'll first show what happens if we simply link one topic to another with a predicate, without using either a Container or a Collection. First, I create a model that represents the five boroughs of New York City:
boros=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .
:New_York_City
:boro :Manhattan,:Queens,:Brooklyn,:Bronx,:Staten_Island .
""")
Note that the comma is a shorthand notation that lets me write a number of statements that share the same subject and predicate. A Graph implements __iter__, so I can get all of the facts in it like so:
list(boros.graph)
Just as there are five boroughs, there are five facts.
len(boros.graph)
Now I make a LocalEndpoint, which will render SPARQL query results as pandas DataFrame(s):
boros.select("""
SELECT ?boro { :New_York_City :boro ?boro}
""")
Note that the order in which the facts come back from the SPARQL query is arbitrary, because RDF doesn't remember the order in which statements were made. This is the right behavior in this case, because the boroughs do not come in any particular order, although we can order them alphabetically, by population, or by some other metric, so long as we have the data in the graph and write the right SPARQL query:
boros.select("""
SELECT ?boro
{ :New_York_City :boro ?boro}
ORDER BY ?boro
""")
Another characteristic of a set is that a given topic can only be listed once. For instance, if we repeat the same fact over and over again, RDF will only capture it once:
boros=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .
:New_York_City
:boro :Manhattan,:Manhattan,:Manhattan .
""")
len(boros.graph)
boros.select("""
SELECT ?boro { :New_York_City :boro ?boro}
""")
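The set-like behaviour shown above can be imitated in plain Python. The sketch below is only an analogy: it models each statement as a (subject, predicate, object) tuple held in a Python set, which is an assumption made for illustration and not how rdflib stores a graph internally.

```python
# Analogy only: model RDF statements as (subject, predicate, object) tuples.
# A Python set, like an RDF graph, keeps one copy of each distinct statement.
statements = set()
for _ in range(3):
    statements.add((":New_York_City", ":boro", ":Manhattan"))

# The statement was added three times but is stored once.
print(len(statements))
```

If the number of times a value occurs matters, a plain predicate like this cannot express it, which is one reason to reach for a container.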
Next I create a list of three items. This RDF graph has four statements in it: one for the rdf:Seq type, and three for the members of the list.
sequence=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
[] a rdf:Seq ;
rdf:_1 "Beginning" ;
rdf:_2 "Middle" ;
rdf:_3 "End" .
""").graph
len(sequence)
rdflib comes with definitions for classes and predicates in common namespaces such as <http://www.w3.org/1999/02/22-rdf-syntax-ns#>; since traditional RDF tools won't give an error if you misspell a resource URI, you should use these to avoid mistakes.
(In most cases, if you misspell a predicate or resource name, your queries will get no results.)
RDF.Seq,RDF.type
I used a blank node to 'name' the list, so I need a reference to the list to work with.
The rdflib Graph object (the sequence below) uses the Python slice operator in an unusual way; the fields of the slice mean:
graph[subject:predicate:object]
so if I specify the predicate and object and leave the subject out, I get the matching subjects. The one function picks out the first and only element of the resulting list.
lhs=one(sequence[:RDF.type:RDF.Seq])
lhs
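The idea of fixing some positions of a triple and leaving the others open can be shown without rdflib. The following is a toy sketch, not rdflib's implementation: the `match` function and the `_:list` label are hypothetical names invented here, and `None` stands in for an omitted field.

```python
# Toy illustration of triple-pattern matching; None plays the role of an
# omitted field. This is NOT how rdflib implements graph slicing.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

triples = [
    ("_:list", RDF_NS + "type", RDF_NS + "Seq"),
    ("_:list", RDF_NS + "_1", "Beginning"),
    ("_:list", RDF_NS + "_2", "Middle"),
    ("_:list", RDF_NS + "_3", "End"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Leave the subject open, fix the predicate and object: find the rdf:Seq nodes.
subjects = [s for (s, _, _) in match(triples, predicate=RDF_NS + "type", obj=RDF_NS + "Seq")]
print(subjects)
```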
endpoint=LocalEndpoint(sequence)
Sometimes you might want to turn an RDF Container into a Python list so you can work on it with Python. You can do this with the decollect function.
endpoint.decollect(lhs)
Once you've converted a list to Python, you can take the length with the len function:
len(endpoint.decollect(lhs))
What if we want to write a SPARQL query to get the length? There isn't a SPARQL function to get the length of a list, but we can write our own. One thing I might try is counting the statements for which the container is the subject.
endpoint.select('''
SELECT (COUNT(*) AS ?cnt) {
?s ?p ?o .
}
''',bindings=dict(s=lhs))
Close, but no cigar. I got four instead of three because it counted the statement [] a rdf:Seq, which has nothing to do with the list members. Thus, I need to write a query that skips this statement.
One way to do that is to apply the negation operator (MINUS) to remove any rdf:type (a) statements.
endpoint.select('''
SELECT (COUNT(*) AS ?cnt) {
?s ?p ?o .
MINUS {?s a ?o}
}
''')
The above query works in this case, and works no matter how many types are associated with the container. However, nothing stops people from adding more statements where the container is the subject, and in that case we'd get a count that's too high. The following query is better, because it selects exactly for predicates of the form rdf:_...
It's ugly, however, and still not 100% compliant with the standard, because only predicates of the form rdf:_{number} (where the number is a positive integer that does not start with zero) are container membership predicates. On the other hand, if somebody writes RDF like that they are asking for trouble...
endpoint.select('''
SELECT (COUNT(*) AS ?cnt) {
?s ?p ?o .
FILTER(STRSTARTS(STR(?p),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
}
''')
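For comparison, the stricter rule (rdf:_ followed by a positive integer with no leading zero) is easy to state as a regular expression in Python. This is a sketch under that reading of the rule; `membership_index` is a hypothetical helper name and is not part of gastrodon or rdflib.

```python
import re

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# rdf:_n where n is a positive integer written without leading zeros.
MEMBERSHIP = re.compile(re.escape(RDF_NS) + r"_([1-9][0-9]*)")

def membership_index(uri):
    """Return n for an rdf:_n predicate URI, or None for anything else."""
    m = MEMBERSHIP.fullmatch(uri)
    return int(m.group(1)) if m else None

print(membership_index(RDF_NS + "_3"))    # a membership predicate
print(membership_index(RDF_NS + "_03"))   # leading zero: not one
print(membership_index(RDF_NS + "type"))  # rdf:type: not one
```

The prefix test used in the SPARQL query accepts the second case, which is the gap the paragraph above describes.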
duo=inline(r"""
@prefix : <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:simple a rdf:Seq ;
rdf:_1 "uno" ;
rdf:_2 "dos" ;
rdf:_3 3 ;
rdf:_4 <http://dbpedia.org/resource/4> .
:complex a rdf:Seq ;
rdf:_1 [
a rdf:Seq ;
rdf:_1 33 ;
rdf:_2 91 ;
rdf:_3 15
] ;
rdf:_2 [
a rdf:Seq ;
rdf:_1 541 ;
rdf:_2 3
].
""")
RDF and SPARQL let you look at lists in a different way from most languages. For instance, the following query finds all of the lists in the model and counts how many members each has. Two of the lists have URI names; the other two are the containers inside :complex, which are represented as blank nodes.
duo.select("""
SELECT ?s (COUNT(*) AS ?cnt) {
?s ?p ?o .
FILTER(STRSTARTS(STR(?p),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
} GROUP BY ?s
""")
Another kind of query you can write looks for all the containers that contain a certain value, for instance, the number 3.
duo.select("""
SELECT ?s {
?s ?p 3 .
FILTER(STRSTARTS(STR(?p),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
}
""")
It starts getting ugly, though, if you want to write a query that involves more than one list, say, lists that are nested. For instance, to list the items in :complex (not in order), the query is
duo.select("""
SELECT ?member {
:complex ?p1 ?innerList .
?innerList ?p2 ?member .
FILTER(STRSTARTS(STR(?p1),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
FILTER(STRSTARTS(STR(?p2),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
}
""")
Decollecting a single list from the model is simple; the decollect method automatically converts RDF terms into native Python data types (strings and integers):
duo.decollect(URIRef("http://example.com/simple"))
This next example tests the decollect code; in particular, it checks that we sort the container membership properties in numerical order (as opposed to alphabetical order, which would go [1,10,11,2,...]).
sequence_11=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .
:s11 a rdf:Seq ;
rdf:_1 "one" ;
rdf:_2 "two" ;
rdf:_3 "three" ;
rdf:_4 "four" ;
rdf:_5 "five" ;
rdf:_6 "six" ;
rdf:_7 "seven" ;
rdf:_8 "eight" ;
rdf:_9 "nine" ;
rdf:_10 "ten" ;
rdf:_11 "eleven" .
""")
goes_to_eleven=sequence_11.decollect(URIRef("http://example.com/s11"))
goes_to_eleven
assert goes_to_eleven[0]=="one"
assert goes_to_eleven[1]=="two"
assert goes_to_eleven[10]=="eleven"
assert len(goes_to_eleven)==11
Just to see how you could get it wrong, the following query gives the wrong answer because RDF resources sort in alphabetical order:
sequence_11.select("""
SELECT ?member {
:s11 ?index ?member
FILTER(STRSTARTS(STR(?index),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
} ORDER BY(?index)
""")
Don't be that guy!
To get the right order, convert the indexes to numbers, like so:
sequence_11.select("""
SELECT ?member {
:s11 ?index ?member
FILTER(STRSTARTS(STR(?index),"http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
BIND(xsd:integer(SUBSTR(STR(?index),45)) AS ?number)
} ORDER BY(?number)
""")
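The same pitfall and fix can be reproduced in plain Python, which also shows where the 45 in the SUBSTR call comes from. This is a standalone sketch that builds the eleven predicate URIs by hand; the variable names are made up for illustration.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PREFIX = RDF_NS + "_"

# The eleven membership predicates rdf:_1 ... rdf:_11 as URI strings.
predicates = [PREFIX + str(n) for n in range(1, 12)]

# Sorting the URIs as strings compares them character by character.
as_text = [p[len(PREFIX):] for p in sorted(predicates)]

# Sorting on the integer after the prefix restores list order.
as_numbers = [
    p[len(PREFIX):]
    for p in sorted(predicates, key=lambda p: int(p[len(PREFIX):]))
]

# SPARQL's SUBSTR counts from 1, so the digits start one past the prefix length.
first_digit_position = len(PREFIX) + 1

print(as_text)
print(as_numbers)
print(first_digit_position)
```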
RDF has two additional container types, rdf:Bag and rdf:Alt. In both cases we have enough information to return an ordered list, but rdf:Bag is a statement that the order of the elements does not matter, while rdf:Alt is a statement that the application should choose one of the values. (An example of that could be labels in different languages.)
decollect treats an rdf:Alt the same way it treats an rdf:Seq, and it would treat a container that is missing a type like a list too. (I'll leave it to you which alternative you want to choose.) decollect has special treatment for bags. For an example, I make a bag that contains the words from a sentence:
laurie=inline(r"""
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.com/> .
:from_the_air a rdf:Bag ;
rdf:_1 "this" ;
rdf:_2 "is" ;
rdf:_3 "the" ;
rdf:_4 "time" ;
rdf:_5 "and" ;
rdf:_6 "this" ;
rdf:_7 "is" ;
rdf:_8 "the" ;
rdf:_9 "record" ;
rdf:_10 "of" ;
rdf:_11 "the" ;
rdf:_12 "time" .
""")
Python's standard library has a collection type called Counter, which is intended to represent bags, so decollect converts a bag to a Counter, giving us a nice example of a "bag of words".
laurie.decollect(URIRef("http://example.com/from_the_air"))
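For readers who haven't met Counter before, here is what a bag of the same twelve words looks like when built directly in Python. This is a standalone sketch using only `collections.Counter`; it does not call gastrodon, so it says nothing about the exact form of the value decollect returns.

```python
from collections import Counter

# The same twelve words that the rdf:Bag above contains.
words = "this is the time and this is the record of the time".split()
bag = Counter(words)

# A Counter maps each distinct element to the number of times it occurs.
print(bag["the"])
print(bag.most_common(1))

# Asking for a word that is not in the bag gives zero rather than an error.
print(bag["absent"])
```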
Some people argue over the Red Sox and the Yankees. Some people argue whether you should break an egg on the big end or the little end. Microprocessor designers make different choices about the order of bytes in larger words.
One thing programming languages have long disagreed about is how to define array indexes.
In FORTRAN, the first element of an array has an index of 1. In C, the first element of an array has an index of 0. Neither one is right or wrong, but these are two different conventions that you will encounter.
As seen in the examples above, the native indexes of RDF Containers start at one. On the other hand, Python lists start at zero:
x=["first","second","third"]
x[0]
Nothing prevents a library in a language like Python from indexing lists any way it wants, but Pandas frames behave like Python lists, both in how they are displayed
x=pd.DataFrame([25,"or",6,2,4])
x
and in how they are accessed:
x.at[0,0]
An easy mistake to make is to want (say) the third item of a list and to confuse a Python list, where one would write
x[2]
as opposed to a SPARQL query in which one would write
?list rdf:_3 ?member .
There are times when one might have a Python variable with a numeric value and want to use it to index an RDF container, and that is why gastrodon has a member function:
member(0)
Which can be used as follows:
idx=2 # the third word!
sequence_11.select("""
SELECT ?word { :s11 ?index ?word . }
""",bindings=dict(index=member(idx)))
Thus, if you are writing Python in the Python world, member lets you use Python indexing.
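The conversion between the two conventions is a one-line shift. The sketch below shows the idea with a hypothetical helper, `membership_uri`, which returns a plain string; it is a stand-in for illustration and not gastrodon's member function, whose return type and error handling may differ.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def membership_uri(index):
    """Map a zero-based Python index to the one-based rdf:_n predicate URI."""
    if index < 0:
        # Assumption for this sketch: negative (from-the-end) indexes are rejected.
        raise ValueError("negative indexes have no rdf:_n equivalent")
    return f"{RDF_NS}_{index + 1}"

# Python index 0 is the first element, which RDF calls rdf:_1.
print(membership_uri(0))

# Python index 2 is the third element, which RDF calls rdf:_3.
print(membership_uri(2))
```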
This article is a quick introduction to RDF Containers, one of two ways of representing ordered collections in RDF. I covered both the RDF Containers themselves and the facilities in gastrodon to work with them, for tasks such as converting an rdf:Seq or rdf:Alt to an ordinary Python list and converting an rdf:Bag into a Python Counter.
This notebook is part of the unit tests for gastrodon and is part of a series.