Getting started with the DBpedia SPARQL endpoint

In this episode I begin to explore the DBpedia public SPARQL endpoint. I'll go through the following stages:

  • setting up tools
  • counting triples
  • counting predicates
  • examination of the predicate pr:skipperlastname
  • counting classes
  • examination of the class on:CareerStation

My method is a deliberate combination of systematic analysis (looking at counts, using methods that can be applied to arbitrary predicates or classes) and opportunism (looking at topics that catch my eye). DBpedia is too heterogeneous to characterize in one article, but I'll begin to uncover the dark art of writing SPARQL queries against generic databases.

Setting up tools

A first step is to import a number of symbols that we'll use to do SPARQL queries and visualize the results:

In [1]:
%load_ext autotime
import sys
sys.path.append("../..")
from gastrodon import RemoteEndpoint,QName,ttl,URIRef,inline
import pandas as pd
pd.options.display.width=120
pd.options.display.max_colwidth=100

First I'll define a few prefixes for namespaces that I want to use.

In [2]:
prefixes=inline("""
    @prefix : <http://dbpedia.org/resource/> .
    @prefix on: <http://dbpedia.org/ontology/> .
    @prefix pr: <http://dbpedia.org/property/> .
""").graph
time: 9 ms

Next I set up a SPARQL endpoint and register the above prefixes so I can use them; it is also important that I set the default graph and base_uri so we'll get good looking short results.

In [3]:
endpoint=RemoteEndpoint(
    "http://dbpedia.org/sparql/"
    ,default_graph="http://dbpedia.org"
    ,prefixes=prefixes
    ,base_uri="http://dbpedia.org/resource/"
)
time: 3 ms
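To make the effect of the prefix list and base_uri concrete, here is a minimal sketch of the kind of shortening that produces names like on:Person and <Alan_Alda>. This is not gastrodon's actual implementation; the function name shorten and the rdf entry in the prefix table are my own assumptions for illustration.

```python
def shorten(uri, prefixes, base_uri):
    """Return a short display form of `uri`, or the URI unchanged."""
    # Try the longest namespace first so nested namespaces resolve sensibly.
    for prefix, namespace in sorted(prefixes.items(), key=lambda kv: -len(kv[1])):
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    # Anything under the base URI is written as a relative name in angle brackets.
    if uri.startswith(base_uri):
        return f"<{uri[len(base_uri):]}>"
    # No registered namespace applies, so the full URI is shown.
    return uri


# Illustrative prefix table (the rdf entry is an assumption, not from the notebook).
PREFIXES = {
    "on": "http://dbpedia.org/ontology/",
    "pr": "http://dbpedia.org/property/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
BASE = "http://dbpedia.org/resource/"

print(shorten("http://dbpedia.org/ontology/Person", PREFIXES, BASE))
print(shorten("http://dbpedia.org/resource/Alan_Alda", PREFIXES, BASE))
print(shorten("http://xmlns.com/foaf/0.1/name", PREFIXES, BASE))
```

The last call shows why foaf predicates appear as full URIs in the results later on: no prefix was registered for them.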

Counting Triples

First I count how many triples there are in the main graph

In [4]:
count=endpoint.select("""
    SELECT (COUNT(*) AS ?count) { ?s ?p ?o .}
""").at[0,"count"]
count
Out[4]:
438336517
time: 2.58 s
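If you don't have gastrodon at hand, the same count can be requested over plain HTTP. The sketch below only builds the request, using the query and default-graph-uri parameters from the SPARQL 1.1 Protocol. The helper name build_sparql_request is my own, and I'm assuming the endpoint honours an Accept header asking for JSON results.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint_url, query, default_graph=None):
    """Build (but do not send) a SPARQL Protocol GET request."""
    params = {"query": query}
    if default_graph is not None:
        params["default-graph-uri"] = default_graph
    url = endpoint_url + "?" + urlencode(params)
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


COUNT_QUERY = "SELECT (COUNT(*) AS ?count) { ?s ?p ?o . }"
request = build_sparql_request(
    "http://dbpedia.org/sparql/", COUNT_QUERY, default_graph="http://dbpedia.org"
)
print(request.full_url)

# To actually run it (needs network access):
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       data = json.load(response)
#   count = int(data["results"]["bindings"][0]["count"]["value"])
```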

Counting Predicates

For the next query I make a list of common predicates; note that there are a whole lot of them! The public SPARQL endpoint has a limit of 10,000 returned rows and we are finding many more than that.

Each predicate is a relationship between a topic and either another topic or a literal value. For instance, the rdf:type predicate links a topic to another topic representing a class of which the first topic is an instance, for example:

<Alan_Alda> rdf:type on:Person .

rdfs:label, on the other hand, links topics to literal values, such as

<Alan_Alda> rdfs:label 
                "Alan Alda"@en,
                "アラン・アルダ"@ja .

Strings in RDF (like the one above) are unusual compared to other computer languages because they can contain language tags, a particularly helpful feature for multilingual databases such as DBpedia.
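Once labels come back with their language tags, a common chore is picking one label per topic. In SPARQL itself you can filter with the LANG() function; on the client side, a sketch like the following does the job. The function name pick_label and the (text, language) pair representation are assumptions of mine, not part of any library.

```python
def pick_label(labels, preferred=("en",)):
    """Pick a label by language preference.

    `labels` is an iterable of (text, language_tag) pairs; the tag may be None
    for plain literals. Returns None when no preferred language is present.
    """
    by_lang = {}
    for text, lang in labels:
        # Language tags are case-insensitive, so normalize before comparing.
        by_lang.setdefault((lang or "").lower(), text)
    for lang in preferred:
        if lang.lower() in by_lang:
            return by_lang[lang.lower()]
    return None


labels = [("Alan Alda", "en"), ("アラン・アルダ", "ja")]
print(pick_label(labels))                 # default preference: English
print(pick_label(labels, ("ja", "en")))   # Japanese first, then English
```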

In [5]:
predicates=endpoint.select("""
    SELECT ?p (COUNT(*) AS ?count) { ?s ?p ?o .} GROUP BY ?p ORDER BY DESC(?count)
""")
predicates
Out[5]:
count
p
rdf:type 113715893
http://www.w3.org/2002/07/owl#sameAs 33623696
http://purl.org/dc/terms/subject 23990506
rdfs:label 22430852
http://www.w3.org/ns/prov#wasDerivedFrom 15801285
on:wikiPageID 15797811
on:wikiPageRevisionID 15797811
http://purl.org/dc/elements/1.1/language 12845235
http://xmlns.com/foaf/0.1/primaryTopic 12845235
http://xmlns.com/foaf/0.1/isPrimaryTopicOf 12845234
rdfs:comment 12391811
on:abstract 12390578
on:wikiPageExternalLink 7772279
on:wikiPageRedirects 7632358
http://xmlns.com/foaf/0.1/name 4146581
http://purl.org/linguistics/gold/hypernym 4090049
http://www.w3.org/2004/02/skos/core#broader 3080338
http://purl.org/dc/elements/1.1/rights 2897004
on:team 2007122
on:birthDate 1740614
http://xmlns.com/foaf/0.1/depiction 1698618
on:thumbnail 1695460
pr:title 1566113
on:wikiPageDisambiguates 1537180
http://www.w3.org/2004/02/skos/core#prefLabel 1475015
http://xmlns.com/foaf/0.1/thumbnail 1448505
http://xmlns.com/foaf/0.1/gender 1418209
on:birthPlace 1330297
pr:subdivisionType 1321475
http://purl.org/dc/terms/description 1289109
... ...
pr:plantLatM 135
ns2:v4b 135
ns5:v1b 135
pr:affdate 135
on:dissolved 135
pr:game11Attendance 135
ns5:v2b 135
pr:game9Attendance 135
pr:dharmaName 135
pr:irishGridReference 135
pr:payloadCapacity 135
pr:ports 135
pr:seats1Next 135
pr:compartment 135
pr:powerout 135
pr:seats1Begin 135
pr:longMin 135
pr:rotAreaSqm 135
pr:plantLongM 135
pr:deacon 135
pr:singleTemperature 135
pr:anchor 135
pr:skipperlastname 135
pr:flavour 135
pr:bo 135
pr:buschCarTeam 135
pr:majorsites 135
ns1:v4b 135
ns10:a 135
pr:allamericans 134

10000 rows × 1 columns

time: 15.4 s

Some notes.

First of all, properties that are original to DBpedia are in two namespaces: the on namespace contains DBpedia Ontology properties, which are better organized (mapped manually) than the properties in the pr namespace, which are mapped automatically. The select function returns short names for predicates in these namespaces because I specified them in the prefix list above.

DBpedia also uses predicates that are defined in other namespaces, such as foaf and dc. Frequently these duplicate properties that are defined in DBpedia, but they facilitate interoperability with tools and data that use standard vocabularies. select would show you short names for these too if I added them to the prefix list, but I didn't, so it doesn't.

If you look closely, you might notice we got exactly 10,000 results from this last query. This is not because DBpedia uses only 10,000 distinct predicates, but because the DBpedia SPARQL endpoint has a 10,000 row result limit. This can be annoying sometimes, but it protects the endpoint from people who write crazy queries. There is a bag of tricks for dealing with this, but for the purposes of this article, 10,000 predicates is enough to get started.

This raises the question:

"How many distinct predicates are used in DBpedia?"

which is easy to answer with a SPARQL query:

In [6]:
endpoint.select("""
    SELECT (COUNT(*) AS ?count) { SELECT DISTINCT ?p { ?s ?p ?o .} }
""")
Out[6]:
count
0 60649
time: 304 ms
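One trick from the bag for getting past the 10,000 row limit is paging with LIMIT and OFFSET. Here is a minimal sketch, assuming a run_query callable that takes a SPARQL string and returns a list of rows (both that callable and the name fetch_all are hypothetical). The base query should carry an ORDER BY so that pages don't overlap, and a public endpoint may still cut off expensive pages with a timeout.

```python
def fetch_all(run_query, base_query, page_size=10000):
    """Yield rows from every page of `base_query`.

    `run_query` is any callable that takes a SPARQL string and returns
    a list of rows. Paging stops at the first page that comes back short.
    """
    offset = 0
    while True:
        page = run_query(f"{base_query} LIMIT {page_size} OFFSET {offset}")
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


# Stand-in for a real endpoint, so the sketch can run offline.
def fake_endpoint(query):
    rows = [f"p{i}" for i in range(25)]
    limit = int(query.split(" LIMIT ")[1].split(" OFFSET ")[0])
    offset = int(query.split(" OFFSET ")[1])
    return rows[offset:offset + limit]


all_rows = list(fetch_all(fake_endpoint, "SELECT ?p { ?s ?p ?o } ORDER BY ?p", page_size=10))
print(len(all_rows))
```

Adapting this to endpoint.select would mean converting each returned DataFrame into a list of rows first.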

When you have a number of "things" ordered by how prevalent they are, a cumulative distribution function is a great nonparametric method of characterizing the statistics:

In [7]:
predicates["dist"]=predicates["count"].cumsum()/count
time: 3.5 ms
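For readers who want to check the arithmetic behind cumsum()/count without pandas, here is a small sketch using only the standard library and the first four counts reported above. The variable names are mine.

```python
from itertools import accumulate

TOTAL_TRIPLES = 438336517  # the triple count from the first query

# The four most common predicates and their counts, copied from the results above.
top_counts = [
    ("rdf:type", 113715893),
    ("http://www.w3.org/2002/07/owl#sameAs", 33623696),
    ("http://purl.org/dc/terms/subject", 23990506),
    ("rdfs:label", 22430852),
]

# Running total of triples, divided by the grand total, gives the cumulative share.
running = accumulate(n for _, n in top_counts)
dist = [(name, total / TOTAL_TRIPLES) for (name, _), total in zip(top_counts, running)]

for name, share in dist:
    print(f"{name} {share:.6f}")
```

Each value is the fraction of all triples covered by that predicate and every more common one, so rdf:type alone accounts for roughly a quarter of the graph.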
In [8]:
%matplotlib inline
predicates["dist"].plot()
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x259958b4da0>
time: 519 ms

This distribution certainly looks like it has a "knee" somewhere in the teens, probably marking a transition from predicates that could apply to any topic, such as rdfs:comment, to predicates specific to certain subject areas, such as on:team.

In [9]:
predicates["dist"].head(100).plot()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x259958d2898>
time: 208 ms

Here are the top 30 predicates, which together account for more than 80% of the triples in the main graph

In [10]:
predicates.head(30)
Out[10]:
count dist
p
rdf:type 113715893 0.259426
http://www.w3.org/2002/07/owl#sameAs 33623696 0.336134
http://purl.org/dc/terms/subject 23990506 0.390864
rdfs:label 22430852 0.442037
http://www.w3.org/ns/prov#wasDerivedFrom 15801285 0.478085
on:wikiPageID 15797811 0.514126
on:wikiPageRevisionID 15797811 0.550166
http://purl.org/dc/elements/1.1/language 12845235 0.579471
http://xmlns.com/foaf/0.1/primaryTopic 12845235 0.608775
http://xmlns.com/foaf/0.1/isPrimaryTopicOf 12845234 0.638080
rdfs:comment 12391811 0.666350
on:abstract 12390578 0.694617
on:wikiPageExternalLink 7772279 0.712348
on:wikiPageRedirects 7632358 0.729760
http://xmlns.com/foaf/0.1/name 4146581 0.739220
http://purl.org/linguistics/gold/hypernym 4090049 0.748551
http://www.w3.org/2004/02/skos/core#broader 3080338 0.755578
http://purl.org/dc/elements/1.1/rights 2897004 0.762187
on:team 2007122 0.766766
on:birthDate 1740614 0.770737
http://xmlns.com/foaf/0.1/depiction 1698618 0.774612
on:thumbnail 1695460 0.778480
pr:title 1566113 0.782053
on:wikiPageDisambiguates 1537180 0.785560
http://www.w3.org/2004/02/skos/core#prefLabel 1475015 0.788925
http://xmlns.com/foaf/0.1/thumbnail 1448505 0.792230
http://xmlns.com/foaf/0.1/gender 1418209 0.795465
on:birthPlace 1330297 0.798500
pr:subdivisionType 1321475 0.801515
http://purl.org/dc/terms/description 1289109 0.804456
time: 14 ms

Looking at the tail, I find some very random sorts of properties.

In [11]:
predicates.tail()
Out[11]:
count dist
p
pr:buschCarTeam 135 0.998442
pr:majorsites 135 0.998442
ns1:v4b 135 0.998442
ns10:a 135 0.998442
pr:allamericans 134 0.998443
time: 17 ms

Here are predicates that are at the 90%, 95%, 98%, and 99% cumulative distributions, just to get a sense of what happens as things get more rare.

In [12]:
predicates[predicates["dist"]>0.9].head(1)
Out[12]:
count dist
p
on:areaTotal 179581 0.900161
time: 26.5 ms
In [13]:
predicates[predicates["dist"]>0.95].head(1)
Out[13]:
count dist
p
pr:nativeName 31226 0.95004
time: 22 ms
In [14]:
predicates[predicates["dist"]>0.98].head(1)
Out[14]:
count dist
p
pr:ordination 4839 0.980009
time: 19 ms
In [15]:
predicates[predicates["dist"]>0.99].head(1)
Out[15]:
count dist
p
pr:namedAfter 1765 0.990003
time: 25.5 ms

pr:skipperlastname (property ranked number 9993) caught my eye, so I take a look at it.

In [16]:
endpoint.select("""
    SELECT ?s ?o  { ?s pr:skipperlastname ?o  }
""")
Out[16]:
s o
0 <1989–90_Whitbread_Round_the_World_Race> English
1 <1993–94_Whitbread_Round_the_World_Race> Field
2 <1973–74_Whitbread_Round_the_World_Race> Goodwin
3 <1977–78_Whitbread_Round_the_World_Race> James
4 <1989–90_Whitbread_Round_the_World_Race> Smith
5 <1993–94_Whitbread_Round_the_World_Race> Smith
6 <1981–82_Whitbread_Round_the_World_Race> Taylor
7 <1985–86_Whitbread_Round_the_World_Race> Taylor
8 <Oryx_Quest> Thompson
9 <1973–74_Whitbread_Round_the_World_Race> Ainslie
10 <The_Race_(yachting_race)> Lewis
11 <1977–78_Whitbread_Round_the_World_Race> Watts
12 <Volvo_Baltic_Race> Williams
13 <1981–82_Whitbread_Round_the_World_Race> Williams
14 <1977–78_Whitbread_Round_the_World_Race> Williams
15 <1973–74_Whitbread_Round_the_World_Race> Williams
16 <1989–90_Whitbread_Round_the_World_Race> Dubois
17 <1989–90_Whitbread_Round_the_World_Race> Edwards
18 <Volvo_Baltic_Race> Mortensen
19 <1993–94_Whitbread_Round_the_World_Race> Riley
20 <1985–86_Whitbread_Round_the_World_Race> Salmon
21 <1989–90_Whitbread_Round_the_World_Race> Salmon
22 <1989–90_Whitbread_Round_the_World_Race> Dalton
23 <1993–94_Whitbread_Round_the_World_Race> Dalton
24 <The_Race_(yachting_race)> Dalton
25 <1977–78_Whitbread_Round_the_World_Race> Francis
26 <1977–78_Whitbread_Round_the_World_Race> Ridgway
27 <1993–94_Whitbread_Round_the_World_Race> Dickson
28 <1985–86_Whitbread_Round_the_World_Race> Berner
29 <1993–94_Whitbread_Round_the_World_Race> Conner
... ... ...
105 <1973–74_Whitbread_Round_the_World_Race> Laucht
106 <1985–86_Whitbread_Round_the_World_Race> Lugt
107 <1993–94_Whitbread_Round_the_World_Race> Maisto
108 <1981–82_Whitbread_Round_the_World_Race> Malingri
109 <1973–74_Whitbread_Round_the_World_Race> Malingri
110 <1989–90_Whitbread_Round_the_World_Race> Mallé
111 <1981–82_Whitbread_Round_the_World_Race> Mcgown-Fyfe
112 <1973–74_Whitbread_Round_the_World_Race> Myatt
113 <1985–86_Whitbread_Round_the_World_Race> Norsk
114 <1981–82_Whitbread_Round_the_World_Race> Panada
115 <1973–74_Whitbread_Round_the_World_Race> Pascoli
116 <1973–74_Whitbread_Round_the_World_Race> Perlicki
117 <1973–74_Whitbread_Round_the_World_Race> Pienkawa
118 <1985–86_Whitbread_Round_the_World_Race> Péan
119 <1981–82_Whitbread_Round_the_World_Race> Rietschoten
120 <1977–78_Whitbread_Round_the_World_Race> Rietschoten
121 <1981–82_Whitbread_Round_the_World_Race> Stampi
122 <1981–82_Whitbread_Round_the_World_Race> Tabarly
123 <1985–86_Whitbread_Round_the_World_Race> Tabarly
124 <1989–90_Whitbread_Round_the_World_Race> Tabarly
125 <1993–94_Whitbread_Round_the_World_Race> Tabarly
126 <1973–74_Whitbread_Round_the_World_Race> Tabarly
127 <1981–82_Whitbread_Round_the_World_Race> Versluys
128 <1985–86_Whitbread_Round_the_World_Race> Versluys
129 <1981–82_Whitbread_Round_the_World_Race> Viant
130 <1977–78_Whitbread_Round_the_World_Race> Viant
131 <1973–74_Whitbread_Round_the_World_Race> Viant
132 <1985–86_Whitbread_Round_the_World_Race> Visiers
133 <1989–90_Whitbread_Round_the_World_Race> Wilkeri
134 <1985–86_Whitbread_Round_the_World_Race> Zehender-Mueller

135 rows × 2 columns

time: 576 ms

Looks like it has to do with sailing. It's not an area that I know much about, so I'll transclude the page describing one of the topics from Wikipedia so we can understand it.

In [17]:
from bs4 import BeautifulSoup
from IPython.display import display, HTML
from uritools import urijoin

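def transclude(file):
    # Parse a locally saved copy of the Wikipedia page
    with open(file,"rt",encoding="utf8") as fp:
        soop = BeautifulSoup(fp,"html5lib")
    # Make relative links absolute; skip anchors that have no href attribute
    for a in soop.find_all("a",href=True):
        a["href"]=urijoin("http://en.wikipedia.org/",a["href"])
    return HTML(str(soop.body))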
time: 115 ms
In [18]:
transclude("The_Race.html")
Out[18]:

The Race (yachting race)

From Wikipedia, the free encyclopedia

The Race was a round-the-world sailing race starting in Barcelona, Spain on December 31, 2000. It was the first ever non-stop, no-rules, no-limits, round-the-world sailing event, with a $2 million US prize. It was organized by Bruno Peyron.

The stated objectives of this race were:

  • to unite the different maritime cultures of the world
  • to gather together the world's premiere yachtsmen and women in a common event
  • to promote creativity in ocean sailing
  • to ally high technology and the environment
  • to create the most spectacular and most prestigious fleet of offshore racers that sailing has ever seen

A second race was planned for 2004, but was cancelled amid controversy that Tracy Edwards had organized a competing event called Oryx Quest.

Results

The 2000–01 race was won by Club Med, skippered by Grant Dalton in 62d 6h 56' 33".

Pos Boat Crew Country Time
1 Club Med Dalton, Grant Grant Dalton  New Zealand 62d 6h 56m 33s
2 Innovation Explorer Peyron, Loick Loick Peyron & Skip Novak  France 64d 22h 32m 38s
3 Team Adventure Lewis, Cam Cam Lewis  United States 82d 20h 21m 02s
4 Warta Polpharma Paszke, Roman Roman Paszke  Poland 99d 12h 31m
5 Team Legato Bullimore, Tony Tony Bullimore  Great Britain 104d 20h 52m
- PlayStation Fossett, Steve Steve Fossett  United States DNF[a]
- Team Philips Goss, Pete Pete Goss  Great Britain DNS
  1. ^ Damaged and forced to withdraw on day 16

Legend: DNF – Did not finish; DNS – Did not start;

time: 47 ms

That Wikipedia page is pretty informative, so let's see what facts are in DBpedia concerning "The Race".

Because I set the base_uri when I created the endpoint object, DBpedia resources (which largely correspond to Wikipedia pages) can be easily written using angle brackets. It would be tempting to create a namespace for them, but it turns out that SPARQL and Turtle let you write a wider range of characters inside angle brackets than in a prefixed name. In particular, the parentheses in <The_Race_(yachting_race)> are legal, but dbpedia:The_Race_(yachting_race) is not allowed!
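When generating queries from a list of resource names, a cautious approach is to use the prefixed form only for names that are obviously safe and fall back to angle brackets otherwise. The sketch below is deliberately more conservative than the real grammar (SPARQL 1.1 and Turtle 1.1 accept more characters in a prefixed name, and allow backslash escapes for some others). The function name as_term and the dbpedia prefix are assumptions for illustration.

```python
import re

# Deliberately conservative: ASCII letters, digits and underscores only.
SIMPLE_LOCAL_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def as_term(local_name, prefix="dbpedia"):
    """Write a resource name as a prefixed name if clearly safe, else in angle brackets."""
    if SIMPLE_LOCAL_NAME.match(local_name):
        return f"{prefix}:{local_name}"
    # Relative IRI in angle brackets, resolved against the base URI by the parser.
    return f"<{local_name}>"


print(as_term("Alan_Alda"))
print(as_term("The_Race_(yachting_race)"))
```

Note that angle brackets are not a free-for-all either: characters such as spaces, quotes and curly braces are still not permitted inside them.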

In [19]:
pd.options.display.max_rows=99
endpoint.select("""
    SELECT ?p ?o  {<The_Race_(yachting_race)> ?p ?o  }
""")
Out[19]:
p o
0 rdf:type on:SportsEvent
1 rdf:type http://dbpedia.org/class/yago/WikicatSailingRaces
2 rdf:type http://dbpedia.org/class/yago/WikicatSportsCompetitionsInSpain
3 rdf:type http://dbpedia.org/class/yago/Abstraction100002137
4 rdf:type http://dbpedia.org/class/yago/Contest107456188
5 rdf:type http://dbpedia.org/class/yago/Event100029378
6 rdf:type http://dbpedia.org/class/yago/PsychologicalFeature100023100
7 rdf:type http://dbpedia.org/class/yago/Race107472657
8 rdf:type http://dbpedia.org/class/yago/SocialEvent107288639
9 rdf:type http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity
10 rdfs:label The Race (yachting race)
11 rdfs:label The Race
12 rdfs:label The Race
13 rdfs:label The Race (vela)
14 rdfs:label The Race
15 rdfs:label The Race
16 rdfs:comment The Race : No Limit Around The World est une épreuve sportive imaginée et créée par Bruno Peyron...
17 rdfs:comment The Race war eine Hochseeregatta, die 2000/2001 auf Mehrrumpf-Segelyachten der G-Class - d. h. u...
18 rdfs:comment The Race : No Limit Around The World è stata una corsa a vela immaginata e creata da Bruno Peyro...
19 rdfs:comment The Race (fr. La Course du Millénaire) – regaty dookoła świata bez zawijania do portu, które odb...
20 rdfs:comment The Race was a round-the-world sailing race starting in Barcelona, Spain on December 31, 2000. I...
21 rdfs:comment The Race was de eerste non-stop, no-limits wedstrijd rond de wereld die startte op 31 december 2...
22 http://www.w3.org/2002/07/owl#sameAs http://www.wikidata.org/entity/Q1130959
23 http://www.w3.org/2002/07/owl#sameAs http://de.dbpedia.org/resource/The_Race
24 http://www.w3.org/2002/07/owl#sameAs http://fr.dbpedia.org/resource/The_Race
25 http://www.w3.org/2002/07/owl#sameAs http://it.dbpedia.org/resource/The_Race_(vela)
26 http://www.w3.org/2002/07/owl#sameAs http://pl.dbpedia.org/resource/The_Race
27 http://www.w3.org/2002/07/owl#sameAs http://wikidata.dbpedia.org/resource/Q1130959
28 http://www.w3.org/2002/07/owl#sameAs http://nl.dbpedia.org/resource/The_Race
29 http://www.w3.org/2002/07/owl#sameAs http://rdf.freebase.com/ns/m.070k42
30 http://www.w3.org/2002/07/owl#sameAs http://yago-knowledge.org/resource/The_Race_(yachting_race)
31 http://purl.org/dc/terms/subject <Category:Yachting_races>
32 http://purl.org/dc/terms/subject <Category:Round-the-world_sailing_competitions>
33 http://purl.org/dc/terms/subject <Category:2000_in_sailing>
34 on:wikiPageID 2280786
35 on:wikiPageRevisionID 660377883
36 on:wikiPageExternalLink http://www.cat-alist.com/therace.htm
37 http://xmlns.com/foaf/0.1/isPrimaryTopicOf http://en.wikipedia.org/wiki/The_Race_(yachting_race)
38 http://www.w3.org/ns/prov#wasDerivedFrom http://en.wikipedia.org/wiki/The_Race_(yachting_race)?oldid=660377883
39 on:abstract The Race : No Limit Around The World est une épreuve sportive imaginée et créée par Bruno Peyron...
40 on:abstract The Race was a round-the-world sailing race starting in Barcelona, Spain on December 31, 2000. I...
41 on:abstract The Race war eine Hochseeregatta, die 2000/2001 auf Mehrrumpf-Segelyachten der G-Class - d. h. u...
42 on:abstract The Race : No Limit Around The World è stata una corsa a vela immaginata e creata da Bruno Peyro...
43 on:abstract The Race (fr. La Course du Millénaire) – regaty dookoła świata bez zawijania do portu, które odb...
44 on:abstract The Race was de eerste non-stop, no-limits wedstrijd rond de wereld die startte op 31 december 2...
45 pr:nation FRA
46 pr:nation POL
47 pr:nation USA
48 pr:nation GBR
49 pr:nation NZL
50 pr:nationality yes
51 pr:pos 1
52 pr:pos 2
53 pr:pos 3
54 pr:pos 4
55 pr:pos 5
56 pr:pos
57 pr:time yes
58 pr:time DNF
59 pr:time 5381793.0
60 pr:time 5610758.0
61 pr:time 7158062.0
62 pr:time 8598660.0
63 pr:time 9060720.0
64 pr:time DNS
65 pr:boatname <Warta_Polpharma>
66 pr:boatname <Team_Philips>
67 pr:boatname <PlayStation_(yacht)>
68 pr:boatname <Doha_2006_(yacht)>
69 pr:boatname <Innovation_Explorer>
70 pr:boatname <Team_Adventure>
71 pr:boatname <Team_Legato>
72 pr:boatname yes
73 pr:dnf yes
74 pr:dns yes
75 pr:skipper yes
76 pr:skipper Skip Novak
77 pr:skipperfirstname Grant
78 pr:skipperfirstname Roman
79 pr:skipperfirstname Tony
80 pr:skipperfirstname Steve
81 pr:skipperfirstname Pete
82 pr:skipperfirstname Cam
83 pr:skipperfirstname Loick
84 pr:skipperlastname Lewis
85 pr:skipperlastname Dalton
86 pr:skipperlastname Bullimore
87 pr:skipperlastname Fossett
88 pr:skipperlastname Goss
89 pr:skipperlastname Paszke
90 pr:skipperlastname Peyron
91 http://purl.org/linguistics/gold/hypernym <Race>
time: 531 ms

What's the story here? Cells from the table have been converted into facts, but the order of the facts has been scrambled. We know that one of the boats finished in "5381793.0" seconds, and we know there was a boat named "Warta_Polpharma" and so forth, but we don't know which boats finished, which boat finished in what time, which boat had what skipper, etc.
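To see why the scrambling loses information, consider this small sketch. It flattens the first three rows of the results table into a set of triples attached directly to the race, then does the same for a version of the table with two skippers swapped. The flatten function and the dictionary layout are my own illustration, not how DBpedia's extractor is written.

```python
# First three rows of the results table, keyed by the property names seen in DBpedia.
rows = [
    {"pos": 1, "boatname": "Club_Med", "skipperlastname": "Dalton"},
    {"pos": 2, "boatname": "Innovation_Explorer", "skipperlastname": "Peyron"},
    {"pos": 3, "boatname": "Team_Adventure", "skipperlastname": "Lewis"},
]


def flatten(subject, rows):
    """Attach every cell directly to the subject, forgetting which row it came from."""
    return {(subject, f"pr:{key}", value) for row in rows for key, value in row.items()}


race = "<The_Race_(yachting_race)>"
triples = flatten(race, rows)

# A different table: the skippers of the first and third boats are swapped.
swapped = [
    dict(rows[0], skipperlastname="Lewis"),
    rows[1],
    dict(rows[2], skipperlastname="Dalton"),
]

# Both tables flatten to exactly the same set of triples.
print(flatten(race, swapped) == triples)
```

Two different tables produce identical facts, so there is no way to recover the original rows from the flattened form.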

This is not a limitation of RDF, but it is a common limitation of RDF-based systems in the "Linked Data" era, and it's historically been a problem in RDF.

The basic problem is that if we want to write a statement like the one on the first row of the HTML table, we end up having to write something like

<Some_Node>
  pr:pos 1 ;
  pr:boat <Club_Med> ;
  pr:skipper <Grant_Dalton> ;
  pr:nation "NZL" ;
  pr:time 5381793.0 .

<The_Race_(yachting_race)> pr:entry <Some_Node> .

The only hard part is determining a name for <Some_Node>. In the case of DBpedia, names are derived from URIs in Wikipedia, a formula that doesn't apply when we're talking about a concept that doesn't have a URI in Wikipedia. We can duck the problem of assigning a name by using a blank node (which states a node exists without giving a specific name) but that causes problems of its own which come from the difficulty of having something nameless in a distributed system. (What if I want to talk about a nameless entity that exists in DBpedia?)

For specific problems, it's possible, and often straightforward, to find ways to name nodes like <Some_Node>. However, it is hard to find a solution that pleases everybody, particularly when we are talking about a system which is decentralized, in which people would like names to be stable over time, etc.
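As one example of a problem-specific naming scheme, the sketch below names each entry after the race plus the finishing position and writes out Turtle in the shape shown earlier. The scheme, the function name entry_turtle and the pr:entry predicate are all hypothetical. It already breaks down for the boats that did not finish, which have no position, and that is a fair illustration of why no single scheme pleases everybody.

```python
def entry_turtle(race, position, properties):
    """Serialize one table row as Turtle.

    `properties` is a list of (predicate, object) pairs whose objects are
    already written as Turtle terms, e.g. '"NZL"' or '<Club_Med>'.
    """
    # Hypothetical naming scheme: the race name plus the finishing position.
    node = f"<{race}__{position}>"
    body = " ;\n".join(f"  {predicate} {obj}" for predicate, obj in properties)
    return f"{node}\n{body} .\n\n<{race}> pr:entry {node} .\n"


turtle = entry_turtle(
    "The_Race_(yachting_race)",
    1,
    [
        ("pr:pos", "1"),
        ("pr:boat", "<Club_Med>"),
        ("pr:skipper", "<Grant_Dalton>"),
        ("pr:nation", '"NZL"'),
        ("pr:time", "5381793.0"),
    ],
)
print(turtle)
```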

With conflicting demands, it's no wonder that this area has not been standardized by the W3C, but it's great to see that DBpedia is making some progress in this area, which I'll show in the next section.

Classes

Note that I started this analysis by looking first at the most commonly used predicates. If I were looking at a SQL database, this would be like looking at a list of columns first, and if I were looking at an object-oriented program, it would be like looking at a list of methods and fields.

It would be much more common to look at tables first in SQL or classes first in Java, but RDF is different from SQL and Java, and it often makes sense to look at properties first.

For one thing, it is possible to write properties without defining any classes or categories, that is, the RDF statement

<SomeTopic> :hasNumber 1023 .

is self-sufficient and meaningful without knowledge that <SomeTopic> is a particular kind of topic. Thus, properties are more fundamental.

More practically, people get into more trouble with classes than they do with properties. Part of it is that people tend to argue more about classes (ex. can a video game be art?) than they do about properties (ex. "Hideo Kojima was the director of Metal Gear Solid"). In the case of DBpedia, one problem is the sheer number of categories:

In [20]:
types=endpoint.select("""
    SELECT ?type (COUNT(*) AS ?count) { ?s a ?type .} GROUP BY ?type ORDER BY DESC(?count)
""")
types
Out[20]:
count
type
http://xmlns.com/foaf/0.1/Document 12856178
http://www.w3.org/2002/07/owl#Thing 5044222
on:Image 2897004
http://dbpedia.org/class/yago/PhysicalEntity100001930 2822488
http://dbpedia.org/class/yago/Object100002684 2720458
http://dbpedia.org/class/yago/YagoLegalActorGeo 2190190
http://dbpedia.org/class/yago/Whole100003553 2061271
on:Person 1818074
http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity 1654844
http://dbpedia.org/class/yago/YagoLegalActor 1548330
on:Agent 1546264
http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Agent 1529881
http://www.wikidata.org/entity/Q24229398 1529881
http://www.w3.org/2004/02/skos/core#Concept 1475015
http://dbpedia.org/class/yago/LivingThing100004258 1366065
http://dbpedia.org/class/yago/Organism100004475 1365758
http://www.wikidata.org/entity/Q215627 1243399
http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#NaturalPerson 1243399
http://schema.org/Person 1243399
http://xmlns.com/foaf/0.1/Person 1243399
http://www.wikidata.org/entity/Q5 1243399
http://dbpedia.org/class/yago/CausalAgent100007347 1229049
http://dbpedia.org/class/yago/Person100007846 1216438
on:TimePeriod 1127706
http://dbpedia.org/class/yago/Abstraction100002137 1079890
http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing 996625
http://dbpedia.org/class/yago/YagoGeoEntity 989272
on:CareerStation 977023
on:Place 881597
http://schema.org/Place 839987
on:Location 839987
http://dbpedia.org/class/yago/Artifact100021939 689249
http://dbpedia.org/class/yago/WikicatLivingPeople 659092
http://dbpedia.org/class/yago/Location100027167 644115
http://dbpedia.org/class/yago/Region108630985 620586
on:Settlement 581293
on:PopulatedPlace 516747
on:Work 508099
http://dbpedia.org/class/yago/District108552138 497592
http://schema.org/CreativeWork 496070
http://www.wikidata.org/entity/Q386724 496070
http://www.wikidata.org/entity/Q486972 478906
http://dbpedia.org/class/yago/AdministrativeDistrict108491826 468228
http://dbpedia.org/class/yago/GeographicalArea108574314 455794
http://dbpedia.org/class/yago/Group100031264 421838
http://dbpedia.org/class/yago/PsychologicalFeature100023100 407890
on:Athlete 392672
http://dbpedia.org/class/yago/SocialGroup107950920 373140
on:Organisation 352081
... ...
http://dbpedia.org/class/yago/WikicatCompaniesEstablishedIn1969 257
http://dbpedia.org/class/yago/Emotion107480068 257
http://dbpedia.org/class/yago/WikicatU.S.ArmyAll-AmericanBowlFootballPlayers 257
http://dbpedia.org/class/yago/WikicatEnglishRockDrummers 257
http://dbpedia.org/class/yago/WikicatMunicipalitiesOfTheProvinceOfZamora 257
http://dbpedia.org/class/yago/WikicatPeopleFromYokohama 257
http://dbpedia.org/class/yago/WikicatMexicanMaleFilmActors 257
http://dbpedia.org/class/yago/WikicatMexicanLawyers 257
http://dbpedia.org/class/yago/WikicatPublicHighSchoolsInMassachusetts 257
http://dbpedia.org/class/yago/WikicatŠKSlovanBratislavaPlayers 257
http://dbpedia.org/class/yago/WikicatAmericanFootballLeagueAll-StarPlayers 257
http://dbpedia.org/class/yago/WikicatAmericanRadioSportsAnnouncers 257
http://dbpedia.org/class/yago/WikicatAmericanLiteraryAwards 257
http://dbpedia.org/class/yago/WikicatBritishComedians 257
http://dbpedia.org/class/yago/WikicatEnglishTheatreDirectors 256
http://dbpedia.org/class/yago/WikicatVillagesInBlagoevgradProvince 256
http://dbpedia.org/class/yago/WikicatGreekPoliticians 256
http://dbpedia.org/class/yago/WikicatK.A.A.GentPlayers 256
http://dbpedia.org/class/yago/WikicatPeopleFromBratislava 256
http://dbpedia.org/class/yago/WikicatFantasy-comedyFilms 256
http://dbpedia.org/class/yago/WikicatMolecularBiologists 256
http://dbpedia.org/class/yago/Waterway109476331 256
http://dbpedia.org/class/yago/WikicatElvisPresleySongs 256
http://dbpedia.org/class/yago/WikicatRadioStationsEstablishedIn1947 256
http://dbpedia.org/class/yago/WikicatBrazilianExpatriatesInPortugal 256
http://dbpedia.org/class/yago/WikicatWarnerBros.RecordsArtists 256
http://dbpedia.org/class/yago/SearchEngine106578654 256
http://dbpedia.org/class/yago/WikicatEducationalInstitutionsEstablishedIn1954 256
http://dbpedia.org/class/yago/Smith110614629 256
http://dbpedia.org/class/yago/Body109224911 256
http://dbpedia.org/class/yago/WikicatCommunesOfCameroon 256
http://dbpedia.org/class/yago/WikicatRealMurciaFootballers 256
http://dbpedia.org/class/yago/WikicatMaleWestern(genre)FilmActors 256
http://dbpedia.org/class/yago/WikicatCitiesInLuxembourg 256
http://dbpedia.org/class/yago/Apostle109799461 256
http://dbpedia.org/class/yago/WikicatCharitiesBasedInLondon 256
http://dbpedia.org/class/yago/WikicatEducationalInstitutionsEstablishedIn1991 256
http://dbpedia.org/class/yago/WikicatDCComicsCharactersWithSuperhumanStrength 256
http://dbpedia.org/class/yago/WikicatVillagesInBielskCounty 256
http://dbpedia.org/class/yago/WikicatDisneyChannelShows 256
http://dbpedia.org/class/yago/WikicatCompaniesOfChina 256
http://dbpedia.org/class/yago/WikicatRomanianEssayists 256
http://dbpedia.org/class/yago/WikicatShelbourneF.C.Players 256
http://dbpedia.org/class/yago/WikicatEnglishWomenPoets 256
http://dbpedia.org/class/yago/Polyhedron113883885 256
http://dbpedia.org/class/yago/WikicatSwissFederalRailwaysStations 256
http://dbpedia.org/class/yago/WikicatAcademicsOfImperialCollegeLondon 256
http://dbpedia.org/class/yago/WikicatPopulatedPlacesInTheBoucleDuMouhounRegion 256
http://dbpedia.org/class/yago/WikicatPeopleFromSaxony 256

10000 rows × 1 columns

time: 16.5 s
In [21]:
endpoint.select("""
    SELECT (COUNT(*) AS ?count) { SELECT DISTINCT ?type { ?s a ?type .} }
""")
Out[21]:
count
0 483605
time: 287 ms

On average, that's nearly eight classes for every property!

DBpedia, it turns out, contains many types from YAGO, which are in turn generated from Wikipedia categories and other data sources. Many of these classes, such as yago:WikicatPeopleFromYokohama and yago:WikicatMexicanMaleFilmActors, are members of very large families that include "People from Lanzarote", "Brazilian female professional wrestlers" and so on. Two common patterns are:

  1. Restriction types: One could name "People from Yokohama" as a class, and ask for instances of that class. Alternatively, one could query for people for whom the property "comes from" has the value "Yokohama". A class whose membership is determined by property values is a "restriction type".
  2. Intersection types: "Mexican Person" is a class, "Male Person" is a class, "Film Actor" is a class. The set of topics which are members of all of those classes is "Mexican Male Film Actors".

As you can say the same things with or without restriction and intersection types, it is a case-by-case decision as to whether to use them or to compose them from other elements. What is clear, in this case, is that there are so many realized restriction and intersection types from YAGO that they get in the way of seeing what kinds of things are talked about in DBpedia.
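The two patterns can be sketched with ordinary Python sets. The topics, class names and the comesFrom property below are made up for illustration. The point is only that a restriction type is a filter on a property value, and an intersection type is membership in several classes at once.

```python
# Made-up topics for illustration; none of these are real DBpedia resources.
topics = {
    "Topic_A": {"types": {"MexicanPerson", "MalePerson", "FilmActor"}, "comesFrom": "Guadalajara"},
    "Topic_B": {"types": {"MalePerson", "FilmActor"}, "comesFrom": "Yokohama"},
    "Topic_C": {"types": {"MexicanPerson", "Lawyer"}, "comesFrom": "Yokohama"},
}


def restriction(topics, prop, value):
    """Members of a restriction type: topics whose property has the given value."""
    return {name for name, facts in topics.items() if facts.get(prop) == value}


def intersection(topics, *classes):
    """Members of an intersection type: topics that belong to every listed class."""
    return {name for name, facts in topics.items() if set(classes) <= facts["types"]}


print(restriction(topics, "comesFrom", "Yokohama"))
print(intersection(topics, "MexicanPerson", "MalePerson", "FilmActor"))
```

Either set could be given its own class name, as YAGO does, or computed on demand from the underlying facts.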

An easy "set of blinders" to use here is to look only at types that are in the DBpedia Ontology namespace. Rather than write a new SPARQL query, I use the filtering operator in Pandas to pick out common types from the DBpedia Ontology.

In [22]:
types[types.index.str.startswith('on:')]
Out[22]:
count
type
on:Image 2897004
on:Person 1818074
on:Agent 1546264
on:TimePeriod 1127706
on:CareerStation 977023
on:Place 881597
on:Location 839987
on:Settlement 581293
on:PopulatedPlace 516747
on:Work 508099
on:Athlete 392672
on:Organisation 352081
on:SportsTeamMember 318735
on:OrganisationMember 318392
on:Species 306833
on:Eukaryote 302686
on:Village 231103
on:Animal 230175
on:MusicalWork 209142
on:ArchitecturalStructure 203065
on:PersonFunction 171413
on:SoccerPlayer 151207
on:Album 147917
on:Insect 146657
on:Film 129980
on:Building 123567
on:Company 109629
on:Infrastructure 92281
on:Artist 82757
on:Event 77583
on:OfficeHolder 71753
on:MusicalArtist 71014
on:TelevisionShow 70690
on:Station 69525
on:WrittenWork 69112
on:NaturalPlace 67863
on:Band 67831
on:Single 67150
on:Book 64239
on:Plant 62543
on:SocietalEvent 60321
on:MeanOfTransportation 59078
on:EducationalInstitution 55860
on:SportsSeason 55732
on:Software 52743
on:School 45445
on:Town 45415
on:SportsTeam 44978
on:BodyOfWater 44204
... ...
on:DartsPlayer 587
on:RugbyLeague 584
on:Chef 584
on:Winery 569
on:Jockey 556
on:MusicFestival 547
on:Skater 545
on:VoiceActor 543
on:Presenter 532
on:TableTennisPlayer 524
on:LawFirm 514
on:Rocket 501
on:Medician 501
on:FloweringPlant 499
on:AustralianFootballTeam 492
on:Moss 486
on:CyclingTeam 482
on:LacrossePlayer 482
on:SumoWrestler 473
on:Bodybuilder 472
on:SnookerPlayer 465
on:Photographer 459
on:Canoeist 458
on:AmateurBoxer 448
on:RoadJunction 448
on:Entomologist 437
on:Artery 428
on:SquashPlayer 422
on:Nerve 415
on:Racecourse 410
on:Pope 407
on:HandballTeam 393
on:GreenAlga 391
on:SolarEclipse 380
on:Database 363
on:RadioHost 359
on:Muscle 347
on:HorseTrainer 330
on:ClassicalMusicArtist 322
on:RoadTunnel 314
on:Poet 308
on:IceHockeyLeague 304
on:Brewery 292
on:Rower 279
on:BaseballSeason 275
on:PlayboyPlaymate 274
on:RaceTrack 269
on:NetballPlayer 263
on:CricketGround 260

388 rows × 1 columns

time: 30 ms
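
The same Pandas filter can be tried without touching the endpoint. Here is a minimal sketch on a toy frame; the two on: counts are copied from the table above, while the http://example.org/Other row is invented purely to have something to filter out.

```python
import pandas as pd

# Toy stand-in for the `types` frame; the middle row is invented for illustration.
types = pd.DataFrame(
    {"count": [2897004, 1234, 1818074]},
    index=pd.Index(
        ["on:Image", "http://example.org/Other", "on:Person"], name="type"
    ),
)

# .str.startswith builds a boolean mask over the index labels;
# indexing the frame with that mask keeps only the matching rows.
mask = types.index.str.startswith("on:")
ontology_types = types[mask]
print(ontology_types)
```

Filtering on the client like this only works because the full result already fits in memory; for a larger result I would push the filter into the SPARQL query instead.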

on:Image catches my eye, so I look at a few examples and pick one out.

In [23]:
endpoint.select("""
    SELECT ?that { 
        ?that a on:Image
    } LIMIT 10
""")
Out[23]:
that
0 http://en.wikipedia.org/wiki/Special:FilePath/Alfred_Schütz.jpg
1 http://en.wikipedia.org/wiki/Special:FilePath/Aromas.JPG
2 http://en.wikipedia.org/wiki/Special:FilePath/Baldwin_Park_CA_logo.jpg
3 http://en.wikipedia.org/wiki/Special:FilePath/Bayfield,CO.jpg
4 http://en.wikipedia.org/wiki/Special:FilePath/Bennettcoskyline.JPG
5 http://en.wikipedia.org/wiki/Special:FilePath/Bitterspring.jpg
6 http://en.wikipedia.org/wiki/Special:FilePath/Boulder_Creek.jpg
7 http://en.wikipedia.org/wiki/Special:FilePath/BrandonFL.gif
8 http://en.wikipedia.org/wiki/Special:FilePath/Buttonwillow.jpg
9 http://en.wikipedia.org/wiki/Special:FilePath/CarberryBookplate.jpg
time: 301 ms
In [24]:
HTML('<img src="{0}">'.format(_.at[0,'that']))
Out[24]:
time: 4 ms

These "topics" are what I would call "non-topic topics" in the sense that they are the subject of a statement, but not an actual "thing in the world" described by the knowledge base. (Wikipedia documents the outside world primarily, and only secondarily has a metadata catalog for items that are in it.)

The following query finds "plain ordinary topics"

In [25]:
endpoint.select("""
    SELECT ?that { 
        ?that a on:Person
    } LIMIT 10
""")
Out[25]:
that
0 <Andreas_Ekberg>
1 <Danilo_Tognon>
2 <Lorine_Livington_Pruette>
3 <Megan_Lawrence>
4 <Nikolaos_Ventouras>
5 <Peter_Ceffons>
6 <Sani_ol_molk>
7 <Siniša_Žugić>
8 <Strength_athlete>
9 <Trampolino_Gigante_Corno_d'Aola>
time: 303 ms

"Andres_Ekberg" is a shorthand for <http://dbpedia.org/resource/Andreas_Ekberg> which is parallel to the Wikipedia page at <http://en.wikipedia.org/wiki/Andres_Ekberg>. The select() method shows just "Andreas_Ekberg" because I registered <http://dbpedia.org/resource/> as the base URI of this endpoint when I created the endpoint object way back at the beginning of this notebook.

What most people would think of as "topics" in DBpedia live in the <http://dbpedia.org/resource/> namespace.
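
To make the base URI idea concrete, here is a rough sketch of the kind of shortening involved. The shorten function is a hypothetical helper I made up for illustration; it is not how gastrodon actually does it, only an approximation of the visible effect.

```python
BASE_URI = "http://dbpedia.org/resource/"


def shorten(uri, base=BASE_URI):
    """Hypothetical helper: show a URI relative to `base` when it lives under it."""
    if uri.startswith(base):
        # Keep only the part after the base and wrap it in angle brackets.
        return "<" + uri[len(base):] + ">"
    # Anything outside the base namespace is left as a full URI.
    return uri


print(shorten("http://dbpedia.org/resource/Andreas_Ekberg"))
print(shorten("http://en.wikipedia.org/wiki/Special:FilePath/Aromas.JPG"))
```

This is why the on:Image results earlier came back as full URIs: they live outside the <http://dbpedia.org/resource/> namespace.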

Another common kind of topic in DBpedia is the on:Agent:

In [26]:
endpoint.select("""
    SELECT ?that { 
        ?that a on:Agent
    } LIMIT 10
""")
Out[26]:
that
0 <3Com>
1 <7-Eleven>
2 <A._C._Bhaktivedanta_Swami_Prabhupada>
3 <Aardman_Animations>
4 <Aaron_Burr>
5 <Abbie_Hoffman>
6 <About.com>
7 <Abraham_Robinson>
8 <Abraham_de_Moivre>
9 <Academy_of_Motion_Picture_Arts_and_Sciences>
time: 313 ms

The "Agent" concept is connected with the shared attributes of individuals and organizations; I like to think that an "Agent" is something that can be the originator or recipient of a communication. If I remove people using the MINUS operator, only organizations remain.

In [27]:
endpoint.select("""
    SELECT ?that { 
        ?that a on:Agent
        MINUS {?that a on:Person}
    } LIMIT 10
""")
Out[27]:
that
0 <3Com>
1 <7-Eleven>
2 <Aardman_Animations>
3 <About.com>
4 <Academy_of_Motion_Picture_Arts_and_Sciences>
5 <Acorn_Computers>
6 <Activision>
7 <Ad_Lib,_Inc.>
8 <Adnams_Brewery>
9 <Aermacchi>
time: 300 ms
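
For a simple case like this one, MINUS behaves much like a set difference. The sketch below imitates it with Python sets, using a handful of names from the two result lists above; treating Aaron_Burr and Abbie_Hoffman as the people is my reading of those lists, and the analogy ignores the finer points of how SPARQL matches variables.

```python
# A few names taken from the on:Agent sample above.
agents = {"3Com", "7-Eleven", "Aaron_Burr", "Abbie_Hoffman", "Aardman_Animations"}

# The agents that dropped out once on:Person was subtracted.
persons = {"Aaron_Burr", "Abbie_Hoffman"}

# Set difference: keep every agent that is not also a person.
non_person_agents = sorted(agents - persons)
print(non_person_agents)
```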

Unlike the classes I've shown so far, an on:TimePeriod can be either a topic or a non-topic. Asking for just 10 time periods, I find that some of them correspond to calendar years:

In [28]:
endpoint.select("""
    SELECT ?that { 
        ?that a on:TimePeriod
    } LIMIT 10
""")
Out[28]:
that
0 <1>
1 <10>
2 <100>
3 <1000>
4 <1001>
5 <1002>
6 <1003>
7 <1004>
8 <1005>
9 <1006>
time: 302 ms
In [29]:
 transclude("1004.html")
Out[29]:

1004

From Wikipedia, the free encyclopedia

Year 1004 (MIV) was a leap year starting on Saturday (link will display the full calendar) of the Julian calendar.

Events

By place

Africa

Asia

Europe


Births

Deaths

time: 44 ms

If, however, I restrict the query to topics whose names start with the letter A (which eliminates the ones that start with a number), the query returns a large number of non-topics. Even though these resources are in the <http://dbpedia.org/resource/> namespace, they don't have corresponding Wikipedia pages.

In [30]:
endpoint.select("""
    SELECT ?that { 
        ?that a on:TimePeriod .
        FILTER(STRSTARTS(STR(?that),"http://dbpedia.org/resource/A"))
    } LIMIT 10
""")
Out[30]:
that
0 <A._M._A._Azeez__1>
1 <A._R._Colquhoun__1>
2 <Abbie_Wolanow__1>
3 <Abbie_Wolanow__2>
4 <Abbie_Wolanow__3>
5 <Abbie_Wolanow__4>
6 <Abbie_Wolanow__5>
7 <Abdul_Wahab_Khan__1>
8 <Adam_Wolanin__1>
9 <Adam_Wolanin__2>
time: 420 ms
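
Every name in this sample ends in a double underscore followed by a number. Assuming that convention holds (I have only seen it in this sample, so it is a guess rather than a documented rule), a small regular expression can split such a name into what looks like a parent topic and an index. The split_station helper is hypothetical.

```python
import re

# Assumed naming convention: <parent topic>__<number>
STATION_NAME = re.compile(r"^(?P<topic>.+)__(?P<index>\d+)$")


def split_station(local_name):
    """Return (topic, index) if the name follows the assumed pattern, else None."""
    match = STATION_NAME.match(local_name)
    if match is None:
        return None
    return match.group("topic"), int(match.group("index"))


print(split_station("Abbie_Wolanow__5"))
print(split_station("1004"))
```

A name check like this is only a heuristic; the reliable way to tell what a resource is remains looking at its triples.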

Let's take a closer look. It seems that this record describes a time that a soccer player spent playing for a team (although unfortunately it doesn't say when this time began or ended):

In [31]:
endpoint.select("""
    BASE <http://dbpedia.org/resource/>
    SELECT ?p ?o { 
        <Abbie_Wolanow__1> ?p ?o .
    }
""")
Out[31]:
p o
0 rdf:type http://www.w3.org/2002/07/owl#Thing
1 rdf:type on:CareerStation
2 rdf:type on:TimePeriod
3 on:team <Hapoel_Tel_Aviv_F.C.>
time: 290 ms

This record is more complete, and shows how the career record can be linked to a time, as well as information about how the player performed:

In [32]:
endpoint.select("""
    BASE <http://dbpedia.org/resource/>
    SELECT ?p ?o { 
        <Abbie_Wolanow__5> ?p ?o .
    }
""")
Out[32]:
p o
0 rdf:type http://www.w3.org/2002/07/owl#Thing
1 rdf:type on:CareerStation
2 rdf:type on:TimePeriod
3 on:numberOfGoals 0
4 on:numberOfMatches 1
5 on:team <United_States_men's_national_soccer_team>
6 on:years 1961-01-01
time: 319 ms

Going to the right of the career station (finding objects for which it is the subject) we see the team, but we don't see the player. Going to the left, however (finding subjects for which it is the object) we see the player.

In [33]:
endpoint.select("""
    BASE <http://dbpedia.org/resource/>
    SELECT ?s ?p  { 
        ?s ?p <Abbie_Wolanow__5> .
    }
""")
Out[33]:
s p
0 <Abbie_Wolanow> on:careerStation
time: 308 ms

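Thus this fragment of the RDF graph is a short chain: <Abbie_Wolanow> points to <Abbie_Wolanow__5> through on:careerStation, and <Abbie_Wolanow__5> points to <United_States_men's_national_soccer_team> through on:team.

This is a general pattern for how one might deal with situations where we want to say something more complex than "Abbie Wolanow played for the U.S. Men's National Soccer Team".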
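
The "going right" and "going left" moves can be imitated on a plain list of triples. This is only a sketch: the four triples are copied from the query results above, and the outgoing and incoming helpers are names I made up, not part of gastrodon or any RDF library.

```python
# (subject, predicate, object) triples copied from the results above.
triples = [
    ("Abbie_Wolanow", "on:careerStation", "Abbie_Wolanow__5"),
    ("Abbie_Wolanow__5", "on:team", "United_States_men's_national_soccer_team"),
    ("Abbie_Wolanow__5", "on:numberOfMatches", 1),
    ("Abbie_Wolanow__5", "on:numberOfGoals", 0),
]


def outgoing(node):
    """Go "right": predicates and objects of triples where `node` is the subject."""
    return [(p, o) for s, p, o in triples if s == node]


def incoming(node):
    """Go "left": subjects and predicates of triples where `node` is the object."""
    return [(s, p) for s, p, o in triples if o == node]


print(outgoing("Abbie_Wolanow__5"))
print(incoming("Abbie_Wolanow__5"))
```

The career station sits in the middle: the team is reachable by going right, the player only by going left.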

In terms of the source data, career stations are much like the race entries in the yachting example, in that a single page on Wikipedia contains a number of "sub-topics" that need to be referred to in order to keep together facts such as "this boat was the third finisher" and "Cam Lewis was the skipper of this boat".

The difference is that DBpedia identifies individual career stations while it does not identify individual race entries.

Here is a survey of the different predicate types that are used to describe career stations. I was probably a bit unlucky to pick a player who didn't have on:years specified very often:

In [34]:
endpoint.select("""
    SELECT ?p (COUNT(*) AS ?count) { 
        ?that a on:CareerStation .
        ?that ?p ?o .
    } GROUP BY ?p ORDER BY DESC(?count)
""")
Out[34]:
count
p
rdf:type 2931158
on:team 941316
on:years 927710
on:numberOfGoals 647584
on:numberOfMatches 645122
on:title 12
on:filename 2
on:description 2
on:deathDate 1
on:birthDate 1
http://purl.org/dc/elements/1.1/type 1
on:country 1
http://xmlns.com/foaf/0.1/name 1
time: 351 ms
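
This predicate survey is one of the methods that carries over to any class. A minimal sketch of a query builder follows; predicate_survey_query is a hypothetical helper of my own, and it does no validation or escaping of the class name, so it should only be fed names you trust.

```python
def predicate_survey_query(class_name):
    """Build the predicate-count query used above for an arbitrary class name."""
    return (
        "SELECT ?p (COUNT(*) AS ?count) {\n"
        f"    ?that a {class_name} .\n"
        "    ?that ?p ?o .\n"
        "} GROUP BY ?p ORDER BY DESC(?count)"
    )


query = predicate_survey_query("on:CareerStation")
print(query)
```

The resulting string could then be passed to endpoint.select() in the same way as the hand-written queries in this notebook.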

What sort of people have career stations? I count career stations, grouped by the types of the person each station belongs to, and get the following results:

In [35]:
pd.options.display.max_rows=20
has_cs_types=endpoint.select("""
    SELECT ?type (COUNT(*) AS ?count) { 
        ?station a on:CareerStation .
        ?who on:careerStation ?station .
        ?who a ?type .
    } GROUP BY ?type ORDER BY DESC(?count)
""")
has_cs_types[has_cs_types.index.str.startswith("on:")]
Out[35]:
count
type
on:Person 977021
on:Agent 977021
on:SoccerPlayer 869100
on:Athlete 823540
on:SoccerManager 197693
on:SportsManager 197451
on:IceHockeyPlayer 2079
on:Building 178
on:River 178
on:AmericanFootballPlayer 116
on:Organisation 108
time: 17.1 s

Career stations seem heavily weighted towards people who play soccer! The numbers above are hard to compare to other characteristics, however, because they are counting the career stations instead of the people. For instance, Abbie Wolanow is counted five times because he has five career stations.

With a slightly different query, I can count the actual number of people of various types who have career stations.

In [38]:
has_cs_types=endpoint.select("""
    SELECT ?type (COUNT(*) AS ?count) {
        { SELECT DISTINCT ?who {
            ?station a on:CareerStation .
            ?who on:careerStation ?station .
        } }
        ?who a ?type .
    } GROUP BY ?type ORDER BY DESC(?count)
""")
has_cs_types[has_cs_types.index.str.startswith("on:")]
Out[38]:
count
type
on:Person 135887
on:Agent 135887
on:SoccerPlayer 117270
on:Athlete 117270
on:SoccerManager 18617
on:SportsManager 18617
on:IceHockeyPlayer 268
on:Building 31
on:River 28
on:AmericanFootballPlayer 19
on:Cricketer 16
on:Organisation 14
on:FictionalCharacter 13
on:SportsTeam 9
time: 8.77 s
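
The difference between the two queries is easy to see on a tiny invented data set. In this sketch the names player_a and manager_b, their stations and their types are all made up; the first count imitates the per-station query and the second imitates the query with the DISTINCT subquery.

```python
from collections import Counter

# Invented toy data: (person, station) links and the types of each person.
stations = [
    ("player_a", "player_a__1"),
    ("player_a", "player_a__2"),
    ("player_a", "player_a__3"),
    ("manager_b", "manager_b__1"),
]
types_of = {
    "player_a": ["on:Person", "on:SoccerPlayer"],
    "manager_b": ["on:Person", "on:SoccerManager"],
}

# Per-station counting: a person contributes once for every station they have.
per_station = Counter(t for who, _ in stations for t in types_of[who])

# Per-person counting: deduplicate the people first, then count their types.
people = {who for who, _ in stations}
per_person = Counter(t for who in people for t in types_of[who])

print(per_station)
print(per_person)
```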

Note that the counts here do not need to add up to anything in particular, because it is possible for someone to be in more than one category at a time. For instance, we see the same count for on:Person and on:Agent as well as on:Athlete and on:SoccerPlayer because each soccer player is an athlete. I got suspicious, however, and found that if I added the number of soccer players to the number of soccer managers...

In [39]:
18617+117270
Out[39]:
135887
time: 3 ms

... the total was exactly the number of people with career stations! That suggests that all of the people with career stations are involved with soccer, and that on:SoccerPlayer and on:SoccerManager are mutually exclusive.
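
The reasoning here is inclusion-exclusion: for two sets A and B, |A ∪ B| = |A| + |B| - |A ∩ B|. A small sketch of that arithmetic, using the numbers from the cell above, follows. It assumes that every person with a career station is a player or a manager, which is exactly the part that is only suggested, not proven; overlap_size is a helper name I made up.

```python
def overlap_size(size_a, size_b, size_union):
    """Size of the intersection implied by inclusion-exclusion."""
    return size_a + size_b - size_union


# Players, managers, and all people with career stations (numbers from the cell above).
implied_overlap = overlap_size(117270, 18617, 135887)
print(implied_overlap)
```

If some people with career stations were neither players nor managers, the union would be smaller than the total and the implied overlap would be larger than this calculation shows, so a direct query is still worth running.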

I test that mutually exclusive bit by counting the number of topics which are both soccer players and soccer managers:

In [40]:
endpoint.select("""
    SELECT (COUNT(*) AS ?count) {
        ?x a on:SoccerPlayer .
        ?x a on:SoccerManager .
    }
""")
Out[40]:
count
0 0
time: 347 ms

Those two really are mutually exclusive.

This seems strange to me. I don't know much about soccer (I am from the U.S. after all!), but in other sports coaches and team managers are frequently former players. Shouldn't they be in soccer?

I investigate just a bit more, first getting a sample of managers...

In [41]:
endpoint.select("""
    SELECT ?x {
        ?x a on:SoccerManager .
    } LIMIT 10
""") 
Out[41]:
x
0 <Alan_Shearer>
1 <Alex_Ferguson>
2 <Dennis_Bergkamp>
3 <Enzo_Scifo>
4 <Marco_van_Basten>
5 <Osvaldo_Ardiles>
6 <Ruud_Gullit>
7 <Walter_Winterbottom>
8 <Alejandro_Morera_Soto>
9 <Aleksandr_Smirnov_(footballer,_born_1968)>
time: 302 ms

... and then looking at the text description of one in particular:

In [42]:
endpoint.select("""
    SELECT ?comment  { 
       <http://dbpedia.org/resource/Alex_Ferguson> rdfs:comment ?comment .
       FILTER(LANG(?comment)='en')
    }
""").at[0,"comment"]
Out[42]:
'Sir Alexander Chapman "Alex" Ferguson, CBE (born 31 December 1941) is a former Scottish football manager and player who managed Manchester United from 1986 to 2013. He is regarded by many players, managers and analysts to be one of the greatest and most successful managers of all time.'
time: 481 ms

As I suspected, Alex Ferguson was a player who became a manager. These things are not mutually exclusive in the real world, although they are mutually exclusive in DBpedia.

It's a typical example of what you find when you look at "how things are" as opposed to "how things are supposed to be".

If it were up to me you'd be a soccer player if you'd ever played soccer and you'd be a manager if you'd ever managed a soccer team. On the other hand, I don't have my own database of thousands of soccer players (and managers!) so having to accept data in the format it is provided in is part of the price of "free" data.

Conclusion

In this article I began an investigation of data in DBpedia that particularly focused on two kinds of topics: race entries and the careers of soccer players. In the first case, information about different entries in the race is scrambled, because no subject is introduced for each entry. In the second case, DBpedia provides identifiers for "Career Stations" about which it can state facts such as what team a person played on, for what time period, and so forth.

I hope very much that the "Career Station" is the future of DBpedia because there are many other things that can be modeled very similarly such as:

  • a person's educational career
  • the work career of a person who works for multiple employers over time
  • times in which a person has been a member of a band
  • locations of a concert tour
  • results of a series of sports events
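
To make the analogy concrete, here is a rough sketch of how one of these cases, band membership, might reuse the same shape. Everything in it is invented for illustration: the Station class, the ex: predicates, and the names Some_Musician and Some_Band are hypothetical, and DBpedia does not define any of them.

```python
from dataclasses import dataclass


@dataclass
class Station:
    """Hypothetical intermediate node, modeled on the career station pattern."""
    owner: str   # the topic the station belongs to
    index: int   # position among the owner's stations
    facts: dict  # predicate -> value, all about this one station

    @property
    def name(self):
        # Same naming shape as the career stations seen above.
        return f"{self.owner}__{self.index}"

    def triples(self, link):
        # One triple links the owner to the station ...
        yield (self.owner, link, self.name)
        # ... and the remaining facts hang off the station itself.
        for predicate, value in self.facts.items():
            yield (self.name, predicate, value)


membership = Station("Some_Musician", 1, {"ex:band": "Some_Band", "ex:years": "1990"})
result = list(membership.triples("ex:bandMembership"))
for triple in result:
    print(triple)
```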

Introduced in DBpedia 3.9, "Career Station" is relatively new. Other generic databases such as Freebase and Wikidata have used mechanisms such as compound value types and qualifiers to similar effect. Let's hope that the enthusiasm soccer fans have brought to DBpedia will carry over to other sports and endeavors!

This article is part of a series.