This article is one of a series describing specific applications of the Real Semantics system. In particular, it describes how Real Semantics builds the Ontology2 Edition of DBpedia 2015-10. In this particular case, we import RDF data directly from DBpedia into a triple store (OpenLink Virtuoso) without transformation. However, the sheer bulk of the data, and the time involved, force us to use sophisticated automation to produce a quality product. Working together with the AWS Marketplace, we can provide a matched set of code, data and hardware that can have people working with a large RDF data set in just minutes.
We'll start this article by describing one of the challenges of Linked Data: that is, how to publish and consume data when the costs of data processing, storage, and transfer start to become significant. We explain how cloud publishing lets us square that circle, coupling the costs of handling data to the time and place where people need it. We finish off the business case by considering another case where cloud technology changes the economics of computing and can make formerly impossible things possible.
I (Paul Houle) started making cloud data products long before the development of Real Semantics, so I talk about the history of those efforts and how they contributed to the design decisions behind henson, the component of Real Semantics that constructs data-rich applications on cloud servers. Although Real Semantics works just fine on an ordinary computer, it is nice to be able to call upon cluster and cloud resources when necessary, and essential to be able to package code and data reliably for deployment to end users. Finally, we discuss the differences between the AWS platform targeted by Real Semantics and alternatives such as Microsoft's Azure and Hyper-V, as well as container-based systems.
Big Data is a popular buzzword, but how many people are actually doing it? I got interested in the semantic web years ago, when I was making the site animalphotos.info; back then I was doing the obvious thing, making a list of animal species, then searching Flickr for pictures of the animals. I had a conversation with a Wikipedia admin, who turned me on to DBpedia. Between DBpedia and Amazon's Mechanical Turk I no longer needed to make a list or look at the photos, but instead I could import photographs with a structured and scalable process.
In this time period, I went from exploiting general purpose RDF data sources such as DBpedia with traditional tools to my current focus, which is using RDF tools to exploit traditional data sources. Still, at Ontology2 we use DBpedia and Freebase to organize and enrich traditional data sources.
People face a number of challenges using Linked Data sources: it is hard to understand what data a source actually contains, public query endpoints have to impose resource limits that cut off interesting queries, and loading a large data set into your own infrastructure takes serious time, hardware, and know-how.
If you think these problems are bad for DBpedia, think of how hard it is to get a complete view of what's happening at a large corporation!
Understanding what data is there is difficult with the "dereferencing" approach, where you go to a URL like:
http://dbpedia.org/resource/Linked_Data
and then you get back a result that looks something like:
dbr:Linked_data a ns6:Concept , yago:CloudStandards , wikidata:Q188451 , dbo:TopicalConcept ,
        yago:Abstraction100002137 , yago:Measure100033615 , dbo:Genre , owl:Thing ,
        yago:Standard107260623 , yago:SystemOfMeasurement113577171 ;
    rdfs:comment
        "Linked data is een digitale methode voor het publiceren ... de techniek van HTTP-URI's en RDF."@nl ,
        "O conceito ... explorar a Web de Dados."@pt ,
        "In computing, linked data ... can be read automatically by computers."@en ,
        "键连资料(又称:关联数据,英文: Linked data)... 但它们存在着关联。"@zh ,
        "Le Web des données (Linked Data, en anglais) ... l'information également entre machines. "@fr ,
        "In informatica i linked data ... e utilizzare dati provenienti da diverse sorgenti."@it ,
        "En informática ... que puede ser leída automáticamente por ordenadores."@es ,
        "Linked Data (связанные данные) ... распространять информацию в машиночитаемом виде."@ru ,
        "Linked Open Data ... では構造化されたデータ同士をリンクさせることでコンピュータが利用可能な「データのウェブ」の構築を目指している。"@ja ;
    rdfs:label "Dati collegati"@it , "鍵連資料"@zh , "Web des données"@fr , "بيانات موصولة"@ar ,
        "Linked data"@ru , "Linked data"@nl , "Linked Open Data"@ja , "Linked data"@en ,
        "Linked data"@pt , "Datos enlazados"@es ;
    dbo:wikiPageExternalLink
        <http://demo.openlinksw.com/Demo/customers/CustomerID/ALFKI%23this> ,
        ns26:LinkedData ,
        <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3121711/> ,
        <http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf> ,
        <http://www.edwardcurry.org/publications/freitas_IC_12.pdf> ,
        <http://knoesis.wright.edu/library/publications/iswc10_paper218.pdf> ,
        <http://virtuoso.openlinksw.com/white-papers/> ,
        <http://nomisma.org/> ,
        <http://www.semantic-web.at/LOD-TheEssentials.pdf> ,
        ns27:the-flap-of-a-butterfly-wing_b26808 ,
        ns25:book ,
        <http://linkeddata.org> ,
        <http://www.ahmetsoylu.com/wp-content/uploads/2013/10/soylu_ICAE2012.pdf> ,
        <http://www2008.org/papers/pdf/p1265-bizer.pdf> ,
        <http://www.community-of-knowledge.de/beitrag/the-hype-the-hope-and-the-lod2-soeren-auer-engaged-in-the-next-generation-lod/> ,
        <http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/> ,
        <http://knoesis.org/library/resource.php?id=1718> ,
        <http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data> ,
        <http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkingOpenData.pdf> ;
    dbo:wikiPageID 11174052 ;
    dbo:wikiPageRevisionID 677967394 ;
    dct:subject dbc:Semantic_Web , dbc:Internet_terminology , dbc:World_Wide_Web ,
        dbc:Cloud_standards , dbc:Data_management , dbc:Distributed_computing_architecture ,
        dbc:Hypermedia ;
    owl:sameAs dbpedia-ja:Linked_Open_Data , dbpedia-ko:링크드_데이터 , dbpedia-el:Linked_Data ,
        dbpedia-es:Datos_enlazados , dbpedia-it:Dati_collegati , dbpedia-nl:Linked_data ,
        wikidata:Q515701 , dbr:Linked_data , dbpedia-pt:Linked_data , dbpedia-fr:Web_des_données ,
        dbpedia-wikidata:Q515701 , <http://rdf.freebase.com/ns/m.02r2kb1> ,
        dbpedia-eu:Datu_estekatuak , yago-res:Linked_data ;
    prov:wasDerivedFrom <http://en.wikipedia.org/wiki/Linked_data?oldid=677967394> ;
    foaf:isPrimaryTopicOf wikipedia-en:Linked_data .
Now this is pretty neat (particularly in that there is a lot of multilingual information, which is invaluable if you are working on global projects such as LEIs), but you are looking at the data through a peephole. You have no idea what other records exist, what records link to this record, what predicates exist in the database, etc.
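For what it's worth, you can do that dereferencing step yourself from the command line. The sketch below is just one way to do it: the Accept header asks DBpedia's server for Turtle instead of the HTML page, and the exact serialization you get back depends on the server's content negotiation.

# Ask for Turtle rather than HTML; -L follows the redirect from the
# /resource/ URI to the URL where the data itself is served.
curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/Linked_Data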
A popular response to that is the Public SPARQL Endpoint, which lets you write SPARQL queries against a data set. SPARQL is flexible, and you can write all kinds of exploratory queries. For instance, the following query finds topics that share a large number of predicate-object pairs with dbr:Diamond_Dogs, a David Bowie album:
select ?s (COUNT(*) as ?cnt) {
    dbr:Diamond_Dogs ?p ?o .
    ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt) LIMIT 10
and if you run this against the DBpedia Public SPARQL endpoint you get a very nice list of similar topics.
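If you want to reproduce that result yourself, one way (a sketch, not the only way) is to submit the query to the public endpoint at http://dbpedia.org/sparql with curl and ask for JSON results:

# Send the similarity query to the public endpoint; the PREFIX line is
# included so the query does not depend on the endpoint's built-in prefixes.
curl -G "http://dbpedia.org/sparql" \
     -H "Accept: application/sparql-results+json" \
     --data-urlencode "query=PREFIX dbr: <http://dbpedia.org/resource/>
       select ?s (COUNT(*) as ?cnt) { dbr:Diamond_Dogs ?p ?o . ?s ?p ?o . }
       GROUP BY ?s ORDER BY DESC(?cnt) LIMIT 10"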
That particular query takes a few seconds to run, but it's easy to write a similar query that consumes more resources such as
select ?s (COUNT(*) as ?cnt) {
    dbr:David_Bowie ?p ?o .
    ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt)
if you run that query on the public SPARQL endpoint (please don't), you'll get a much less nice result:
Virtuoso S1T00 Error SR171: Transaction timed out

SPARQL query:
select ?s (COUNT(*) as ?cnt) {
    dbr:David_Bowie ?p ?o .
    ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt)
This is not just a problem with SPARQL; it's a problem that affects any API. If an API is simple and only allows you to do a limited number of things, the cost of running that API is predictable, so it can be offered for free or for sale at a specific price per API call. If an API lets you do arbitrarily complex queries, however, the cost of a query can vary by factors of a million or more, so resource limits must be applied.
An alternative to the public SPARQL endpoint is the private SPARQL endpoint. Here you install a triple store on your own computer, load data, and then run your own queries. People who follow this route run into two problems: loading a data set the size of DBpedia takes hours of machine time and a fair amount of know-how, and the hardware needed to hold and query the loaded data is a significant expense to buy and maintain yourself.
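To give a flavor of what the load step involves, here is a sketch using Virtuoso's bulk loader; it assumes a local Virtuoso instance listening on the default port 1111, the default dba account, and DBpedia dump files sitting in /data/dbpedia -- all of which are assumptions for illustration, not a recipe.

# Register the dump files with the bulk loader, run the load, and write a
# checkpoint so the data survives a restart; /data/dbpedia must also be
# listed under DirsAllowed in virtuoso.ini.
isql 1111 dba dba exec="ld_dir('/data/dbpedia', '*.ttl.gz', 'http://dbpedia.org');"
isql 1111 dba dba exec="rdf_loader_run();"
isql 1111 dba dba exec="checkpoint;"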
The AWS Marketplace lets us team up with Amazon Web Services to sell you a package of matching hardware, software and data. (See our product, the Ontology2 Edition of DBpedia 2015-10.) It is much easier to automate the build process in the cloud, because we always start with an identical cloud server that has a fast connection to the net, as opposed to an installer that would need to adapt to whatever state your desktop or server is in.
Cloud computing became popular as quickly as it has because it builds upon things we're familiar with. For instance, in Amazon EC2, we're working with servers, disk volumes, virtual networks, and other artifacts that we'd find in any data center. Cloud migration is often a matter of moving applications that are running on real servers onto virtual servers, without a big change in the system architecture.
For ordinary IT applications, the primary benefit of the cloud is simplification of operations. In terms of economics, the capital cost of buying servers is replaced with an hourly rate that is all-inclusive. If you're aggressive about lowering costs and keep your servers busy, you can definitely save money with dedicated servers, but it's a lot of work. The largest and most margin-sensitive companies like Google and Facebook will run their own infrastructure, but for more and more customers, the convenience of the public cloud wins out.
There is a class of applications, however, that can function only in the cloud. John Schring gave a really great talk about how Respawn Games used Microsoft's Azure to support Titanfall, a groundbreaking multiplayer game. I'm going to summarize what he says here:
Up until that point, online shooter games used a peer-to-peer model, where one of the players' computers would be selected to run a game server. This was necessary because the economics did not work for a game based on dedicated servers. With dedicated servers, a game developer would need to buy racks and racks of servers before the game launches, guessing how many would be needed to support the game. Buy too many and it is a financial disaster; buy too few and dissatisfied players will kill the game with bad word of mouth. Although game developers can't predict how many copies of the game will sell or how many people will play in the first week, it's predictable that the game will be played heavily when it first comes out, and then usage will drop off -- meaning that dedicated servers won't be efficiently utilized.
The peer-to-peer model is limiting, however, because the average gaming PC or game console isn't intended to be a server. For instance, most consumer internet connections are asymmetric, with much more bandwidth available for download than for upload. A player's computer is busy playing the game, which limits the resources available to the game server. Traditionally, the complexity of the world in a multiplayer game is limited by this -- players might not be aware of how it is limited, but the size of the world and the complexity of interactions in it are sharply constrained.
Titanfall used the Azure cloud in a straightforward but powerful way. When players joined the game, Respawn's system would launch new game servers in Azure and then tear them down when the game was done. This way, Respawn could put 12 human players (some operating giant robots) in a complex world populated with an even larger number of A.I. characters. Scalability works both ways: Respawn could handle the crush of launch day but still afford to run the game for years afterward with just a trickle of die-hard users, earning loyalty that has gamers waiting for Titanfall 2.
That's the kind of application we want to support with Real Semantics. You don't need a cloud account to use Real Semantics -- you can get work done with it on any ordinary laptop or server -- but if you ever have a job so big that you need a server with 2TB of RAM and 128 CPU cores, it is available at the push of a button.
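As a concrete illustration (with placeholder values), pushing that button amounts to a single call to the AWS command line tools; ami-xxxxxxxx and my-keypair below are hypothetical, and x1.32xlarge is simply the instance type that, as of this writing, offers 128 virtual CPUs and roughly 2TB of RAM.

# Launch a very large server on demand; the AMI id and key pair name
# are placeholders for whatever image and credentials you actually use.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type x1.32xlarge \
    --key-name my-keypair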
By the Spring of 2012 I had tried a number of different ways of getting data out of the Freebase quad dump. It was not that hard to figure out how to get little bits of data out here and there, but without any documentation about how Freebase's proprietary database worked, there wasn't a complete solution.
At that time, Freebase was of interest to people who liked the idea of DBpedia but had problems with the quality of DBpedia data. Because DBpedia is based on information extraction from Wikipedia, errors in DBpedia can be fixed only by changing data in Wikipedia (which has to be done manually) or by editing the extraction rules. Mainstream releases of DBpedia occur every six months or so, which is not fast enough to have a closed feedback loop. (There is DBpedia Live, but it is incomplete and not always reliable.) Freebase, on the other hand, accepted both human and automated edits, and had a quality control process in place that made the data more reliable.
In the Spring of 2012 I cracked the code of the Freebase quad dump, liberating it from their proprietary database. Once the system of naming was figured out, it was straightforward to convert their data dump into industry standard RDF that could be used with industry standard SPARQL databases. At that time, Freebase released an updated quad dump every week, and my first interest was developing a sustainable system to do that conversion on a weekly basis.
Over time, however, things changed. Spurred by my work, Freebase discontinued the quad dump and came out with their own official RDF dump. Over the course of the next year, I developed a new generation of tools, namely the infovore framework, aimed at the cleanup and purification of the official RDF dump, since the official dump contained hundreds of millions of superfluous, duplicative, invalid and sometimes harmful facts. Working towards the goal of a product that "just works" for users, I started developing products for the AWS Marketplace.
In principle, producing a product for the AWS Marketplace is a matter of producing an Amazon Machine Image (AMI). This can be done by installing the software on a cloud server and then making a snapshot.
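In script form, that snapshot step is a single call to the AWS command line tools; the instance id and the image name below are placeholders for whatever build server and naming convention you actually use.

# Create an AMI from a running, fully provisioned build server.
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "ontology2-dbpedia-2015-10-build" \
    --description "DBpedia 2015-10 loaded into a triple store"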
If you had to do this just once or twice, it wouldn't be hard to do by hand, but I learned very quickly that one might need to produce images multiple times. For instance, Amazon has testers who test and approve AWS Marketplace applications -- mostly they check that they can follow the instructions and get the product to work, but they do this with different server types, in different availability zones. Like any other software tester, however, they'll send the product back to you if they find a problem. Unavoidably this costs at least a day in calendar time (a reason to do things right the first time) -- and it is especially expensive when rebuilding the server means waiting hours to load a large dataset.
The straw that broke the camel's back was the Fall 2014 Shellshock bug; this bug affected most Unix users, but it was a particular challenge for the AWS Marketplace because Amazon forced vendors to produce updated machine images. I realized then that one couldn't build a sustainable business around the AWS Marketplace without an automated process for building machine images.
My first attempt at build automation was to use the popular Vagrant software from Hashicorp. Software developers have a way of beating on their machines pretty hard, which means our computers are often different in configuration from each other and from our production servers. It can take a lot of work simply to stack up all of the pieces needed to get our work done, but variations between our dev machines can be a big source of stress when something "works for me" but not for somebody else, leading to errors which are difficult to reproduce.
I started using Vagrant to build development environments, so using it to build machine images loaded with data was workable, despite the fact that Vagrant wasn't really designed for building machine images. (Hashicorp provides something called Packer for that.) The existence of two different tools was my first bit of disagreement with Hashicorp; if I need one configuration file to build a development environment and a different one to build a production environment, aren't we just going back to the bad old days when things were normally out of sync?
Over time, I came to see Vagrant more as part of the problem than part of the solution. For one thing, it had a number of features, such as virtual networking, that didn't work with AWS and that I didn't use anyway. I was doing almost all of my provisioning with the bash shell and not using the many built-in plug-ins for Puppet, Chef, Salt, Ansible and other configuration management systems, which seemed to me to be just another complex thing to learn. The final issue was that I found the Ruby-based internal DSL awkward, having to write obscure code like:
Vagrant.configure("2") do |config| config.vm.synced_folder ".", "/vagrant", disabled: true config.vm.provider :aws do |aws, override| aws.instance_type="c3.xlarge" end config.vm.provision "shell", inline: "echo 'source .bash.d/*.sh' >> .bashrc", privileged: true config.vm.provision "shell", inline: "mkdir .bash.d", privileged: false config.vm.provision "file", source: "~/.netrc", destination: ".netrc" config.vm.provision "shell", inline: "apt-get install -y python-dateutil" config.vm.provision "shell", inline: "apt-get install -y axel" ... config.vm.provision "shell", path: "install-virtuoso.sh", privileged: true config.vm.provision "shell", path: "add-virtuoso-service.sh", privileged: false ... end
when all I really wanted was a few lines of shell script. Small problems became really irksome because it would take five minutes for a server to start and scripts to run, only to discover that a file I'd copied into the system had been rejected by the Unix server because it had Windows line endings. (A problem that the Vagrant developers think should be addressed by adding more complexity to your Vagrantfiles.) Up until Spring 2016 I was still using Vagrant to build machine images for the LEI Demo, but with new products in the AWS Marketplace about to be developed, I saw that I didn't have time to waste on a build automation framework that didn't value my time.
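For comparison, the actual provisioning work in that Vagrantfile boils down to a short shell script along the following lines; this is a sketch run on the target server, the elided steps stay elided, and copying .netrc from the build machine would become an scp or similar.

#!/bin/bash
# Roughly the same provisioning steps as the Vagrantfile above, as plain
# shell; assumes root where apt-get and the Virtuoso install are involved.
echo 'source .bash.d/*.sh' >> ~/.bashrc
mkdir -p ~/.bash.d
# ~/.netrc gets copied over from the build machine at this point
apt-get install -y python-dateutil
apt-get install -y axel
# ...
bash install-virtuoso.sh
bash add-virtuoso-service.sh
# ...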
Just as I was getting more serious about automation, bigger changes were happening in the world of generic databases. Google, which had bought Freebase, made the decision to shut it down, after incorporating Freebase data and technology into the proprietary Google Knowledge Graph. In the meantime, some interest has shifted to Wikidata, and DBpedia has been steadily improving. In October 2015 I also attended the first DBpedia meetup to be held in the U.S., which was at Stanford University.
As I was learning about popular interests in databases like DBpedia and Freebase, I was simultaneously (i) focused on exploiting traditional data sources with RDF tools and (ii) using data from sources such as DBpedia and Freebase to support that. Excited by the improvements in DBpedia 2015-10, I began the process of rolling out a new generation of AWS marketplace products based on Real Semantics packaging technology.
henson, named after the illustrious puppeteer, is the Real Semantics module that handles cloud packaging. Because it starts from a clean slate, henson is much faster and more reliable than our old Vagrant-based system for the kind of work that we do. Here is a diagram of the build process:
This process has the following steps: henson launches a clean cloud server; runs the provisioning scripts that install the code and load the data; retrieves the provisioning logs, cloud-init.log and cloud-init-output.log, from the server; checks those logs; and, if all is well, creates the machine image. henson applies both general-purpose and task-specific rules to determine if the build was successful before moving on to the image creation procedure; if the build fails, the ruleset extracts and displays a small set of lines immediately around the failure point to help the operator quickly diagnose the problem. Note that we can set henson up to do different combinations of functions in different cases. If we were interested in using the server directly, for instance, instead of creating an image, that is straightforward. henson can be used to create development servers and then later be used to create a product image. Unlike Vagrant and some other tools, henson puts developer productivity first -- by minimizing the possibility of making a mistake with a build, running the build as fast as possible, automatically diagnosing problems with the build, and notifying the operator of the status of the build immediately, so that person's attention can be put completely on something else while waiting.
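To make the general-purpose rules concrete, here is a minimal sketch of the kind of check involved -- not henson's actual ruleset, just its flavor -- assuming cloud-init-output.log has already been copied back from the build server.

# Fail the build if the provisioning log contains an obvious error, and
# show a few lines of context around the matches so the operator can see
# the failure point without opening the whole log.
PATTERN='error|failed|traceback'
if grep -q -i -E "$PATTERN" cloud-init-output.log ; then
    grep -n -i -C 5 -E "$PATTERN" cloud-init-output.log | head -40
    echo "build failed -- see the log excerpt above" >&2
    exit 1
fi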
The Ontology2 Edition of DBpedia 2015-10 is simple to use; to make it simple to use, however, we need to extract complexity from the process of constructing the product. Cloud computing lets us: (i) package hardware, data, and code together for customers, (ii) choose from a wide range of small and large hardware configurations, and (iii) start every product build from a repeatable place. In contrast to the Legal Entity Identifier demo, which uses henson's product-building ability to create a production server, with our DBpedia product we create a machine image which we can both deliver to customers and use to support additional projects, such as the LEI demo, that require a global database of concepts, places and names.