Case Study: The Ontology2 edition of DBpedia 2015-10

Overview

This article is one of a series describing specific applications of the Real Semantics system. In particular, it describes how Real Semantics builds the Ontology2 Edition of DBpedia 2015-10. In this case, we import RDF data directly from DBpedia into a triple store (OpenLink Virtuoso) without transformation. However, the sheer bulk of the data, and the time involved, forces us to use sophisticated automation to produce a quality product. Working together with the AWS Marketplace, we can provide a matched set of code, data and hardware that can have people working with a large RDF data set in just minutes.

We'll start this article by describing one of the challenges of Linked Data: how to publish and consume data when the costs of data processing, storage, and transfer start to become significant. We explain how cloud publishing lets us square that circle, coupling the costs of handling data to the time and place where people need it. We finish off the business case by considering another example where cloud technology changes the economics of computing and can make formerly impossible things possible.

I (Paul Houle) started making cloud data products long before the development of Real Semantics, so I talk about the history of those efforts and how they contributed to the design decisions behind henson, the component of Real Semantics that constructs data-rich applications on cloud servers. Although Real Semantics works just fine on an ordinary computer, it is nice to be able to call upon cluster and cloud resources when necessary, and essential to be able to package code and data reliably for deployment to end users. Finally, we discuss the differences between the AWS platform targeted by Real Semantics and alternatives such as Microsoft's Azure and Hyper-V, as well as container-based systems.

Linked Data and its discontents

Big Data is a popular buzzword, but how many people are actually doing it? I got interested in the semantic web years ago, when I was making the site animalphotos.info; back then I was doing the obvious thing, making a list of animal species, then searching Flickr for pictures of the animals. I had a conversation with a Wikipedia admin, who turned me on to DBpedia. Between DBpedia and Amazon's Mechanical Turk I no longer needed to make a list or look at the photos, but instead I could import photographs with a structured and scalable process.

In this time period, I went from exploiting general purpose RDF data sources such as DBpedia with traditional tools to my current focus, which is using RDF tools to exploit traditional data sources. Still, at Ontology2 we use DBpedia and Freebase to organize and enrich traditional data sources.

People face a number of challenges using Linked Data sources, such as:
  • understanding what data exists when all you can do is dereference one URI at a time
  • resource limits on public SPARQL endpoints, which cut off all but the cheapest queries
  • the hardware, time, and skill it takes to load a large data set into a private triple store

If you think these problems are bad for DBpedia, think of how hard it is to get a complete view of what's happening at a large corporation!

Understanding what data is there is difficult with the "dereferencing" approach, where you go to a URL like:

    http://dbpedia.org/resource/Linked_Data

and then you get back a result that looks something like:

dbr:Linked_data
  a ns6:Concept , yago:CloudStandards , wikidata:Q188451 , dbo:TopicalConcept , yago:Abstraction100002137 ,
    yago:Measure100033615 , dbo:Genre , owl:Thing, yago:Standard107260623 , yago:SystemOfMeasurement113577171 ;
  rdfs:comment
    "Linked data is een digitale methode voor het publiceren ... de techniek van HTTP-URI's en RDF."@nl ,
    "O conceito ... explorar a Web de Dados."@pt ,
    "In computing, linked data ... can be read automatically by computers."@en ,
    "键连资料(又称:关联数据,英文: Linked data)... 但它们存在着关联。"@zh ,
    "Le Web des données (Linked Data, en anglais)  ... l'information également entre machines. "@fr ,
    "In informatica i linked data ...  e utilizzare dati provenienti da diverse sorgenti."@it ,
    "En informática ... que puede ser leída automáticamente por ordenadores."@es ,
    "Linked Data (связанные данные) ...  распространять информацию в машиночитаемом виде."@ru ,
    "Linked Open Data ... では構造化されたデータ同士をリンクさせることでコンピュータが利用可能な「データのウェブ」の構築を目指している。"@ja ;
  rdfs:label                "Dati collegati"@it , "鍵連資料"@zh , "Web des données"@fr , "بيانات موصولة"@ar ,
    "Linked data"@ru , "Linked data"@nl , "Linked Open Data"@ja , "Linked data"@en , "Linked data"@pt ,
    "Datos enlazados"@es ;
    dbo:wikiPageExternalLink  <http://demo.openlinksw.com/Demo/customers/CustomerID/ALFKI%23this> , ns26:LinkedData ,
    <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3121711/> , <http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf>,
    <http://www.edwardcurry.org/publications/freitas_IC_12.pdf> , <http://knoesis.wright.edu/library/publications/iswc10_paper218.pdf> ,
    <http://virtuoso.openlinksw.com/white-papers/> , <http://nomisma.org/> , <http://www.semantic-web.at/LOD-TheEssentials.pdf> ,
    ns27:the-flap-of-a-butterfly-wing_b26808 , ns25:book , <http://linkeddata.org> ,
    <http://www.ahmetsoylu.com/wp-content/uploads/2013/10/soylu_ICAE2012.pdf>, <http://www2008.org/papers/pdf/p1265-bizer.pdf> ,
    <http://www.community-of-knowledge.de/beitrag/the-hype-the-hope-and-the-lod2-soeren-auer-engaged-in-the-next-generation-lod/> ,
    <http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/> , <http://knoesis.org/library/resource.php?id=1718> ,
    <http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data> ,
    <http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkingOpenData.pdf> ;
  dbo:wikiPageID            11174052 ;
  dbo:wikiPageRevisionID    677967394 ;
  dct:subject               dbc:Semantic_Web , dbc:Internet_terminology ,
    dbc:World_Wide_Web , dbc:Cloud_standards , dbc:Data_management , dbc:Distributed_computing_architecture ,
    dbc:Hypermedia ;
  owl:sameAs                dbpedia-ja:Linked_Open_Data , dbpedia-ko:링크드_데이터 , dbpedia-el:Linked_Data ,
    dbpedia-es:Datos_enlazados , dbpedia-it:Dati_collegati , dbpedia-nl:Linked_data , wikidata:Q515701 , dbr:Linked_data ,
    dbpedia-pt:Linked_data , dbpedia-fr:Web_des_données , dbpedia-wikidata:Q515701 , <http://rdf.freebase.com/ns/m.02r2kb1> ,
    dbpedia-eu:Datu_estekatuak , yago-res:Linked_data ;
  prov:wasDerivedFrom       <http://en.wikipedia.org/wiki/Linked_data?oldid=677967394> ;
  foaf:isPrimaryTopicOf     wikipedia-en:Linked_data .

Now this is pretty neat (particularly in that there is a lot of multilingual information, which is invaluable if you are working on global projects such as LEIs), but you are looking at the data through a peephole. You have no idea what other records exist, what records link to this record, what predicates exist in the database, etc.
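
The Turtle above is what you get back when you dereference the URL with HTTP content negotiation. As a rough illustration of the mechanics, here is a minimal sketch in Python using the third-party requests library; DBpedia's exact redirect behavior may vary, so treat this as an assumption rather than a specification:

# A minimal sketch of Linked Data dereferencing with content negotiation.
# Assumes the third-party 'requests' library; dbpedia.org redirects the
# /resource/ URI to a data document when asked for Turtle.
import requests

uri = "http://dbpedia.org/resource/Linked_Data"
response = requests.get(
    uri,
    headers={"Accept": "text/turtle"},  # ask for RDF rather than an HTML page
    allow_redirects=True,               # follow the redirect to the data document
    timeout=30,
)
response.raise_for_status()
print(response.text[:2000])             # the start of a Turtle document like the one above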

A popular response to that is the Public SPARQL Endpoint, which lets you write SPARQL queries against a data set. SPARQL is flexible, and you can write all kinds of exploratory queries. For instance, the following query finds topics that share a large number of predicate-object pairs with dbr:Diamond_Dogs, a David Bowie album:

select ?s (COUNT(*) as ?cnt) {
   dbr:Diamond_Dogs ?p ?o .
   ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt) LIMIT 10

and if you run this against the DBpedia Public SPARQL endpoint you get a very nice list of similar topics.

s                                                             cnt
http://dbpedia.org/resource/Diamond_Dogs                      158
http://dbpedia.org/resource/Aladdin_Sane                       50
http://dbpedia.org/resource/Station_to_Station                 47
http://dbpedia.org/resource/Young_Americans_(album)            44
http://dbpedia.org/resource/Low_(David_Bowie_album)            41
http://dbpedia.org/resource/Lodger_(album)                     41
http://dbpedia.org/resource/Never_Let_Me_Down                  39
http://dbpedia.org/resource/Let's_Dance_(David_Bowie_album)    38
http://dbpedia.org/resource/Hunky_Dory                         36
http://dbpedia.org/resource/Tonight_(David_Bowie_album)        36
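
To run queries like this from a program rather than a web form, you might use the SPARQLWrapper library for Python. The following is a minimal sketch, assuming that library is installed and the public endpoint is reachable; it runs the same similarity query shown above:

# A minimal sketch of querying the public DBpedia endpoint from Python.
# Assumes the third-party SPARQLWrapper library (pip install SPARQLWrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?s (COUNT(*) AS ?cnt) {
       dbr:Diamond_Dogs ?p ?o .
       ?s ?p ?o .
    } GROUP BY ?s ORDER BY DESC(?cnt) LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["cnt"]["value"])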

That particular query takes a few seconds to run, but it's easy to write a similar query that consumes more resources, such as:

select ?s (COUNT(*) as ?cnt) {
   dbr:David_Bowie ?p ?o .
   ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt)

if you run that query on the public SPARQL endpoint (please don't), you'll get a much less nice result:

Virtuoso S1T00 Error SR171: Transaction timed out

SPARQL query:
select ?s (COUNT(*) as ?cnt) {
   dbr:David_Bowie ?p ?o .
   ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt)

This is not just a problem with SPARQL, it's a problem that affects any API. If an API is simple and only allows you to do a limited number of things, the cost of running that API is predictable, so it can be offered for free or for sale at a specific price per API call. If an API lets you do arbitrarily complex queries, however, the cost of a query can vary by factors of a million or more, so resource limits must be applied.

An alternative to the public SPARQL endpoint is the private SPARQL endpoint. Here you install a triple store on your own computer, load data, and then run your own queries. People who follow this route run into two problems:

  1. it takes a lot of hardware. You need 16 to 32GB of memory to comfortably work with DBpedia in a triple store. Memory upgrades aren't that expensive today, but most laptop computers have a limited number of memory slots. Other people don't want to tie up their computer for hours with a task that slows it down and may need to be repeated.
  2. it takes a lot of time and technical skill; for one thing, many triple stores lack an effective bulk loader. OpenLink Virtuoso has a good bulk loader, but it takes effort to configure it for great performance and reliability. It can take several hours to load a large data set, and if mistakes mean you need to repeat the load several times, this can be a cumbersome and frustrating process. (Without automation, you might be tempted to live with a data set that is less than perfect rather than do a reload to get it right.) A sketch of what the bulk load involves follows this list.
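
To give a sense of what the bulk load involves, here is a minimal sketch of driving the OpenLink Virtuoso bulk loader from Python by shelling out to Virtuoso's isql command-line client. The host, port, password, number of parallel loaders, and data directory are placeholders, and the data directory must appear in the DirsAllowed setting of virtuoso.ini:

# A minimal sketch of a Virtuoso bulk load; all connection details are placeholders.
import subprocess

ISQL = ["isql", "localhost:1111", "dba", "dba"]   # host:port, user, password

def isql_exec(statement):
    """Run one SQL statement through Virtuoso's isql command-line client."""
    result = subprocess.run(ISQL + ["exec=" + statement], check=True,
                            capture_output=True, text=True)
    return result.stdout

# 1. Register the data files with the bulk loader, naming the target graph.
isql_exec("ld_dir('/data/dbpedia', '*.ttl.gz', 'http://dbpedia.org');")

# 2. Run several loader processes in parallel; each rdf_loader_run() call
#    keeps pulling files from the load_list table until none are left.
loaders = [subprocess.Popen(ISQL + ["exec=rdf_loader_run();"]) for _ in range(4)]
for loader in loaders:
    loader.wait()

# 3. Check that nothing in our directory failed to load (ll_state 2 = loaded).
leftovers = isql_exec("select ll_file, ll_error from DB.DBA.load_list"
                      " where ll_state <> 2;")
if "/data/dbpedia" in leftovers:
    raise RuntimeError("some files did not load:\n" + leftovers)

# 4. Flush the loaded triples to disk so the database survives a restart.
isql_exec("checkpoint;")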

The AWS Marketplace lets us team up with Amazon Web Services to sell you a package of matching hardware, software and data. (See our product, the Ontology2 Edition of DBpedia 2015-10.) It is much easier to automate the build process in the cloud, because we always start with an identical cloud server that has a fast connection to the net, compared with an installer that would need to adapt to whatever state your desktop or server is in.

Doing things with the cloud that can't be done otherwise

Cloud computing became popular as quickly as it has because it builds upon things we're familiar with. For instance, in Amazon EC2, we're working with servers, disk volumes, virtual networks, and other artifacts that we'd find in any data center. Cloud migration is often a matter of moving applications that are running on real servers onto virtual servers, without a big change in the system architecture.

For ordinary IT applications, the primary benefit of the cloud is simplification of operations. In terms of economics, the capital cost of buying servers is replaced with an hourly rate that is all inclusive. If you're aggressive about lowering costs and keep your servers busy, you can definitely save money with dedicated servers but it's a lot of work. The largest and most margin-sensitive companies like Google and Facebook will run their own infrastructure, but for more and more customers, the convenience of the public cloud wins out.

There is a class of applications, however, that can function only in the cloud. Jon Shiring gave a really great talk about how Respawn Entertainment used Microsoft's Azure to support Titanfall, a groundbreaking multiplayer game. I'm going to summarize what he said here:

Up until that point, online shooter games used a peer-to-peer model, where one of the players' computers would be selected to run a game server. This was necessary because the economics did not work for a game based on dedicated servers. With dedicated servers, a game developer would need to buy racks and racks of servers before the game launches, guessing how many would be needed to support the game. Buy too many and it is a financial disaster; buy too few and dissatisfied players will kill the game with bad word of mouth. Although game developers can't predict how many copies of the game will sell or how many people will play in the first week, it's predictable that the game will be played heavily when it first comes out, and then usage will drop off -- meaning that dedicated servers won't be efficiently utilized.

The peer-to-peer model is limiting, however, because the average gaming PC or game console isn't intended to be a server. For instance, most consumer internet connections are asymmetric, with much more bandwidth available for download rather than upload. A player's computer is busy playing the game, which limits the resources available to the game server. Traditionally, the complexity of the world in a multiplayer game is limited by this -- players might not be aware of how it is limited, but the size of the world and the complexity of interactions in it is sharply constrained.

Titanfall used the Azure cloud in a straightforward but powerful way. When players joined the game, Respawn's system would launch new game servers in Azure and then tear them down when the game was done. This way, Respawn could put 12 human players (some operating giant robots) in a complex world populated with an even larger number of A.I. characters. Scalability works both ways: Respawn could handle the crush of launch day but still afford to run the game for years afterward with just a trickle of die-hard users, earning loyalty that has gamers waiting for Titanfall 2.

That's the kind of application we want to support with Real Semantics. You don't need a cloud account to use Real Semantics -- you can get work done with it on any ordinary laptop or server -- but if you ever have a job so big that you need a server with 2TB of RAM and 128 CPU cores, it is available at the push of a button.

Packaging DBpedia With Real Semantics

:BaseKB, DBpedia and a bit of history.

By the Spring of 2012 I had tried a number of different ways of getting data out of the Freebase quad dump. It was not that hard to figure out how to get little bits of data out here and there, but without any documentation about how Freebase's proprietary database worked, there wasn't a complete solution.

At that time, Freebase was of interest to people who were interested in DBpedia but who had problems with the quality of DBpedia data. Because DBpedia is based on information extraction from Wikipedia, errors in DBpedia can be fixed only by changing data in Wikipedia (which has to be done manually) or by editing the extraction rules. Mainstream releases of DBpedia occur every six months or so, which is not fast enough to have a closed feedback loop. (There is DBpedia Live, but this is incomplete and not always reliable) Freebase, on the other hand, accepted both human and automated edits, and had a quality control process in place that made the data more reliable.

In the Spring of 2012 I cracked the code of the Freebase quad dump, liberating it from their proprietary database. Once the system of naming was figured out, it was straightforward to convert their data dump into industry-standard RDF that could be used with industry-standard SPARQL databases. At the time, Freebase released an updated quad dump every week, and my first interest was developing a sustainable system to do that conversion on a weekly basis.

Over time, however, things changed. Spurred by my work, Freebase discontinued the quad dump and came out with their own official RDF dump. Over the course of the next year, I developed a new generation of tools, namely the infovore framework, aimed at the cleanup and purification of the official RDF dump, since the official dump contained hundreds of millions of superfluous, duplicative, invalid and sometimes harmful facts. Working towards the goal of a product that "just works" for users, I started developing products for the AWS Marketplace.

In principle, producing a product for the AWS Marketplace is a matter of producing an Amazon Machine Image (AMI). This can be done by installing the software on a cloud server and then making a snapshot.

If you had to do this just once or twice, it wouldn't be hard to do by hand, but I learned very quickly that one might need to produce images multiple times. For instance, Amazon has testers who test and approve AWS Marketplace applications -- mostly they check that they can follow the instructions and get the product to work, but they do this with different server types, in different availability zones. Like any other software tester, however, they'll send the product back to you if they find a problem. Unavoidably this costs at least a day in calendar time (a reason to do things right the first time) -- and it is especially expensive when rebuilding the server means waiting hours to load a large dataset.

The straw that broke the camel's back was the Fall 2014 Shellshock bug; this bug affected most Unix users, but was particularly a challenge for the AWS Marketplace because Amazon forced vendors to produce updated machine images. I realized then that one couldn't build a sustainable business around the AWS Marketplace without an automated process for building machine images.

Vagrant, Puppet, and Chef, oh my...

My first attempt at build automation was to use the popular Vagrant software from HashiCorp. Software developers have a way of beating on their machines pretty hard, which means our computers are often different from each other and from our production servers in configuration. It can take a lot of work simply to stack up all of the pieces needed to get our work done, and variations between our dev machines can be a big source of stress when something "works for me" but not for somebody else, leading to errors which are difficult to reproduce.

I had started using Vagrant to build development environments, so using it to build machine images loaded with data was workable, despite the fact that Vagrant wasn't really designed for building machine images. (HashiCorp provides something called Packer for that.) The existence of two different tools was my first bit of disagreement with HashiCorp; if I need one configuration file to build a development environment and a different one to build a production environment, aren't we just going back to the bad old days where things were normally out of sync?

Over time, I came to see Vagrant more as part of the problem than part of the solution. For one thing, it had a number of features, such as virtual networking, that didn't work with AWS and that I didn't use anyway. I was doing almost all of my provisioning with the bash shell and not using the many built-in plug-ins for Puppet, Chef, Salt, Ansible and other configuration management systems, which seemed to me to be just more complex things to learn. The final issue was that I found the Ruby-based internal DSL awkward, having to write obscure code like:

Vagrant.configure("2") do |config|
  config.vm.synced_folder ".", "/vagrant", disabled: true
  config.vm.provider :aws do |aws, override|
    aws.instance_type = "c3.xlarge"
  end

  config.vm.provision "shell", inline: "echo 'source .bash.d/*.sh' >> .bashrc", privileged: true
  config.vm.provision "shell", inline: "mkdir .bash.d", privileged: false
  config.vm.provision "file", source: "~/.netrc", destination: ".netrc"
  config.vm.provision "shell", inline: "apt-get install -y python-dateutil"
  config.vm.provision "shell", inline: "apt-get install -y axel"
  ...
  config.vm.provision "shell", path: "install-virtuoso.sh", privileged: true
  config.vm.provision "shell", path: "add-virtuoso-service.sh", privileged: false
  ...
end

when all I really wanted was a few lines of shell script. Small problems became really irksome because it would take five minutes for a server to start and for scripts to run, only to discover that a file I'd copied into the system was rejected by the Unix server because it had Windows line endings. (A problem that the Vagrant developers think should be addressed by adding more complexity to your Vagrantfiles.) Up until Spring 2016 I was still using Vagrant to build machine images for the LEI Demo, but with new products in the AWS Marketplace about to be developed, I saw that I didn't have time to waste on a build automation framework that didn't value my time.

A changing environment

Just as I was getting more serious about automation, bigger changes were happening in the world of generic databases. Google, which had bought Freebase, made the decision to shut it down, after incorporating Freebase data and technology into the proprietary Google Knowledge Graph. In the meantime, some interest has shifted to Wikidata, and DBpedia has been steadily improving. In October 2015 I also attended the first DBpedia meetup to be held in the U.S., which was at Stanford University.

While keeping up with popular interest in databases like DBpedia and Freebase, I stayed (i) focused on exploiting traditional data sources with RDF tools and (ii) committed to using data from sources such as DBpedia and Freebase to support that work. Excited by the improvements in DBpedia 2015-10, I began the process of rolling out a new generation of AWS Marketplace products based on Real Semantics packaging technology.

Henson: packaging code and data with Real Semantics

Image generation as if productivity mattered

henson, named after the illustrious puppeteer, is the Real Semantics module that handles cloud packaging. Starting from a clean slate, henson is much faster and more reliable than our old Vagrant-based system for the kind of work that we do.

The build process has the following steps:

  1. Prior to the product build, most of the files that support the build process, including several gigabytes of DBpedia data files, have been transferred to and stored in Amazon S3. This means the files are quickly available to the build server, and the process is not affected by the relatively slow internet connection between the henson server and AWS.
  2. Working from an RDF job description, the henson server compiles a cloud-init script. This is primarily a bash shell script that is assembled from bits and pieces of bash scripts in henson's library, united with a standard mechanism for passing configuration information from the henson server to the build server. In most cases, henson can embed all the required configuration in the user data. If necessary, however, henson will upload additional configuration data for this instance to S3. (A sketch of this launch-and-monitor pattern, written against the AWS SDK, appears after this list.)
    1. Just before launching the build server, henson also creates an Amazon SQS message queue to receive communications returning from the build server. This way we can monitor a wide range of events on any number of servers with a single polling loop and a single strategy for handling events.
    2. henson embeds hooks in the cloud-init shell script that report on the progress of the script and report errors in a standardized way using the message queue. henson knows with certainty whether the script succeeded or failed in terms of shell errors and at which stage the failure occurred.
    3. The cloud-init script is designed to function without further interaction with the henson server (other than progress tracking) until the database build is complete. This addresses a problem seen with Vagrant, which is that if a process takes several hours to complete, there is a good chance that the internet connection between the control server and the build server will glitch out. Transient internet outages create real and imagined errors in the build process which undermine the reliability and determinism of the build; minimizing round-trip communications dramatically improves reliability and performance.
    4. henson assigns an IAM Instance Profile to the build server. This delegates authority to the build server in a secure way so that the build server can access S3, SQS, and any other resources in AWS. This supports the independence of the build server, since the build server can provision any resources it needs in the cloud for itself, such as additional disk volumes.
    5. henson creates a directory in a designated location in which information such as server configuration, process status, timing information, captured server logs and other information about the build is stored.
  3. The cloud-init script downloads supporting files from S3 to do its job; if possible, software packages are installed using apt-get, but if necessary, specialized software packages are compiled and installed with Maven or with the GNU toolchain. To save time and improve determinism, we will often make a machine image with the software installed, then start from this image when we load the database.
  4. The cloud-init script next scans S3 to measure the total size of the RDF data files to be loaded. The script then creates an EBS volume large enough to handle the input data files, attaches it, formats the volume, and copies the files to the volume.
  5. The cloud-init script examines the resources available on the machine, such as RAM and direct-attached disks, and updates the configuration file for the database with tuning parameters ideal for the environment. Preferably we use a build server (such as an R3-series server) which has an internal SSD for high-throughput, low-latency I/O.
  6. The cloud-init script then configures and starts the database bulk loader. In the case of OpenLink Virtuoso, this involves the following stages:
    1. Inserting rows into a relational table that tell Virtuoso where to find the files
    2. Running one or more instances of a stored procedure that starts the bulk load process
    3. Polling a relational table to detect completion of the load
    This series of steps converts what would otherwise be an asynchronous operation into a synchronous operation that is easily composable with the steps that precede and follow it.
  7. The database contents are flushed to disk, and the database is shut down.
  8. A new EBS volume is created, large enough to hold the database files plus an additional margin for temporary files. Database files are copied to the EBS volume and the database configuration files are edited to find the database in the new location.
  9. The cloud-init script completes, sending a message to the SQS queue notifying the henson server that the load is complete. Had the shell script (or a command run by the script) unexpectedly completed with a non-zero exit value, an SQS message indicating failure would have been sent to the henson server instead.
  10. The henson server uses the SFTP protocol to download diagnostic files from the server, such as cloud-init.log and cloud-init-output.log. henson applies both general-purpose and task-specific rules to determine if the build was successful before moving on to the image creation procedure; if the build fails, the ruleset extracts and displays a small set of lines immediately around the failure point to help the operator quickly diagnose the problem.
  11. A final procedure erases any security-sensitive information on the build server, such as security keys, and clears the database password stored on the instance. When a machine boots off the image, the user's public key will be installed and the system will generate a new database password. This satisfies the security requirements for a product in the AWS Marketplace. This script then shuts down the server cleanly.
  12. henson then commands the EC2 control plane to make a machine image from the build server.
  13. When the machine image is ready, henson destroys the build server and all of the associated EC2 resources.
  14. Finally henson displays a report including the identity of the created image and (if it is running on a machine that supports sound with JavaFX) plays a pleasant chime to indicate success. Different sounds are played in case of warning or failure.
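
To make the launch-and-monitor pattern concrete, here is a minimal sketch of the kind of calls henson makes, written directly against the boto3 AWS SDK for Python. The AMI id, instance type, instance profile, queue name, and the message format on the queue are all hypothetical placeholders, and the real system assembles the cloud-init script from an RDF job description rather than reading it from a local file.

# A minimal sketch of a henson-style build: launch, monitor via SQS, image, clean up.
# All identifiers below are hypothetical placeholders.
import json
import boto3

ec2 = boto3.resource("ec2")
ec2_client = boto3.client("ec2")
sqs = boto3.resource("sqs")

# Backchannel: the build server reports progress, success, and failure here.
queue = sqs.create_queue(QueueName="henson-build-events")

# The cloud-init script that downloads data from S3, loads the database,
# erases secrets, and shuts the server down (assembled by henson in reality).
with open("build-dbpedia.sh") as f:
    user_data = f.read()

instance = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",              # hypothetical base image
    InstanceType="r3.2xlarge",                    # plenty of RAM plus a local SSD
    MinCount=1, MaxCount=1,
    UserData=user_data,                           # cloud-init runs this at first boot
    IamInstanceProfile={"Name": "henson-build"},  # grants the build server S3/SQS access
)[0]
instance.wait_until_running()

# Poll the queue until the build server reports that it has finished.
finished = False
while not finished:
    for message in queue.receive_messages(WaitTimeSeconds=20):
        event = json.loads(message.body)
        message.delete()
        if event.get("status") == "failed":
            raise RuntimeError("build failed at stage " + event.get("stage", "unknown"))
        finished = event.get("status") == "complete"

# Turn the cleanly stopped build server into a machine image, then clean up.
image = instance.create_image(Name="ontology2-dbpedia-2015-10")
ec2_client.get_waiter("image_available").wait(ImageIds=[image.id])
instance.terminate()
queue.delete()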

Note that we can set henson up to do different combinations of functions in different cases. If we were interested in using the server directly, for instance, instead of creating an image, that is straightforward. henson can be used to create development servers and then later be used to create a product image. Unlike Vagrant and some other tools, henson puts developer productivity first -- by minimizing the possibility of making a mistake with a build, running the build as fast as possible, automatically diagnosing problems with the build, and notifying the operator of the status of the build immediately so that person's attention can be put completely on something else while waiting.

Paths not taken (at least so far)

The current version of the henson system is focused on a single platform, Amazon Web Services, in the pursuit of simplicity. As it becomes necessary, however, it should be straightforward to adapt to other systems that work in a similar way:
  • Azure, DigitalOcean, Google Cloud Platform, OpenStack:
    • The overall procedure for building servers and images is much the same on major cloud platforms. In particular, all major cloud platforms support cloud-init. Although cloud-init was originally introduced in Ubuntu Linux, it has since been incorporated in Amazon Linux and Red Hat Linux. Such universal support is critical, since the generation of cloud-init scripts is a core part of henson's design.
    • In Real Semantics, we would add support for a new cloud by first using javagate to generate an RDF bridge to the Java SDK for that cloud's API. Any particular cloud requires configuration specific to that cloud: in AWS, for instance, we need to configure a region, availability zone, VPC, instance type and subnet. In any other cloud there is a set of similar but somewhat different concepts. No matter what, the operator writes a Turtle file that defines a template for the configuration, and henson can patch that configuration with just a few properties, without needing much understanding of, or modification to, the cloud-specific configuration. (This approach means that, even in AWS, we can accommodate either the sophisticated user who needs to match the configuration of an existing network or the beginner who wants a default configuration that "just works".)
    • Although many competing clouds have their own messaging implementations, Amazon SQS can still be used as a backchannel for henson because the volume of information transferred through the queue is low and the small number of round trips makes communication latency unimportant. As AWS gives users 1 million free queue API calls a month and charges $0.50 per additional million requests, any costs involved are trivial.
    • Similarly, AWS S3 can be accessed from anywhere on the internet, so it can still be used as a file store for systems running in another cloud. Many other cloud platforms provide APIs compatible with Amazon S3.
  • VirtualBox, Hyper-V, VMware, and other local virtualization hosts:
    • Compared to cloud platforms, a virtualization host on your local computer has faster network communication with your machine and incurs no cloud computing costs.
    • Platforms such as VirtualBox can be configured to use cloud-init, so the henson model should be adaptable to this environment.
  • Docker, Kubernetes, Mesos:
    • Container technology is more efficient than virtualization technology because:
      • Container systems isolate components without running multiple copies of the operating system; in particular, this means that Docker containers can be created in seconds instead of minutes.
      • Some container systems (particularly Docker) can make changes to filesystems in distinct layers that save space and the time cost of repeating software installation steps.
    • Despite those big advantages, container systems come with a lot of baggage:
      • The container environment is not the same as a normal UNIX environment, and we can't take it for granted that every application we want to run will run unmodified in a container
      • The image build process in Docker is specifically designed for handling relatively small developer images, compared to images that could contain 50GB or more of data
      • Fairly large differences exist between Docker environments, such as Docker for Windows and the Amazon EC2 Container Service. For instance, Docker supports a number of different storage drivers which work in different environments and have different quirks.
    • Docker images are initialized with Dockerfiles, which use shell scripts much like cloud-init but are substantially different.
    • Most containerization systems come not only with an isolation mechanism but also mechanisms for service discovery, cluster orchestration, etc. That's the good news. The bad news is that these systems are entirely different and incompatible.
    • Some large companies such as Twitter and Google have standardized on container systems for data-rich applications. These companies, however, can afford to throw large teams of highly skilled developers at working around the constraints of such systems. As a product aimed at extending the reach of rich data applications to a broad middle of organizations, we see containerization as premature for Real Semantics.
    • Specifically, container systems do not (in general) hide the complexity of the underlying platform. For instance, if we used the Amazon EC2 Container Service we would still need intensive knowledge of instance types, availability zones, storage options and networking in the EC2 Cloud.
  • Microsoft Windows as a guest operating system:
    • You can run henson and other Real Semantics components on a Windows computer; in fact, this is what we use for most development work.
    • At this time, henson only creates and controls hosts running the Linux operating system.
    • A facility called cloudbase-init exists for Windows that parallels cloud-init; together with the increasing support for bash in Windows, this could lead to support for Windows guests.

Conclusion

The Ontology2 Edition of DBpedia 2015-10 is simple to use; to make it simple to use, however, we had to push a great deal of complexity into the process that constructs the product. Cloud computing lets us: (i) package hardware, data, and code together for customers, (ii) choose from a wide range of small and large hardware configurations, and (iii) start every product build from a repeatable place. In contrast with the Legal Entity Identifier demo, which uses henson's product-building ability to create a production server, for our DBpedia product we create a machine image that we can both deliver to customers and use to support additional projects, such as the LEI demo, that require a global database of concepts, places and names.