RDF: a new slant

Too Many Data Formats

All the time, we need to feed a computer program a few facts, and traditionally that involves working with property files, XML, JSON, YAML, CSV, and so on. Once a system gets complex and anywhere near "enterprise" scale, having many different file formats for configuration, reference data, and for specifying the work done by the system becomes a source of stress for its operators.

Once the job is specified, the program needs to consume and produce data, which may be provided through various formats and APIs. The need to translate data from one format to another, again and again, gets in the way of using a wide range of powerful tools, such as logical inference, machine learning, and hybrid systems that seek and find the reality behind data in the real world.

Real Semantics talks all kinds of data formats, but it sees them all through a single data model called RDF/K, an extension of the RDF schema language. Suppose you want to state a few facts about the world -- we suggest you write a Turtle file. Turtle is unbeatable for writing facts by hand, and thinking through Turtle will help you build a mental model of what RDF/K is and what it can do.

Capturing a business record

While developing our Legal Entity Identifier site, we captured a data record from an XML file. The system wrote this data out as a Turtle file, and we included that file in the unit tests so we can be certain this record is processed correctly no matter what changes are made to the code.

Using a single data model (RDF/K) means we need just a single mechanism to capture the facts the system believes at intermediate stages of inference, calculation, or decision. This extreme traceability, available when required and costless when not, is one of many unique features of Real Semantics that are necessary for compliance with the tough BCBS 239 standard in both normal times and in a crisis. (Who needs the stress of cleaning data in a crisis?)

This example is a simple business entity record from the Legal Entity Identifier system:

    @prefix lei:   <http://rdf.legalentityidentifer.info/vocab/> .
    @prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

    []
        a                               lei:LegalEntity , lei:ConformantIdentifier ;
        lei:BusinessRegistryIdentifier  "" ;
        lei:BusinessRegistryName        "N/A" ;
        lei:EntityLegalForm             "OTHER" ;
        lei:EntityStatusCode            "ACTIVE" ;
        lei:HeadquarterAddress1         "C/O C T Corporation System";
        lei:HeadquarterAddress2         "155 Federal Street" ;
        lei:HeadquarterAddress3         "Suite 700" ;
        lei:HeadquarterCity             "Boston" ;
        lei:HeadquarterCountryCode      "US" ;
        lei:HeadquarterPostalCode       "02110" ;
        lei:HeadquarterRegion           "US-MA" ;
        lei:LEIAssignmentDate           "2013-05-24T09:30:20.883Z"^^xsd:dateTime ;
        lei:LEINextRenewalDate          "2016-05-04T09:01:27.494Z"^^xsd:dateTime ;
        lei:LEIRecordLastUpdate         "2015-05-07T01:52:22.058Z"^^xsd:dateTime ;
        lei:LEIStatusCode               "ISSUED" ;
        lei:LOUID                       "EVK05KS7XY1DEII3R011" ;
        lei:LegalEntityIdentifier       "549300I00FSB0O13VI67" ;
        lei:LegalJurisdiction           "US" ;
        lei:RegisteredAddress1          "C/O C T Corporation System" ;
        lei:RegisteredAddress2          "155 Federal Street" ;
        lei:RegisteredAddress3          "Suite 700" ;
        lei:RegisteredCity              "Boston" ;
        lei:RegisteredCountryCode       "US";
        lei:RegisteredName              "BlackRock Funds - BlackRock Short Obligations Fund" ;
        lei:RegisteredPostalCode        "02110" ;
        lei:RegisteredRegion            "US-MA" ;
        lei:ValidationSources           "FULLY_CORROBORATED" .

If you squint while you look at it, you might see things in common with many popular data formats. You should, because Turtle is closely related to many of them.

Nested and Ordered Structures in RDF and Java

Here is another example:

    @prefix :   <http://example.com/appliances/> .
    @prefix dbpedia:   <http://dbpedia.org/resource/> .

    [
        a :WashingMachine,:FrontLoadingWashingMachine ;
        :capacity 4.8 ;
        :supportedVoltages 120, 240 ;
        :phases ( "Soak" "Wash" "Rinse" "Spin" ) ;
        :energyCostEstimate
        [
            :source dbpedia:United_States_Environmental_Protection_Agency ;
            :hotWaterSource "electric" ;
            :annualEstimatedCost 16.00
        ],
        [
            :source dbpedia:United_States_Environmental_Protection_Agency ;
            :hotWaterSource "natural gas" ;
            :annualEstimatedCost 14.00
        ]
    ] .

Note that the Turtle parser is not aware of schemas, vocabularies, and so forth. It takes the code you enter and turns it into a graph, without validating that you are using the right facts in the right way. That's fine, because a schema for RDF/K, a K-Schema, expresses the allowable vocabulary and structures and can be used to validate data and/or groom it to a standard. For now, I'm not using a schema; I'm just coining names in the <http://example.com/> namespace because that's simple.

If you wanted to express these data in the Java programming language, you'd probably imagine a class that looks something like:

public class WashingMachine {
    Float capacity;
    Set<Integer> supportedVoltages;
    List<String> phases;
    Set<EnergyCostEstimate> energyCostEstimate;
}

It takes a tiny amount of metadata to connect that Java class with the RDF written above. One way to do it is to package the metadata with the class in the form of a few Java Annotations:

@Prefix(name = "", uri = "http://example.com/appliances/" )
public class WashingMachine {
    @Property Float capacity;
    @Property Set<Integer> supportedVoltages;
    @Property List<String> phases;
    @Property Set<EnergyCostEstimate> energyCostEstimate;
}
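
The @Prefix and @Property annotations belong to Real Semantics; as a rough sketch of how such annotations might be declared (the retention policy, the Dishwasher demo class, and the uriFor helper are our assumptions, not the actual library):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical declaration of @Prefix: binds a Java class to an RDF namespace
// so that annotated field names can be expanded into full property URIs.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Prefix {
    String name();
    String uri();
}

// Hypothetical declaration of @Property: marks a field for transfer to/from RDF.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Property {
}

// Invented demo class, annotated the same way as WashingMachine above.
@Prefix(name = "", uri = "http://example.com/appliances/")
class Dishwasher {
    @Property Float capacity;
}

class PrefixDemo {
    // Expand an annotated field name into a full URI using the class's @Prefix.
    static String uriFor(Class<?> cls, String fieldName) {
        try {
            Prefix p = cls.getAnnotation(Prefix.class);
            if (p == null
                    || cls.getDeclaredField(fieldName).getAnnotation(Property.class) == null)
                throw new IllegalArgumentException("no RDF mapping for " + fieldName);
            return p.uri() + fieldName;
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With runtime retention, an annotation processor can read the prefix off the class and the markers off the fields at configuration time, which is all the metadata the mapping needs.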

The only metadata in this case is (i) the default prefix to map Java names to, and (ii) an assertion that we want to transfer data between a field and RDF. With that in place, we can convert the RDF statements above into Java data with the greatest of ease:

    @Test
    public void parseWash() throws InvocationTargetException, IllegalAccessException {
        WashingMachine m=new WashingMachine();
        Resource that=getOnly("http://example.com/appliances/WashingMachine");
        configurator.configure(m,that);
        assertEquals("Soak",m.phases.get(0));
        assertEquals("Wash",m.phases.get(1));
        assertEquals("Rinse",m.phases.get(2));
        assertEquals("Spin",m.phases.get(3));
    }

In this test case, we find the only record of type :WashingMachine and convert it to a Java object (by creating the WashingMachine instance and then applying configurator.configure); then we can check that we got the phases of the cycle in the correct order.

This method of annotation is useful for getting data in and out of Java classes that are written with Real Semantics in mind. The mapping is similar to JSON-LD, but it works better because List and Set are used widely and exposed through the static typing of the language, in contrast to JSON-LD, which adds the new concepts @list and @set. A strong advantage of doing it this way is that all of the parts are in one place, so there is no risk of updating the class without updating the metadata.

Real Semantics can also work with Java objects that are not aware of Real Semantics; as in the case above, it looks at the Java metadata and processes it with rules that recognize common idioms such as Java Beans, plus sometimes a little additional metadata you supply. Like the Spring framework, Real Semantics can create arbitrary objects configured with data from RDF. RDF reasoning systems, such as the Jena rules language, can do the kind of reasoning that Spring does, but can also reason about that data in different ways to understand, validate, or visualize the construction of a system. RDF also has the powerful SPARQL query language, which lets you immediately apply analytics to anything.

Viewing Java data in RDF

Here is an example of getting data into Real Semantics from classes that were written before Real Semantics existed. Like most large Java programs, Real Semantics is built by the open source Apache Maven system, which expresses and documents the physical structure of the program. The documentation generator that builds this book repurposes this information -- the fast track is to use the parser built into Maven to convert POM files into MavenProject objects inside Java.

The POM file is an XML document, and Real Semantics could ingest it directly as a tree. However, Maven interpolates parameters and implements specific forms of inheritance and inference, so we want to work with not the surface structure but the deep structure, the Effective POM, that actually controls Maven.

The documentation generator that built this book reads a map of the maven modules that comprise Real Semantics. This process has the following steps:

  1. Real Semantics scans the Java introspection metadata for the MavenProject class and converts it into an RDF graph, giving us a complete and lossless model of the classes, fields, and methods built into that class and selected classes it depends on.
  2. Real Semantics applies one of several heuristic ruleboxes to automatically generate a mapping from Java to RDF. In this case we use one that recognizes the "Java Beans" standard. If an existing rulebox is not satisfactory, rules can be overridden, or a few mappings can be added manually with a fact patch.
  3. Real Semantics generates a set of stub classes and/or objects that implement the transformation.
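
Step 1 above can be sketched with plain Java reflection. In this sketch, the Pom stand-in class and the jvm: vocabulary are invented for illustration; the real system works on MavenProject and keeps far more detail:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a pre-existing bean such as Maven's MavenProject.
class Pom {
    private String artifactId;
    public String getArtifactId() { return artifactId; }
    public void setArtifactId(String a) { artifactId = a; }
}

class ClassScanner {
    // Walk the introspection metadata of a class and emit simple
    // subject/predicate/object triples describing its fields and methods.
    // The jvm: predicate names are invented for this example.
    static List<String> scan(Class<?> cls) {
        List<String> triples = new ArrayList<>();
        String subject = "<java:" + cls.getName() + ">";
        triples.add(subject + " jvm:className \"" + cls.getName() + "\"");
        for (Field f : cls.getDeclaredFields())
            triples.add(subject + " jvm:hasField \"" + f.getName() + "\"");
        for (Method m : cls.getDeclaredMethods())
            triples.add(subject + " jvm:hasMethod \"" + m.getName() + "\"");
        return triples;
    }
}
```

Once the class shape lives in a graph like this, step 2's mapping rules become ordinary queries over that graph (e.g., "find a getX/setX pair and emit a property mapping").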

Because of the intelligence under the hood, you can read the MavenProject object without thinking at all.

    Stub<MavenProject> stub=new CreateStub().create(MavenProject.class);
    MavenProject project = MavenProjectFactory.getMavenProject(pomfile);
    Model that=stub.toModel(project);

The RDF output you get looks pretty natural:

@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix unq:   <http://rdf.ontology2.com/unqualified/> .

[
    unq:artifactId                "javagate" ;
    unq:description
        """It will be necessary to import and export data between RDF and objects that were not developed with Real Semantics in
           mind.  This can be done in a remarkably transparent way,  given that there are conventions,  such as Java Beans,  that
           the system can take advantage of.  Javagate converts class metadata from Java into an RDF graph,  which can be queried
           and transformed with rules.  From this,  it create a Stub object which does the Java to RDF transformation."""^^xsd:string ;
    unq:groupId                   "com.ontology2.rdf";
    unq:id                        "com.ontology2.rdf:javagate:jar:1.1-SNAPSHOT";
    unq:modelVersion              "4.0.0" ;
    unq:version                   "1.1-SNAPSHOT";
    unq:name                      "javagate";
    unq:packaging                 "jar" ;
    unq:modules                   ();
    unq:runtimeClasspathElements  ();
    ... many more facts ...
] .

At this point, we can scan a large number of POM files and put all the facts into one RDF graph, which the Jena framework calls a Model.

What does that buy us?

Just above, we used a Java library (Java being our host language) to extract information from a POM, then imported that data into an RDF graph. Our project consists of a number of POM files, so we import them all into a single RDF graph. What good is that?

For one thing, we can write queries in the SPARQL language, which is closely related to SQL. The POM files together form a map of the Real Semantics application that is built into this book. Data in hand, we can write the following SPARQL query:

prefix unq: <http://rdf.ontology2.com/unqualified/>

select ?id ?name ?description  {
    ?project unq:id ?id .
    ?project unq:name ?name .
    OPTIONAL { ?project unq:description ?description .}
}  ORDER BY ?id

This produces a SPARQL result set, much like a SQL result set, that we use to draw a map of Real Semantics, module by module. That map is displayed on this page; here is a little sample:

docminister
Generator and weaver of reports and documentation. Captures documentation about the input data, specifications, and software, taking advantage of mechanisms that already exist to express metadata and documentation. This applies both to documenting software like Real Semantics itself and to creating a bundle of reports, tied with a bow, that explain some range of natural or social phenomena.
henson
Henson is the Real Semantics component that configures, creates, and snapshots virtual machines with software and data in a cloud environment. It ensures that Real Semantics can get any computing resources it needs to create decision-making data products.
java-annotations
Sometimes Real Semantics needs to read metadata about software components it uses. For instance, the mogrifier maintains a catalog of components exposed to end users. If code is written post-Real Semantics, it is practical to stick a few Java annotations on the new Java code to express class-level metadata. This keeps the metadata bundled with the code, which keeps the metadata in sync. This module also defines a few annotations for defining namespace prefixes, which are re-used in the rdfconfig-annotation package that injects class data into RDF.

This is the simplest possible example, but it illustrates that once you get data into RDF format, you can (i) write queries against it and (ii) put multiple objects of various kinds in a single graph and write queries against that. Without a universal data model, you are stuck writing queries in different languages such as SQL and XQuery, if you are lucky enough to have a query language for a particular data format at all. With RDF, SPARQL, and Real Semantics, you can write queries against any kind of data.

Ordered Lists in RDF

Two kinds of collections of things are commonly used in computer programs, and these are List and Set. The items of a list are in a definite order, like the authors of a book, but other properties, such as the collection of booksellers who sell the book, are not. Usually, duplicates are eliminated from an unordered collection, which makes it a set.

Set properties have always been used extensively in RDF. In particular, you can make multiple statements that "some ?subject has ?property with value ?object" quite easily:

    @prefix : <http://example.com/> .

    :Pool a :GameFamily .
    :Pool :hasVariant :EightBall .
    :Pool :hasVariant :NineBall .
    :Pool :hasVariant :Straight .

and thus define a set of variants of the game of Pool. The graph above has four independent facts, written out one at a time. Turtle has a convenient shorthand, which is just:

    @prefix : <http://example.com/> .

    :Pool a :GameFamily ;
       :hasVariant :EightBall , :NineBall, :Straight .

The semicolon lets you state an entirely new property, while the comma lets you specify multiple objects that share the same subject and property. This collection has Set semantics because a fact can only enter the graph once.
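
A toy model of this Set semantics, assuming we represent a graph simply as a Java Set of (subject, predicate, object) triples (class and method names here are ours, not a real triple-store API):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of a triple store with Set semantics: a graph is a set of
// (subject, predicate, object) triples, so re-asserting a fact is a no-op.
class TripleSet {
    final Set<List<String>> triples = new LinkedHashSet<>();

    void add(String s, String p, String o) {
        triples.add(Arrays.asList(s, p, o));  // duplicate triples collapse
    }

    int size() {
        return triples.size();
    }
}
```

Asserting `:Pool :hasVariant :EightBall` twice leaves exactly one fact in the graph, which is precisely the behavior the comma shorthand relies on.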

Although ordered lists have been a part of RDF standards from the very beginning, they have been a bit out of fashion in the age of "Linked Data", which involves large and complex datasets such as DBpedia. Ordered lists are missing from common SQL implementations, so, practically, generations of analysts have learned to work around this about 90% of the time; yet the special cases that require ordering hold back general solutions based on legacy technology. Real Semantics, through RDF/K and other features, makes ordered lists easy to work with.

You can write ordered lists in Turtle exactly the same way you would in LISP:

@prefix : <http://example.com/> .

(:A :B :C)

This list exists in itself, apart from any statements that involve it. Really, (:A :B :C) is a name for a blank node such that

@prefix : <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

(:A :B :C)
    rdf:first :A ;
    rdf:rest (:B :C) .
(:B :C)
    rdf:first :B ;
    rdf:rest (:C) .
(:C)
    rdf:first :C ;
    rdf:rest () .

We picture that graph here:

Note also that () is rdf:nil. This particular representation is called a "linked list," and it is quite similar to the LinkedList class in Java. Let's fill out a few of the things you can do with this notation.
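
The rdf:first/rdf:rest shape maps naturally onto a cons cell. Here is a toy Java model of it (not the Jena API; class and method names are ours):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of RDF's list vocabulary: each cell has rdf:first (the member)
// and rdf:rest (the remainder of the list), with NIL standing in for rdf:nil.
class RdfList {
    static final RdfList NIL = new RdfList(null, null);

    final Object first;   // rdf:first
    final RdfList rest;   // rdf:rest

    RdfList(Object first, RdfList rest) {
        this.first = first;
        this.rest = rest;
    }

    // Build (:A :B :C)-style lists right-to-left, the way a Turtle parser
    // expands the collection shorthand into cons cells.
    static RdfList of(Object... members) {
        RdfList l = NIL;
        for (int i = members.length - 1; i >= 0; i--)
            l = new RdfList(members[i], l);
        return l;
    }

    // Walk rdf:rest links to recover the ordered members.
    List<Object> toJavaList() {
        List<Object> out = new ArrayList<>();
        for (RdfList l = this; l != NIL; l = l.rest)
            out.add(l.first);
        return out;
    }
}
```

Walking the rest pointers until you hit NIL recovers the members in order, which is exactly how an ordered List field gets populated from an RDF collection.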

You can state a fact about a list by using a list on the left side of the predicate, like so:

@prefix : <http://example.com/> .

("foo" 75 :something) a :RandomList .

At this point you might be asking, "What am I allowed to put into the list?" and the answer is "anything", at least any kind of RDF Node:

@prefix : <http://example.com/> .

( :hello [ :a :Person ; :named "John" ] (:goodbye 3 4))

Here we see members that are URI resources, such as :hello and :goodbye, but in the middle there are some facts about a blank node, and at the end is another list. This means you can build the same kinds of structures you would in JSON, and even write LISP code in Turtle!

( <fn:numeric-add> 2 2 ) .

given, of course, an implementation of the eval function that works on the list. Practically, Real Semantics tries as much as it can to hide these mechanics from you when it is moving data between RDF and some other format, such as Java.

Representing multiple data values in Java

From the viewpoint of Real Semantics, there are three kinds of Java type:

  1. Primitive: a String, a Java primitive wrapper (Integer, Boolean), or a type that looks like a primitive, such as an OffsetDateTime; typically represented by an RDF literal of some kind.
  2. Composite: a reference to another type known to the Real Semantics system; an instance of such a type would typically be described as a set of RDF properties centering around either a URI or a blank node resource.
  3. Collection: certain generic types, such as Set<T> and List<T>, which can be automatically mapped to appropriate RDF structures.
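
The three-way split can be sketched with reflection. The classification rules below are our reading of the list above, and the Appliance demo class is invented:

```java
import java.lang.reflect.Field;
import java.util.List;
import java.util.Set;

// Invented composite type for the demo.
class EnergyCostEstimate { }

// Invented demo class with one field of each kind.
class Appliance {
    Float capacity;               // primitive-like: maps to an RDF literal
    List<String> phases;          // collection: maps to an RDF list
    EnergyCostEstimate estimate;  // composite: maps to a resource with properties
}

class TypeKinds {
    enum Kind { PRIMITIVE, COMPOSITE, COLLECTION }

    // Check collections first (List/Set), then literal-like types;
    // everything else is treated as a composite resource.
    static Kind classify(Class<?> cls, String fieldName) {
        Class<?> t;
        try {
            t = cls.getDeclaredField(fieldName).getType();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
        if (List.class.isAssignableFrom(t) || Set.class.isAssignableFrom(t))
            return Kind.COLLECTION;
        if (t == String.class || Number.class.isAssignableFrom(t)
                || t == Boolean.class || t.isPrimitive())
            return Kind.PRIMITIVE;
        return Kind.COMPOSITE;
    }
}
```

The order of the checks matters: a List is also an Object, so the collection test has to win before the composite fallback.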

RDF, at a raw level, allows you to use a property any number of times. You are certainly welcome to apply a property exactly once to a subject, like this:

@prefix : <http://example.com/> .

:Orange
    :red 200 ;
    :green 200 ;
    :blue 0 .

and if you want to assign that to a Java class that looks like

class Color {
    Integer red;
    Integer green;
    Integer blue;
}

you are in pretty good shape. The Turtle language lets you write:

@prefix : <http://example.com/> .

:Colour_out_of_Space
    :red 100,200 ;
    :green (200 200 200) ;
    :blue 0 .

in which case there are two values for the red property (without a specific order) and three values for the green property (ordered, as a list of three elements). Either way, you can't stick multiple values into an Integer field, so Real Semantics gives an error message if you try to make a Color from this data. This is a behavior that RDF/K adds to the RDF standard.

The handling of Set<T> and List<T> is straightforward. By default, Real Semantics is permissive about what you can do. That's good, because we find people are often sloppy in choosing Lists vs. Sets (sometimes they use one where the other would do). If we had a class like

class PolyColor {
    Set<Integer> red;
    Set<Integer> green;
    Set<Integer> blue;
    Set<Integer> alpha;
}

Real Semantics can see the schema implied by the types and makes the natural transformation: red is the set [200,100] because order doesn't matter in a set, green is the set [200] because members of a set are unique, blue is the set [0] (we promote a single element to a set or list that contains just that element), and alpha is the empty set [].

If you assign to a List, something similar happens:

class OrderedColor {
    List<Integer> red;
    List<Integer> green;
    List<Integer> blue;
    List<Integer> alpha;
}

red is a list in an arbitrary order (either [100,200] or [200,100] -- a standard Turtle parser can't keep track of the order of "," elements), green is the list [200,200,200] (duplicate elements are allowed and order is preserved), blue is the list [0] (again promoting a single element to a list), and alpha is the empty list [].
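
The coercions described above can be sketched in a few lines. This is an illustration of the rules as we understand them, not the actual RDF/K implementation:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the permissive coercions: however many values a property has in
// the graph, they can be poured into a Set field (duplicates collapse) or a
// List field (duplicates and order survive); an absent property yields the
// empty collection.
class Coerce {
    static Set<Integer> asSet(List<Integer> values) {
        return new LinkedHashSet<>(values);   // [200,200,200] collapses to [200]
    }

    static List<Integer> asList(List<Integer> values) {
        return new ArrayList<>(values);       // order and duplicates preserved
    }
}
```

A single value is just the one-element case of the same rule, and a missing property is the zero-element case, which is why promotion and the empty collection fall out for free.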