K Schema Level 0

Overview

The K Schema is closely related to RDFS, OWL, JSON-LD, XML Schema, RDF Data Shapes and other schema languages for the web. The K Schema is different, however, in that it takes a pragmatic approach, built on 15 years of experience with RDF. Like RDFS, the K Schema benefits from the simplicity of the graph data model, but unlike RDFS, it is specifically designed to model legacy data structures (relational, object-oriented, etc.) and to target the problems of data cleaning and validation that, so far, have not been standardized in the RDF community.

Goals of K Schema Level 0

The K Schema is organized in several levels. The first level, level zero, is about naming names and defining instances. So far as RDF itself is concerned, no entity is different from any other. For instance, the resource URI <http://example.com/someExample/> is just an entity, no different from any other. Yet, practically, some kinds of entities have an official status and others do not.

We need look no further than RDF Schema (RDFS) to find terms with specific meanings. For instance, with the common prefix declaration

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    

we can refer to resources such as rdfs:Container, rdfs:Class, rdfs:subClassOf and rdfs:seeAlso. These names are meaningful to RDF Schema and to tools that process RDF Schema, but they are not special in RDF itself. One consequence of this is that there is no error checking when you load RDF data, or create RDF objects in Java. For instance, if you write some RDF like

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://example.com> .

:ProfessionalAthlete a rdfs:Claas .
    

where we wrote rdfs:Claas (sic) instead of the correct rdfs:Class, conventional RDF and RDFS tools will accept the statement without complaint, since they simply ignore vocabulary they don't know. This is a problem if you are writing facts in Turtle or queries in SPARQL; I'm convinced that one of the major reasons people get frustrated with SPARQL is that they don't get feedback when they spell terms incorrectly.

This problem also turns up when writing RDF-powered software in a conventional programming language. For instance, to implement the K schema and all of the other features it has, the Java code in Real Semantics continuously refers to RDF terms. You could put the names of terms as strings where they are used in the code, but then you have two problems: (i) this is error prone, and (ii) you have to write boilerplate code to turn those strings into the objects you need, which is cumbersome and wasteful. To provide an easy programming experience, Real Semantics generates Java classes from K Schemas in a way that is integrated with the traditional Maven build process. A significant advantage we get out of this is that Java IDEs such as IntelliJ IDEA see the generated Java classes and can autocomplete RDF terms, which is just the first way we can empower IDEs with schema intelligence.
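To make the contrast concrete, here is a minimal hand-written sketch of the two approaches. This is plain Java with string URIs rather than Jena objects, and it is not the code Real Semantics actually generates; only the namespace URI comes from the schema, everything else is illustrative.

```java
// Sketch: raw string terms vs. generated constants (not the real generated code).
class StubSketch {
    static final String NAMESPACE = "http://rdf.legalentityidentifer.info/vocab/";

    // With raw strings, a typo such as "RegistredCity" compiles silently
    // and fails only at query time, if it is noticed at all:
    static final String rawTerm = NAMESPACE + "RegistredCity";

    // A generated constant, by contrast, is checked by the compiler and
    // autocompleted by the IDE; misspell it and the build breaks:
    static final String RegisteredCity = NAMESPACE + "RegisteredCity";

    public static void main(String[] args) {
        System.out.println(RegisteredCity);
    }
}
```

The compiler can only help you with names it knows about, which is exactly what the generated stub classes provide.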

The K Schema is divided into a few levels because of the "chicken and egg" problems that come up when you try to implement it. Most importantly, the system that generates Java stubs can't generate stubs for itself, so by creating a "Level 0" schema that implements the minimum functionality needed for the stub generator, we can use a fully functional stub generator and the rest of our bag of tricks to implement the rest of the K Schema.

K Schemas are about a namespace

A K Schema is fundamentally a group of statements about a namespace. We declare the properties of a namespace like so:

@prefix : <http://rdf.legalentityidentifer.info/vocab/> .
@prefix k: <http://rdf.ontology2.com/korzybski/> .

:
    a k:Namespace ;
    k:inPackage "com.ontology2.matcher" ;
    k:className "LEI" .

:RegisteredCity
    a k:Property;
    k:description "The name of the city in which this entity is registered." .

:NonconformantIdentifier
    a k:Resource ;
    k:description "An LEI Record with an incorrectly structured identifier" .

In this case we are defining properties of the <http://rdf.legalentityidentifer.info/vocab/> namespace, using the empty prefix : for convenience because we'll be using it a lot in this schema. The core things going on here are:

  1. declaring that <http://rdf.legalentityidentifer.info/vocab/> is a namespace with the statement "a k:Namespace",
  2. specifying the Java package name for the stub class to be generated, and...
  3. specifying the name of the stub class to be generated.
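The three declarations above can be seen working together in a toy sketch. This is not the korzybski generator itself, just an illustration of how k:inPackage and k:className combine into the fully qualified name of the stub class:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration (not the real generator): pull k:inPackage and
// k:className out of a schema and build the stub's fully qualified name.
class StubNaming {
    static String fqcn(String schema) {
        Matcher pkg = Pattern.compile("k:inPackage \"([^\"]+)\"").matcher(schema);
        Matcher cls = Pattern.compile("k:className \"([^\"]+)\"").matcher(schema);
        if (pkg.find() && cls.find()) {
            return pkg.group(1) + "." + cls.group(1);
        }
        throw new IllegalArgumentException("no k:inPackage/k:className found");
    }

    public static void main(String[] args) {
        String schema =
            ": a k:Namespace ;\n" +
            "  k:inPackage \"com.ontology2.matcher\" ;\n" +
            "  k:className \"LEI\" .\n";
        System.out.println(fqcn(schema));  // com.ontology2.matcher.LEI
    }
}
```

The real generator parses the schema as Turtle with Jena rather than with regular expressions, but the naming decision it makes is the same.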

Inside Real Semantics, there is a Maven plugin named korzybski-maven-plugin which is responsible for generating Java stubs. Typically you will generate the stub in the same Maven module in which you define the K schema, and you make that happen by adding the following to the <plugins> section of your pom.xml for that module:

<plugin>
    <groupId>com.ontology2.rdf</groupId>
    <artifactId>korzybski-maven-plugin</artifactId>
    <version>1.1-SNAPSHOT</version>
    <configuration>
        <schemaFiles>
            <schemaFile>src/main/resources/documentation/matcher.ttl</schemaFile>
            <schemaFile>src/main/resources/documentation/lei.ttl</schemaFile>
            <schemaFile>src/main/resources/documentation/korzybski.ttl</schemaFile>
        </schemaFiles>
    </configuration>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>rdfStubs</goal>
            </goals>
        </execution>
    </executions>
</plugin>

That bit of configuration causes the stub generator to run before the regular Java source code in your module is compiled. The key setting is the path to each of your K schemas, given in a <schemaFile> element. There is just one more thing you have to do: tell the compiler where to look for the generated source code files.

<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>build-helper-maven-plugin</artifactId>
    <version>1.10</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>add-source</goal>
            </goals>
            <configuration>
                <sources>
                    <source>generated-sources/korzybski/main/java</source>
                </sources>
            </configuration>
        </execution>
    </executions>
</plugin>
    

At this point, the code generator is integrated with your code with very little drama. When you type mvn install the code generator runs before the code is compiled, so you get something that looks like:

package com.ontology2.matcher;

import org.apache.jena.rdf.model.*;

public class LEI {

    public final static String NAMESPACE="http://rdf.legalentityidentifer.info/vocab/";

    public static Resource url(String s) {
        return ResourceFactory.createResource(NAMESPACE+s);
    }
    public static Property prop(String s) {
        return ResourceFactory.createProperty(NAMESPACE,s);
    }

    public static final Property RegisteredCity = prop("RegisteredCity");
    public static final Property RegisteredAddress2 = prop("RegisteredAddress2");
    public static final Property SuccessorLEI = prop("SuccessorLEI");

    ...

    public static final Resource ConformantIdentifier = url("ConformantIdentifier");
    public static final Resource NonconformantIdentifier = url("NonconformantIdentifier");
    public static final Resource OperationalLOU = url("OperationalLOU");
}

Note that the K schema distinguishes between Resources and Properties, partly because the Jena framework makes the distinction, and partly because it is an important distinction for validation. In Jena, Property is a subclass of Resource, so you can use a Property anywhere you can use a Resource, but not the other way around. Here are some examples in code.

    // count the number of :RegisteredCity facts
    int cityFactCount=Iterators.count(model.listStatements(null,LEI.RegisteredCity,(RDFNode) null));
    // count the number of nonconformant identifiers
    int badCount=Iterators.count(model.listStatements(null,RDF.type,LEI.NonconformantIdentifier));

The thing to notice is that the three arguments to listStatements() from Jena are the subject, predicate, and object fields respectively. The subject is of type Resource, the predicate is of type Property and the object is of type RDFNode. RDFNode is a superclass of Resource. Resource can be either a named node with a URI or an anonymous (blank) node, whereas an RDFNode can be either a Resource or a literal value.
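The type relationships described above can be illustrated with a self-contained toy model. These are stand-in interfaces, not Jena's actual classes, but the subtype structure is the same: a Property fits anywhere a Resource or an RDFNode is expected.

```java
// Toy model of Jena's type hierarchy (not Jena itself).
interface RDFNode { }
interface Resource extends RDFNode { String getURI(); }
interface Property extends Resource { }

class Hierarchy {
    // Typed like the three slots of Jena's listStatements(Resource, Property, RDFNode);
    // null means "match anything" in that slot.
    static String pattern(Resource subject, Property predicate, RDFNode object) {
        return (subject == null ? "?" : subject.getURI())
             + " " + predicate.getURI()
             + " " + (object == null ? "?" : "o");
    }

    static final Property RegisteredCity =
        () -> "http://rdf.legalentityidentifer.info/vocab/RegisteredCity";

    public static void main(String[] args) {
        // The Property is accepted in the predicate slot, and, being a
        // Resource and an RDFNode, it would fit the other slots as well.
        System.out.println(pattern(null, RegisteredCity, null));
    }
}
```

This is why the generated stub can hand you a Property for a term like LEI.RegisteredCity and you can still pass it wherever a plain Resource is needed.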

Additional Constraints at the namespace level

By default, a namespace with a K schema is open; that is, you are allowed to use any resources or properties wherever you want. This is the same behavior you'd see with an OWL or RDFS schema. Sometimes this is a good behavior, for instance when a namespace contains a huge number of resources. The <http://dbpedia.org/resource/> namespace contains more than ten million distinct resources, such as dbpedia:Asparagus_Fern; it is neither feasible nor desirable to name every term that could be used in such a namespace.

The stub classes support this, in the sense that the url() and prop() methods on the stub classes will create resources and properties for you quite easily:

Resource fern=DBPEDIA.url("Asparagus_Fern");

In cases where you don't want to allow people to use unregistered properties or resources in a namespace, you can forbid them by writing

<http://example.com/someNamespace> a :ClosedPropertyNamespace .

or

<http://example.com/someNamespace> a :ClosedNamespace .

to specify that properties that aren't listed in the namespace are not allowed. Note that this does not have a direct effect on the Jena RDF engine or on conventional RDF tools; in particular, external triple stores will not automatically enforce the validation. At any time, however, the K validator can be applied to a graph to check compliance with a K schema. This is something you can do manually in Java code, but it is also an operation that Real Semantics can apply automatically when you take advantage of the automatic loading system or process data with the mogrifier.
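As a sketch of what such a check involves, closing a namespace amounts to flagging any predicate that falls inside the namespace but was never declared in the schema. This is plain Java, independent of Jena, and not the actual K validator API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of a closed-namespace check (not the real K validator):
// a predicate violates the schema if it is inside the namespace
// but absent from the set of registered terms.
class ClosedNamespaceCheck {
    static List<String> violations(String namespace, Set<String> registered,
                                   List<String> predicates) {
        return predicates.stream()
            .filter(p -> p.startsWith(namespace) && !registered.contains(p))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String ns = "http://rdf.legalentityidentifer.info/vocab/";
        Set<String> registered = new HashSet<>(Arrays.asList(ns + "RegisteredCity"));
        List<String> used = Arrays.asList(
            ns + "RegisteredCity",        // registered: fine
            ns + "RegistredCity",         // typo: flagged
            "http://example.com/other");  // outside the namespace: ignored
        System.out.println(violations(ns, registered, used));
    }
}
```

Notice that terms outside the closed namespace pass through untouched; closing one namespace says nothing about the rest of the graph.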