HTML: the official document language of Real Semantics

Introduction

One goal of Real Semantics is the unification of document management, ontology management, database management and source code control. If you're working on a complex domain, it's critical to split many hairs very precisely. This is straightforward if we build precise definitions into an ontology, document those definitions with authoritative texts, and then link ontology and instance data to source code and other artifacts that execute these definitions.

As much as possible, Real Semantics breaks with the common practice of using textual template languages to generate documents (including software code and configuration files) and instead relies on internal representations that let us operate on documents with meaningful operations. The image below illustrates the process. Real Semantics incorporates HTML documents (as well as data and other documents) by converting HTML into parse trees and then operating on them with meaningful operators. This makes it possible to break HTML documents into components and then put those components together in a different way. For instance, a Real Semantics application could read an HTML document from a file, extract content from it, and display that content in a template defined by an ordinary HTML file (with no special markup) embedded as a resource in the application.

HTML 5, by formalizing the treatment of imperfectly formed documents, makes it possible for libraries such as JSoup to process HTML in a structure-preserving style similar to how XML documents are commonly processed. Instead of embedding template variables as done with PHP, Freemarker, Handlebars, and other common template systems, Real Semantics applications refer to locations in HTML documents via class and id attributes as well as CSS selectors. Data can be merged with HTML by writing conventional procedural code or by applying matching rules.
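To make the idea of filling a plain HTML template by id concrete, here is a minimal sketch. It is not Real Semantics code: it uses Python's standard html.parser module as a stand-in for a library such as JSoup, and the names IdFiller and fill are hypothetical. It assumes that the elements being filled have explicit closing tags, and it escapes the inserted text on the way in.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class IdFiller(HTMLParser):
    """Re-emit a document, replacing the content of elements whose id is a key of `values`."""

    def __init__(self, values):
        super().__init__(convert_charrefs=False)
        self.values = values
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element whose content is being replaced

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())
        element_id = dict(attrs).get("id")
        if element_id in self.values and tag not in VOID:
            # Escape the data so that "<" and "&" cannot break the markup.
            self.out.append(escape(self.values[element_id]))
            self.skip_depth = 1

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
            if self.skip_depth:
                return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def fill(template, values):
    """Return `template` with each element whose id is in `values` given new text content."""
    filler = IdFiller(values)
    filler.feed(template)
    filler.close()
    return "".join(filler.out)
```

Calling fill('<h1 id="title">placeholder</h1>', {"title": "Tom & Jerry"}) replaces the placeholder text and writes the ampersand as &amp;amp;. The template itself contains no special markup. This sketch drops comments and the doctype, which a real implementation would keep.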

The primary drawback of this scheme is that parsing, modifying, and then generating HTML requires more CPU resources than simply filling variables into a template. Despite this, it is possible to format thousands of documents per second on a single CPU core. The advantages are that we get a rational scheme for dealing with character escaping while retaining the option to apply formatting to text before it is inserted in the template. In addition, the system can read descriptive metadata from HTML meta elements in the HTML head element, extract structural metadata based on the h1-h6 elements as well as the new HTML 5 elements such as section, nav and aside, manage CSS and Javascript inclusions in the document head, make an inventory of hyperlinks, images and media objects, as well as add new features such as document transclusion by using existing elements such as object or by defining custom elements as allowed in HTML 5.
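As a rough illustration of the kind of inventory described above, the following sketch collects meta elements, a heading outline, hyperlinks and images in one pass. It uses Python's standard html.parser rather than the tooling Real Semantics actually uses, and the class and attribute names are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class DocumentInventory(HTMLParser):
    """Collect descriptive metadata, a heading outline, links and images from one document."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # name -> content, from <meta name=... content=...>
        self.outline = []     # (level, heading text) in document order
        self.links = []       # href values of <a> elements
        self.images = []      # src values of <img> elements
        self._heading = None  # (level, [text parts]) while inside an h1-h6 element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag in HEADINGS:
            self._heading = (int(tag[1]), [])

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._heading:
            level, parts = self._heading
            self.outline.append((level, "".join(parts).strip()))
            self._heading = None

    def handle_data(self, data):
        if self._heading:
            self._heading[1].append(data)


def inventory(html):
    """Parse `html` and return the populated DocumentInventory."""
    collector = DocumentInventory()
    collector.feed(html)
    collector.close()
    return collector
```

After inv = inventory(page), the attributes inv.meta, inv.outline, inv.links and inv.images hold the extracted information. Handling of section, nav and aside is left out to keep the sketch short.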

Prior art and decisions made

Earlier efforts in Literate Programming revolved around closely weaving program source code with documentation with the help of a macro language. Two kinds of compiler run against a literate program: one creates a program in an ordinary programming language such as C or LISP, another creates input for a tool like TeX or troff which is turned into print documentation. It's a great idea, but it's had less uptake in the industry than documentation generation, exemplified by Javadoc, where annotations and comments in the source code are used to fill in templates to document, at the very least, the calling API of a software system.

Ted Nelson's Xanadu system is another inspiration to Real Semantics in that Xanadu allows complex linking structures between multiple documents and the creation of composite documents that combine content from multiple documents. Up until very recently, the Xanadu prototypes used unusual and complex data structures to implement linking, micropayments, and history management -- integrating them with existing document management systems would be problematic. Interestingly, in work published in 2016, Xanadocs are implemented by creating an Edit Decision List that instructs the system as to how to assemble a composite document from bits and pieces of ordinary text files. This approach, painting semantic structure onto existing documents and document formats, is exactly how Real Semantics works.
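The general idea of an edit decision list can be sketched in a few lines. This is not the Xanadu format, which is not described here; the Span record with its source, start and length fields is a hypothetical illustration of assembling a composite document from pieces of ordinary text.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """One entry of a hypothetical edit decision list: a slice of a named source text."""
    source: str  # key identifying the source document
    start: int   # character offset where the slice begins
    length: int  # number of characters to take


def assemble(edl, sources):
    """Concatenate the slices named by `edl`, in order, from the `sources` mapping."""
    return "".join(
        sources[span.source][span.start:span.start + span.length] for span in edl
    )
```

Given sources = {"a.txt": "Hello, world", "b.txt": "brave new "}, the list [Span("a.txt", 0, 7), Span("b.txt", 0, 10), Span("a.txt", 7, 5)] assembles to "Hello, brave new world" without modifying either source.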

Real Semantics is first and foremost a system meant to extend a team's ability to manage complex systems and domains. On top of complexity essential to the problem, I believe there is an excessive amount of surplus complexity in computing tools. I wish I could help you cope with the complexity in front of you without teaching you anything new, but unfortunately I can't. However, I can help you make the most of what you already have, such as source code, javadocs, and specification documents -- and maybe you'll write more and better documentation when you have the tools to make better use of it.

Ideally, Real Semantics would be equally comfortable with documents of all kinds. For instance, if the boss writes specification documents in MS Word, let the boss keep doing that. Ultimately we can address the unlimited number of possible inputs by converting everything to use one of a limited number of standards, if only to make Real Semantics itself easier to use and understand.

HTML vs PDF

After a wide investigation I converged on two candidates for a universal document format. One is HTML, and the other is PDF. We like HTML because:

HTML is widely portable. HTML documentation can be viewed on the web with computers, phones, tablets, televisions, game consoles, etc. HTML documentation can be packaged together with images and other media resources to create an EPUB file that works like a book. Alternatively, HTML can be transformed into PDF and printed or used as an electronic book. Nearly all GUI frameworks have some ability to embed HTML text in both desktop and mobile user interfaces.
Text in HTML documents is well-structured in both theory and (much) practice:

HTML text is generally written in sequence: it is easy to keep the main flow of text separate from footnotes, page numbers, marginal notes, pull quotes and such with or without the cooperation of authors.
HTML documents tend to accurately represent tables, hierarchical lists, and similar structures
It is straightforward to embed links between parts of an HTML document and between that document and other documents
Through the use of <span> and <div> as well as highly general mechanisms such as the name, id and class attributes, it is possible to accurately tag text with attributes and meanings.
HTML 5 adds new elements such as <section> and <aside> that help define the structure of documents, and provides clear guidance on how to add additional elements and attributes to the mix.

HTML can be written by hand and with a wide range of tools. Many other document formats can be exported to HTML and many of the documents we are interested in are already in HTML.
HTML is used in Javadocs and is thus completely compatible with extraction of Javadocs from Java source code

PDF has some other good attributes:

PDF can represent documents generated by any system with high visual fidelity, even scanned documents that predate the computer age
Although PDF was developed and championed by Adobe, it is an open standard (which, unlike most ISO standards, is available for free)
PDF readers and writers are widespread on a wide range of platforms
Most importantly, many standard and specification documents that we'd like to use are published in PDF.
PDF has a system, known as "tagging", that can represent the logical structure of a document and tell the difference between visual elements that are part of the text and those, such as hyphens, footnotes and callouts, that are not.

A critical problem with PDF, however, is that PDF documents are not universally tagged. In principle PDF documents generated from other formats (say Word) can be tagged, but the dominant paradigm in PDF conversion is to replace the print driver, which accepts geometry in a format like Windows GDI, PostScript, or PCL. Printing systems are all about putting ink in the right place and don't try to represent the meaning of documents at all, meaning an entirely different conversion approach is needed to create tagged documents. Another issue with tagging is that the PDF specification doesn't describe a complete and interoperable way to apply it -- the PDF/UA standard helps with this, but it is not universal as of 2016.

HTML vs Docbook, XSL-FO, and other XML Formats

Talking with technical publishers and people at various standards bodies, I found that many publishers use XML-derived formats. Docbook is particularly good for documents like the one you are reading right now. It turns out that quite a few technical documents that are distributed in PDF are written in Docbook or some similar XML dialect, transformed with XSLT into XSL-FO, and then rendered as PDF. It is also common to see XML documents processed using high-end design tools such as Adobe InDesign.

Up until the definition of HTML 5, the case for XML over HTML was clearer. Before HTML 5, a major problem with HTML was the correct rendering of ill-formed documents. HTML has never had strong validation, thus web publishers have always published broken documents (for instance, documents where the opening and closing tags do not match). Web browsers patch these up and render them anyway; since Netscape was the dominant web browser in the mid-1990s, web publishers came to depend on the specific (and undocumented) way that Netscape rendered broken documents. Microsoft was able to reverse engineer this and implement this behavior in Internet Explorer, but others were not, leading to an interregnum in which many of us were afraid that Web browsing might someday become impossible off the Microsoft Windows platform.

Mozilla, which later became Firefox, implemented workable handling of ill-formed documents, bringing web browsing back into the world of open source. Although an XHTML standard was published that represented HTML content in XML form, which in theory would force documents to be well-formed, it had a number of practical problems.

HTML 5 is a major step forward because it documents the handling of ill-formed documents, which means that tools like JSoup can parse real-world HTML documents into a parse tree in a predictable way, much like XML documents. Practically, the feedback loop between writing a document and checking its visual representation ensures that the document structure can be captured, and if there are any particular properties that the document structure should have, we can check them against the parse tree.
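As one example of checking a structural property against the parsed document, the sketch below reports places where a heading skips a level (say, an h4 directly after an h2). The property itself is my own illustration, not a rule from Real Semantics, and Python's standard html.parser stands in for JSoup.

```python
from html.parser import HTMLParser


class HeadingLevelCheck(HTMLParser):
    """Record every heading that is more than one level deeper than the previous heading."""

    def __init__(self):
        super().__init__()
        self.previous = 0   # level of the last heading seen, 0 before the first one
        self.problems = []  # (previous level, offending level) pairs

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            level = int(tag[1])
            if level > self.previous + 1:
                self.problems.append((self.previous, level))
            self.previous = level


def skipped_heading_levels(html):
    """Return the list of level jumps found in `html`; an empty list means none were found."""
    check = HeadingLevelCheck()
    check.feed(html)
    check.close()
    return check.problems
```

A document that starts with h2 is also reported, as a jump from level 0 to level 2.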

One strength of HTML is that it is tightly integrated with CSS and gives us the ability to mark text sections with CSS classes that can mark semantic aspects (for instance, concepts such as "city", "constant", "honorific" and "postcode") but also define aspects of presentation for print, screens, speech and other formats. Unlike the XSL-FO specification, CSS is continuously evolving, ensuring that we'll have better rendering options over time.

Which HTML 5?

Sublanguages

Both in specification and practice, HTML 5 is intertwined with several other languages:

CSS: Cascading Style Sheets
SVG: Scalable Vector Graphics
Javascript
MathML: Mathematics Markup Language

Cascading Style Sheets

Real Semantics extensively uses the HTML class attribute and CSS selectors to identify and format document portions. As much as possible, Real Semantics avoids the use of text templates and instead uses CSS selectors (usually using the class and id) to identify places of HTML documents that should be modified or filled with content from a database, RDF model or other source.

Currently Real Semantics does not interpret or process CSS documents, although it does manage them and package them together with HTML documents. In long term planning, however, I recognize that there will eventually be a need to transform CSS style sheets to support the ability to compose multiple HTML documents.

Specifically, if one wishes to mash together two HTML documents there is some possibility that these two documents will use the same CSS class for two different meanings, or worse, use the same id attribute for two different elements. This is the reason why languages such as Java and C++ support namespaces. Furthermore, if we are generating HTML and/or CSS programmatically we need to make sure that we don't generate conflicting classes and ids, and thus require something like the hygienic macro facility that is found in many versions of LISP.
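Detecting such conflicts before composing two documents is simple to sketch. The code below, again using Python's standard html.parser with hypothetical names, collects the id and class values used by each document and reports the overlap. A shared id is a real conflict. A shared class is only a candidate, since both documents may intend the same meaning.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id and class value used in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        # The class attribute is a whitespace-separated list of names.
        self.classes.update((attrs.get("class") or "").split())


def collect(html):
    """Parse `html` and return the populated IdCollector."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector


def collisions(html_a, html_b):
    """Return the id and class names that appear in both documents."""
    a, b = collect(html_a), collect(html_b)
    return {"id": a.ids & b.ids, "class": a.classes & b.classes}
```

Resolving the conflicts (for example by prefixing names per document) would also require rewriting any CSS that refers to them, which is the transformation discussed above.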

(As an aside, many CSS preprocessors such as LESS already exist, with implementations for the JVM. It may make sense to build on one.)

Scalable Vector Graphics

SVG fits well into the Real Semantics vision in that it provides a method for defining vector graphics. Efficient representations of vector graphics make it possible to scale to different screen sizes and pixel densities and can also reduce download times. Although it seems far more common for people to represent charts (time series, bar chart, etc.) and graphs (nodes connected via edges) using either a raster graphics format or a Javascript library, SVG is a practical way to render many algorithmically generated graphics. SVG supports less interactivity than Javascript does, but this may be a benefit -- people frequently fail to use interactive graphs properly, and a better understanding can be provided if graphs are curated.

An advantage of SVG over other alternatives is that, embedded in the HTML document, it is easy to do all the processing to create a document in one place. For instance, to generate a raster image file (JPG or PNG) you need to either create another file (if you are producing a set of static HTML documents) or accept another HTTP request (if you are serving HTML dynamically.) In the first case you need to create a unique filename for every graphic, in the second case, the data required to generate the graphic has to be made available when processing the second request. In either case, more moving parts are required, whereas inline SVG rendering is little different from rendering data in an HTML table.
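To show how little is involved, here is a sketch that renders a list of numbers as an inline SVG bar chart using Python's standard xml.etree.ElementTree. The function name and its layout parameters are hypothetical, and it assumes a non-empty list of non-negative values.

```python
import xml.etree.ElementTree as ET


def bar_chart_svg(values, bar_width=20, gap=5, height=100):
    """Return an <svg> element, as a string, with one <rect> per value.

    Assumes `values` is non-empty and non-negative; the tallest bar fills the height.
    """
    peak = max(values) or 1  # avoid dividing by zero when every value is 0
    width = len(values) * (bar_width + gap) - gap
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    for index, value in enumerate(values):
        bar_height = round(height * value / peak)
        ET.SubElement(svg, "rect", {
            "x": str(index * (bar_width + gap)),
            "y": str(height - bar_height),  # SVG's y axis points down
            "width": str(bar_width),
            "height": str(bar_height),
        })
    return ET.tostring(svg, encoding="unicode")
```

The returned string can be placed directly in the body of an HTML 5 document, so no extra file name or second HTTP request is needed.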

(And the good news is that SVG is supported in all major browsers!)

Javascript

Unlike the other languages embedded in HTML 5, Javascript is a Turing-complete programming language which doesn't fit entirely in the "document" paradigm. Although some single-page applications look like a document collection (and can be accessed like a document collection using tools like HTMLUnit), other single-page applications, such as Cookie Clicker, really don't have anything to do with documents at all.

Javascript fits in well with the document paradigm when it is used to add new features to HTML (often called a polyfill) or to render alternative document formats (such as PDF.js to render PDF or MathJax to render TeX, LaTeX or MathML). Effectively this lets us make HTML into the publishing language we want, but it adds additional complexity:

Javascript has the same global namespace problem as HTML and CSS: if two different scripts use the same global variables they can conflict, preventing the composition of arbitrary scripts in a single HTML document
If we are renaming id and class values to compose HTML documents, Javascript that refers to those identifiers and classes must be rewritten to use the new classes.
It's difficult (and impossible in general) to process Turing-complete programs in alternative ways. Although special-purpose solutions might solve particular problems, the ability to extract semantics from Javascript will always be limited.

Considering these factors, Javascript is something that Real Semantics will always hold at arm's length. If necessary, it may be able to crawl data from single-page applications, and it is likely to incorporate Javascript in pages that it creates for web browsers. In the middle of the application, where arbitrarily complex transformations may take place, it will work with a combination of HTML 5, CSS and SVG and avoid Javascript.

An exception

Recently search engines such as Google and Bing have advocated the use of JSON-LD to add semantic markup to HTML documents and this is something straightforward to embrace. This is done by writing something like:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Person",
  "name": "John Doe",
  "jobTitle": "Graduate research assistant",
  "affiliation": "University of Dreams",
  "additionalName": "Johnny",
  "url": "http://www.example.com",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "1234 Peach Drive",
    "addressLocality": "Wonderland",
    "addressRegion": "Georgia"
  }
}
</script>

The key thing is that the type attribute of the script element is set to application/ld+json. This MIME type is currently ignored by conventional web browsers, but it is interpreted by both web crawlers and Gmail. This is straightforward for Real Semantics to interpret, using the JSON parser built into the Jena framework -- and this mechanism may be used in the future to configure objects embedded inside HTML documents. JSON-LD, as a form of JSON, is a subset of Javascript which is not Turing-complete and thus avoids many of the difficulties involved in processing general Javascript.
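A minimal sketch of pulling such blocks out of a page follows. It uses Python's standard html.parser and json modules in place of the Jena parser mentioned above, the names are hypothetical, and it only parses the JSON: it makes no attempt to resolve the @context as a full JSON-LD processor would.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # one parsed JSON value per matching script element
        self._buffer = None  # list of text chunks while inside a matching script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None


def extract_json_ld(html):
    """Return a list with the JSON-LD data embedded in `html`, in document order."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.blocks
```

Ordinary script elements are skipped, so no Javascript is ever interpreted. Applied to the example above, extract_json_ld returns a one-element list whose entry has "Person" under the "@type" key.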

MathML

Although the previous three languages are supported in all major web browsers, MathML is not. That's unfortunate, because mathematics is an important part of technical documentation. Although MathML is supported in Firefox, Safari and Opera, it is not supported in Google's Chrome or Microsoft's Internet Explorer, which are the #1 and #2 desktop web browsers respectively.

The folks at Google have articulated reasons for dropping MathML; primarily the issue is that MathML is awkward to integrate in their rendering engine, and in an age where performance and security are paramount, it's a distraction from other goals. Given that MathML hasn't reached a critical mass of users, it seems that a $527 billion company that could spare less than one full-time engineer to work on the commonly used SVG standard can't afford to support MathML. (From my own viewpoint it's a bit bothersome that MathML includes two different languages, one designed to represent the visual appearance of math, the other to represent mathematical expressions in such a way that they can be processed by computers. Since the drafters were unable to choose one or the other, this duplication is an example of the kind of surplus complexity in computing systems that Real Semantics was built to counteract.)

As much as a standard for math rendering would be attractive, MathML is not widely supported, so MathML development is not a priority for Real Semantics. It is possible to render MathML in mainstream browsers through the use of MathJax and other Javascript-based renderers, thus Real Semantics may incorporate MathML or some other math rendering language if and when it moves into applications where mathematics is a priority.