During the 2008 Financial Crisis, the industry and regulators were in a state of fear and ignorance about the total amount of money owed by derivative markets participants, an amount, on paper, that could have been greater than the U.S. GDP. As a result, the G20 began the process to create a global Legal Entity Identifier which is overseen by the Global Legal Entity Identifier Foundation.

Ontology2 has long been a leader in the commercialization of large-scale semantic databases such as Dbpedia and Freebase. Replacing imprecise names and addresses with specific identifiers is what we do, so we took an interest in the LEI system and started in August 2014. Our LEI site is what we call an interpretive browser in that it combines a raw information source with additional information, rules and machine learning models to supplement that data with expert commentary.

The is a product of the Real Semantics framework, both in the sense that it incorporates parts of Real Semantics and in the sense that the Real Semantics framework assembles the software components and data that make up the site. This article, which is part of the Real Semantics documentation, explains the unique attributes of the the site the relationship it has with Real Semantics. It is aimed at a wide audience, which may or may not be knowledgable about reference data, the financial industry, semantic technology or cloud computing, so we will start with the business case and explain how we apply technology to it.

About Legal Entity Identifiers

A Global Perspective

At this time, the LEI system is driven primarily by the requirements of government regulators. That's a big topic because there are a large number of government regulators in many different countries and jurisdictions. Many of the challenges faced by the system are rooted in the fact that it is international and has to reflect very different conditions in different parts of the world.

For instance, in the United States, there are a large number of federal financial regulators that have different domains. The Securities and Exchange Commission regulates stocks and bonds in the US, while the CFTC regulates options and futures. The FDIC, Federal Reserve Board and Office of the Comptroller of the Currency play a role in the federal regulation of banks, together with the National Credit Union Administration regulating credit unions, as well as the Consumer Financial Protection Bureau and the Federal Financial Institutions Examination Council.

So far as banks and insurance companies are concerned, all 50 states charter and regulate financial institutions as well. All businesses in the U.S. are registered in a state or territory, and the rules of that registration are quite different in different states. For instance, if you form a corporation in New York you need to specify a board of directors (which is a matter of public record) and have an annual meeting which is documented with the state. In nearby Delaware, a single person can register a corporation and nothing about the people involved is in the public record.

Now the U.S. is a large country where states have an unusual amount of autonomy, but you find large variations in regulatory regimes in various countries, particularly when transparency and privacy are concerned. In the UK, for instance, businesses are regulated with Companies House; which makes a large amount of information available on their web site and via an API. Other countries, like the Cayman Islands and Luxembourg, value privacy more than transparency, and publish very limited information, like the state of Delaware. Such "corporation havens" are chosen by registrants on a global free market, but countries that have a different view of the relationship between states and business such as China and Saudi Arabia which see the activities of state-owned enterprises as state secrets.

The result of all this is that the LEI system is a compromise between a number of interests, and largely reflects a "lowest common denominator" of what is available. Relationships between entities (fund/administrator, subsidiary/parent, or branch/master) are an important requirement for risk data aggregation, but these are very much a work of progress because of cross-jurisdictional issues.

Regulatory applications of the LEI

CFTC Swap reporting

The first application of the LEI for regulation was for the CFTC swap reports, which led to the requirement that organizations that trade OTC derivatives in the US register for LEIs. Swaps are financial tools used for many purposes, such as converting a variable interest rate into a fixed interest rate. A major category of swap is the credit default swap, which can be thought of a form of insurance you can buy on a bond. When Lehman Brothers collapsed in 2008, more than $400 billion worth of swap contracts existed on the company, an amount greatly larger than Lehman's debt.

Things turned out to be less dire than expected, however, because most swap market participants held positions that canceled. Swaps are a bit different from stocks and bonds, because you can't just "buy" a swap and them "sell" it the way you would with a stock when you don't like the price. Instead, dealers and traders enter into an position in the opposite direction which cancels out the original position. Once positions were netted out, the amount that changed hands was a relatively modest $7.2 billion. Most swap dealers did a good job of managing risk in their swap books, thus protecting the system, but AIG notoriously sold CDS derivatives without hedging the risk, leading to a $180 billion bailout.

It took quite a while to figure all this out, leading to a large amount of fear and paralysis on the part of the industry, and the CFTC swap reports were introduced to provide clarity in case of a repeat. The idea here is that transactions made at swap execution facility are submitted one of several swap data repositories (SDR.) The data sent to the SDRs uses the LEI to identify the parties to the transaction as well as other entities involved (such as the issuer of bonds insured by a CDS.) SDRs in turn send summary data to the CFTC.The summary data is of limited use because it is not netted (for instance, two parties trading with each other could create a large visible volume without incurring any liability) but in the event of an emergency, regulators can "break the glass" to look at individual trades and positions to get a global view.


Today, Europe is taking the lead in the adoption of the LEI. EMIR targeted over-the-counter derivatives and, among other things, developed a swap reporting system similar to that in the U.S. MiFID 2 addresses exchange traded securities and public corporations; it represents a huge advance in reporting standards, introducing both the LEI as a standard identifier and the creation of machine-readable XBRL reports as pionereed in the U.S.

Other applications

These will take time to develop, but in most cases where legal entities interact with the government, the LEI could be a useful identifier. For instance, businesses that contract to provide goods and services with the U.S. government are required to get a proprietary D-U-N-S number. Federal agencies don't have access to the full D-U-N-S database, so this gets in the way of doing analytics about procurement. It has been proposed that the LEI be used as a replacement for D-U-N-S.

A while ago I worked on a search engine for patents and got an insider point of view of the difficulties of that domain. Patents are issued to individuals but are frequently assigned to corporations; if you look at the "asignee" field, however, you find that this field is completely uncontrolled. Our data science team found that there were more than 1000 ways that people wrote "IBM". We were able to match these records with considerable effort, but certainly the results were imperfect. It has been proposed, similarly, that the LEI be used to identify the assignees of patents.

An industry perspective

To fill out the story, the financial industry has it's own motivations to get better control of reference data. The folks at Financial InterGroup call this the "Barcodes of Finance".

The key idea is that behind every trade and customer facing interaction (front office), a bunch of work has to be done to document and finalize the transaction (back office). Historically, and often today, back office work involves manually filled forms with imprecise, natural language names and identifiers. It is necessary to track legal entities, financial products, and trades, and with a sufficiently accurate and fast system shared by all participants, straight-through processing becomes possible.

Straight-through processing has vast benefits, including reduced cost, greater speed and a reduction of clerical errors. The long time involved in international transfers is itself a source of risk. As of mid-2016, there is a large interest in using blockchains to automate payment and trading systems, but whatever the underyling technology is, correct and precise identification of who is participating and what they are trading is essential to any improvement or reform. So long as the LEI as seen as an imposition that comes from regulators, the quality of data is always going to be the least that people can get away with -- thus, meeting industry needs is the direct path to satisfying the needs of regulators. Over the long term we look towards improvements in the LEI system that enable an increasing number of private and public applications.

The architecture of the Legal Entity Identifier System

The architecture of the LEI system reflects the political structure, in that businesses register with one of approximately 30 Local Operating Units (LOU). Every day, each LOU publishes a new data file that lists the entities registered with them, and the GLEIF concatenates these into an official file. Every day, we download this file and rebuild the site. Because the GLEIF permanently retains each data file, we can do a bitemporal analysis that reveals both changes in the data over time (such as companies transitioning between different states) and transient errors that sneak into a daily file and that are quickly fixed.

The GLEIF checks that data from the LOUs validates against an XML schema but does little beyond that to check the validity, correctness, or completeness of the data. Current official data quality reports data quality reports report that LEI data has a high rate of conformance (upwards of 98%) at what they call "maturity level one" but do not address deeper issues. (For instance, more than 20% of LEI records are in the LAPSED state because they have not been renewed by their registrants.)

The demonstrates how quality improvement can be done from outside the system. Free from politics, we can call attention to issues that the GLEIF cannot, such as companies that are registered in "corporation hotels" such as the notorious 1209 Orange; a practice that degrades the meaningulness of corporate registration even if it is legal and allowed. Our processes discover data quality problems, both in individual records and the adequate, and we pursue the LEI challenge process to fix defective records at the LOU.

LEIs versus other business databases

Legal Entity Identifier records seem to be a lot like other kinds of business databases, such as the National Provider Identifier or the D-U-N-S number. One difference, however, is that Legal Entity Identifiers are issued for traditional business organizations as well as government organizations (such as the City of Pasadena) as well as investment funds and trusts, such as mutual funds and pension funds.)

One aspect of that is that LEI records intersect poorly with other business listings. D-U-N-S records, for instance, cover more than 250 million business entities but do not list investment funds. This puts a cap on the percentage of LEI records that can be matched against D-U-N-S, so if you want to match a higher percentage of LEI records, you need to also match against a database of security identifiers such as CUSIP or BBGID. On the other hand, relatively few business organizations have an LEI at this time -- an analysis of more than 10,000 queries made against the site shows that many individuals try to look up their own organization and are often surprised to learn that (i) their business does not have an LEI, and (ii) that their business needs to register to get an LEI.

The site and Real Semantics

henson: Realizing the Power of Cloud Computing

Real Semantics includes a module called henson, after the famous puppeteer, which is in charge of creating infrastructure in the cloud. Many older generation semantic frameworks, such as Information Workbench addressed the understanding of IT infrastructure as it exists. With henson, Real Semantics can not only understand IT infrastructure, but it can create IT infrastructure. Unlike other big data frameworks which require a large investment in cluster hardware, Real Semantics can run on any ordinary workstation or server and provision practically unlimited resources in the cloud. The following diagram explains the process by which the site is built:

Practically, henson performs two functions with respect to the LEI site:

  1. Henson creates a machine image that contains the operating system as well as software that the LEI site depends on. This includes installing AWS client software, the Java Runtime Environment, and a number of minor software packages. When possible this is done by using the apt-get packaging system, but the system is also capable of downloading, compiling and installing open source and other software. This process is typically done once a week.
  2. Because a new LEI file is released every day, we build a new LEI server every day. Henson launches a new cloud server based on the previously created machine image, and compiles a bash script that runs when the server starts up. This bash script updates the Real Semantics software on the LEI server, and starts a mogrifier job that combines data from a number of sources with a set of software components to create the new site. After the site passes automated integration tests, the new site takes the place of the original server and the old server is destroyed.

This approach has a number of advantages over conventional approaches:

mogrifier: data-driven data transformation

Henson sets the stage on which the site is constructed by creating and configuring a server, but the rest of the construction work is done by the mogrifier, another component of Real Semantics.

A modern high-performance application typically contains more than one "database". For instance, although many relational databases and triple stores contain a full-text search engine, typically you get better relevance using a dedicated search engine such as Lucene. If you want autocompletion or typeahead search, you'd do best with a specialized index such as Cleo. Intelligent systems could contain a curated database of facts and rules or a machine learning model involving random forests or neural networks. No matter what, data-rich applications contain raw data, pre-computed results, and specialized indexes that power fast and accurate interactive queries.

The mogrifier starts with a map of the process that transforms raw data into the databases and indexes required to power a product. In the case of the site, this map looks like this:

There are quite a few products on the market which can construct a dataflow from a GUI map in this way, such as LabVIEW, Alteryx, and KNIME. Real Semantics does not include a visual editor at this time, but it has a few features that other products do not:

He were will show a sample of the job description for the LEI site so you can get some idea of what it looks like in the RDF Turtle language. This sample loads data from the LEI file supplied by the GLEIF, converts it to RDF, sends it to the triple store, and indexes a few fields in Lucene:

@prefix : <>
@prefix g: <>
@prefix e: <>
@prefix es: <>
@prefix lei: <>

:Mogrifier :scanPackage "com.ontology2.definancialization.artifact" .
[] a :LEIReaderArtifact ;
    :name e:reader .

[] a :DatasetArtifact ;
    :name e:tdb ;
    :path "tdb" .

[] a :GraphArtifact ;
    :name e:lei ;
    :containedIn e:tdb ;
    :graphURI g:lei ;
    :input e:reader .

[] a :LuceneArtifact ;
    :name e:fulltext ;
    :input e:reader ;
    :path "lucene/lei" ;
    :field [
        a :FieldDefinition ;
        :property lei:RegisteredName ;
        es:index "analyzed";
        :fieldName "name"
        a :FieldDefinition ;
        :property lei:RegisteredCountryCode ;
        :fieldName "country"
        a :FieldDefinition ;
        :property lei:LegalEntityIdentifier ;
        :fieldName "lei"
        a :FieldDefinition ;
        :property lei:LEIAssignmentDate ;
        :fieldName "assigned" ;
        :transform <java:com.ontology2.rdf.mogrifier.fn.ZuluDate>
    ] .

Rather than covering everything in detail, I'll point out a few major points:

  • The mogrifier is tightly integrated with the Java programming language and runtime environment:
  • Note that the job definition doesn't care about the order that artifacts are defined. Instead, it analyzes the linkages between artifacts made through fields such as :input to determine the order of dependencies, much like the famous make utility. Often the system can take advantage of parallelism between steps: for instance, it can use different threads for parsing, search indexation, and writing to the TDB database. It operates efficiently, parsing the file once, and sending the result to multiple places. Once the TDB database is complete, the reports and record matching can then be run at the same time.

    For maximum flexibility, the mogrifier is divided into several parts including:

    1. Artifact Factory: each Artifact consists of multiple Java objects. For instance, the LuceneArtifact produces one object that indexes RDF graphs when the database is being built, and another object embeded in the web site to support search. An Artifact can also create different implementation objects based on context, such as one designed for local execution and another designed for a scalable cluster. The artifact factory manages the creation and configuration of all the objects needed in the execution.
    2. Control Plane: once a job is compiled, an RDF graph and set of Jena Rules is built to control execution. The job is defined into a number of flows, and when the prerequisites for a flow are met, a rule fires to trigger execution of the flow. When the flow is complete, facts are inserted into the control graph which trigger more flows until execution is done. The control plane manages the initialization and teardown of all the artifacts so that everything is left in the correct state.
    3. Data Plane:The control plane does its work by sending commands to the data plane, which is interchangable. The default implementation of the data plane is based on the Reactor framework from Pivotal which is incredibly efficient for jobs that run on a single workstation or server and also supports flows that work on a streaming basis, as well as batch. The organization of the data plane is quite similar to Spark, Flume and other big data frameworks so there is a clear path to scaling for cluster execution.