Call it data warehousing or call it data science: after decades, data projects still spend 80% of their effort on data preparation and cleaning, and no advance in the remaining 20% can have a real impact on the economics of data management. Enterprises run on message-passing architectures centered on document exchange, and they are rapidly filling data lakes with these documents alongside Word, PDF and other human-readable formats. Thousands of data sources exist within a single enterprise, and interactions with other enterprises are mediated by data pools such as the LEI and BBGID systems. For the sake of development agility and operational sanity, we must industrialize the process of data matching and conversion.
Our technology converts common formats such as XML and JSON into RDF graphs, with or without an existing schema. Business rules, automatically generated and manually refined, reshape the data into a canonical model. Additional rules and models can examine terms that appear in a document, such as "Germany" or "PL", and link them to well-defined concepts and entities such as businesses, customers, accounts and currencies.
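As a minimal sketch of the lifting step described above, the following stand-alone example turns a flat JSON document into RDF N-Triples, one triple per key. The namespace, predicate names and document identifier are illustrative assumptions, not the product's actual vocabulary:

```python
# Sketch: lift a flat JSON document into RDF N-Triples.
# The EX namespace and "Document" class are hypothetical examples.
import json

EX = "http://example.org/canonical#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def json_to_ntriples(doc: dict, subject_id: str) -> list[str]:
    """Emit one triple per JSON key, plus an rdf:type assertion."""
    s = f"<{EX}{subject_id}>"
    triples = [f"{s} <{RDF_TYPE}> <{EX}Document> ."]
    for key, value in doc.items():
        triples.append(f'{s} <{EX}{key}> "{value}" .')
    return triples

doc = json.loads('{"country": "Germany", "currencyCode": "PL"}')
for t in json_to_ntriples(doc, "msg-001"):
    print(t)
```

In practice such graphs would then be reshaped by the business rules into the canonical model; linking literal values such as "Germany" to entity URIs is a separate rule-driven step.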
The system works together with humans in a continuous process of improving the knowledge base: rule-based constraints can immediately fix critical errors, while statistical methods focus human attention on the areas where help is most needed. Business needs are handled through an integrated case management system.
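The two mechanisms above can be sketched as follows, under stated assumptions: the rule table, field names and the `match_confidence` score are invented for illustration and do not reflect the product's internal representation:

```python
# Hypothetical sketch of the two review mechanisms:
# deterministic constraints auto-fix known critical errors, and a
# confidence threshold routes uncertain records to a human case queue.

CURRENCY_FIXES = {"PL": "PLN", "GER": "EUR"}  # illustrative rule table

def apply_constraints(record: dict) -> dict:
    """Rule-based constraint: immediately repair a known critical error."""
    code = record.get("currency")
    if code in CURRENCY_FIXES:
        record = {**record, "currency": CURRENCY_FIXES[code]}
    return record

def needs_review(record: dict, threshold: float = 0.8) -> bool:
    """Statistical routing: low-confidence matches go to a human."""
    return record.get("match_confidence", 0.0) < threshold

records = [
    {"currency": "PL", "match_confidence": 0.95},
    {"currency": "EUR", "match_confidence": 0.40},
]
cleaned = [apply_constraints(r) for r in records]
queue = [r for r in cleaned if needs_review(r)]
print(cleaned[0]["currency"], len(queue))  # PLN 1
```

The design point is the split itself: deterministic rules act without human latency, while everything below the confidence threshold becomes a case for the integrated case management system.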
We see no fundamental difference between extracting facts from XML files and extracting them from full text, so our product contains a text-analysis framework, complete with training facilities, that supports both rule-driven and machine-learning approaches.
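To make the rule-driven side of text extraction concrete, here is a toy pattern-based extractor; the pattern and entity list are made up for illustration and are far simpler than a real trained framework:

```python
# Toy rule-driven fact extraction: a hand-written pattern pulls
# country mentions out of free text, the simplest possible analogue
# of extracting the same fact from a structured XML field.
import re

COUNTRY_PATTERN = re.compile(r"\b(Germany|Poland|France)\b")

def extract_countries(text: str) -> list[str]:
    """Return every matched country mention, in document order."""
    return COUNTRY_PATTERN.findall(text)

print(extract_countries("The issuer is domiciled in Germany."))
```

A machine-learning extractor would replace the hand-written pattern with a trained model, but its output feeds the same downstream linking and canonicalization pipeline.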