Paul Houle earned a PhD in theoretical physics in 1998 at Cornell University, pioneering a new way to do quantum mechanics. Up until then, quantum mechanical calculations were done in either position space or momentum space, with results that were confusing, unnatural, or ambiguous when linking classical mechanics to quantum mechanics. In his PhD thesis, he showed that calculations could be done in a manner that treats position and momentum equally; in subsequent work, his thesis advisor's research group used this approach to crack difficult problems in the science of "frustrated" systems, which lack a single ground state.
I was fortunate to get a job in the summer of 1995 programming Java applets while Java was still in beta, developing educational applications on topics such as ferromagnetism and stress-driven crack propagation. In my thesis work, this progressed to simulating both classical and quantum dynamics, diagonalizing large matrices, and measuring the volumes of shapes in high-dimensional spaces. At the time I was a Linux enthusiast, proud to rush home and compile the latest kernel myself every week.
After a postdoc in Germany, my wife and I had an opportunity to buy a remarkable rural property close to Ithaca and close to her large extended family. We live there now, and my wife teaches children to ride horses on the property. In the years since, I've worked on a wide range of software projects for both a local and an international clientele.
In 2001, I worked with an international team to create Tivejo.com, a voice chat service for the Portuguese-speaking market (especially Brazil) that gathered more than 400,000 users and ended with a successful exit. (Coming on the heels of the .com crash, we may have been one of the first "Lean Startups.") Around that time I wrote a number of book chapters for Wrox Press, including one in what might have been the first comprehensible book about XML (complete with a picture of my cat on the cover).
I then worked for several years at the Cornell University Library, where I was the webmaster for more than 80 web sites, including the popular arxiv.org preprint server. Around that time I got into a habit that seems quite unusual for programmers: "going native" and deeply studying the subject matter associated with the projects I work on -- thus I picked up a great deal of knowledge about metadata from the viewpoint of library science. I worked on a multitude of projects, but I'm particularly proud of the Global Performing Arts Database, which, to this day, has one of the most sophisticated metadata models of any academic archive. When I got involved, GloPAD was in its third generation, based on a PostgreSQL database. With support for a large number of languages in diverse character sets and thousands of controlled vocabulary terms, it would have been a good fit for RDF, but at the time we made do with a semantic layer built on top of SQL.
My time at the library was marked by interdisciplinary collaboration: in addition to organizing campus-wide seminars on topics such as accessible web design, I played an active role in making valuable data sets available to algorithm and machine learning researchers, often negotiating the politics of data governance in a highly privacy-sensitive environment. The result was a number of student projects and research papers in text classification, time series analysis, and Thorsten Joachims's learning search engine STRIVER, concepts from which have improved Google's search results. With so many high-traffic applications, I became skilled at tracking down problems ranging from kernel bugs to misbehaving users to obscure software bugs. At the library we were heavily dependent on vendors such as Sun Microsystems, Oracle, PaperThin, and Cornell's central IT and, under the wing of sysadmin Surinder Ganges, I learned the fine art of getting the service we paid for from our vendors, a skill I find useful today with Amazon Web Services.
Funding became scarce at the library, and next I worked for two consulting companies in the Ithaca area that pushed me outside my comfort zone with many commercially popular (and not-so-popular) technologies such as ColdFusion, Microsoft IIS, FileMaker, and Tango. For the most part I whipped a number of partially completed projects into shape and got them in front of customers, coming to see that correct problem modelling was the key to success for "run-of-the-mill" business problems.
After a year, I joined the Ithaca branch of a company based in Rochester that provided geospatial decision support products and services. I got involved midstream in the development of Proalign Web, a product aimed at companies that have thousands of sales representatives organized into territories. Such a company has tens or hundreds of regional managers, who use Proalign Web to collaboratively tailor territories to match customer demand to sales and service capacity. The product was based on Microsoft Silverlight, IIS, and SQL Server. Shortly after I joined the project, Microsoft made a big change: instead of being able to query the server and get an immediate reply, all communication became asynchronous, as it is in JavaScript. It is difficult, if not impossible, to choreograph complex communication with the server in an asynchronous application unless you have a plan for how information flows through the application, and I rescued the project by creating such a plan.
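The gist of such a plan can be sketched in a few lines. The example below is a minimal sketch in Python's asyncio rather than the original Silverlight/C#, and every name in it is hypothetical: the point is simply that all requests and replies pass through a single dispatcher that owns the application state, so asynchronous responses are applied in a predictable order instead of racing one another.

```python
# Minimal sketch (Python asyncio standing in for Silverlight/C#): route every
# server call through a single dispatcher that owns application state, so
# asynchronous replies are applied in a well-defined order.
import asyncio

class TerritoryClient:
    """Hypothetical client; names are illustrative, not from the real product."""

    def __init__(self):
        self.state = {}                  # the one authoritative copy of UI state
        self.queue = asyncio.Queue()     # all requests flow through here

    async def fetch(self, territory_id):
        # simulate an asynchronous server round trip
        await asyncio.sleep(0.1)
        return {"territory": territory_id, "reps": 12}

    async def dispatcher(self):
        # single consumer: replies are applied to state in request order
        while True:
            territory_id, done = await self.queue.get()
            reply = await self.fetch(territory_id)
            self.state[territory_id] = reply
            done.set_result(reply)

    async def request(self, territory_id):
        done = asyncio.get_running_loop().create_future()
        await self.queue.put((territory_id, done))
        return await done

async def main():
    client = TerritoryClient()
    dispatcher_task = asyncio.create_task(client.dispatcher())
    print(await client.request("NE-14"))

asyncio.run(main())
```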
At Mapping Analytics I worked with a strong team to deliver, display, and run computations on geospatial data in the client application within the constrained environment of the web. Other projects included a touchscreen survey application and an intelligent application, developed together with a data analyst, that integrated with Salesforce.com to automatically assign incoming prospects to salespeople for a company with more than 15 million prospects in the automotive service space.
After the 2008 financial crisis, Mapping Analytics took advantage of a New York State program to furlough us on a four-day workweek, which left me time for moonlighting. I got interested in creating web sites such as carpictures.cc and ny-pictures.com that collected Creative Commons photographs and made money from advertising. Although I'd been exposed to the semantic web in my library work, and had also worked on a semantic graph editor built as a rich web application, my work on photo sites got me serious about the semantic web because I could use databases like DBpedia and Freebase to make lists of topics, search for images on Flickr, and use Amazon's Mechanical Turk to confirm the identity of the subjects in the photographs.
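As an illustration of that workflow, the sketch below pulls a list of topics from DBpedia's public SPARQL endpoint using the SPARQLWrapper library. The endpoint URL is the real public one, but the choice of class (dbo:Automobile) and the shape of the query are assumptions made for illustration; the real pipelines were considerably more involved.

```python
# Hedged sketch: build a topic list (here, automobiles) from DBpedia's public
# SPARQL endpoint.  Requires the SPARQLWrapper package; the query shape is
# illustrative, not the production pipeline.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?car ?label WHERE {
        ?car a dbo:Automobile ;
             rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 25
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# each label could then seed a Flickr search and a Mechanical Turk review task
for row in results["results"]["bindings"]:
    print(row["car"]["value"], "->", row["label"]["value"])
```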
As my empire of sites grew, I attracted attention from companies on the West Coast and got an offer to work remotely for one called Xen (not the virtualization company) that was based in a house in the Hollywood Hills of Los Angeles. Xen was working on an intelligent social media aggregator that would give users a great deal of control over their interest profiles as shown to advertisers, partners, and other users. For about a year I would spend roughly a week each month in L.A. and work from home the rest of the time. We were particularly interested in extracting topics such as movies, actors, and bands from people's social media streams, so I spent time developing a knowledge base and classification of popular topics as well as evaluating the state of the art in extracting topics from web text.
As that project wound down, I took a close look at Freebase, which was built on a proprietary database, and in a few months of work figured out how to convert its proprietary data dump into standard RDF that could be loaded into any industry-standard triple store. This marked my transition away from consuming RDF data with nonstandard tools toward projects centered fundamentally on RDF. Opportunity knocked, however, when I got a job offer from another company in Rochester, this one with an advanced search engine based on "deep learning" technology.
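The Freebase conversion mentioned above amounted to a long series of mappings from Freebase identifiers and property paths to RDF terms. The fragment below is a deliberately simplified sketch of that kind of mapping: the assumed tab-separated input format, the namespace URI, and the helper names are all illustrative, and the real conversion had to handle datatypes, language tags, compound value types, escaping, and many other cases.

```python
# Much-simplified sketch of mapping a Freebase-style quad dump line to an
# N-Triples line.  Input format and namespace are assumptions for illustration.
FB = "http://rdf.example.org/freebase/"   # hypothetical namespace

def mid_to_uri(mid):
    # "/m/02mjmr" -> "<http://rdf.example.org/freebase/m.02mjmr>"
    return "<%s%s>" % (FB, mid.strip("/").replace("/", "."))

def quad_to_ntriple(line):
    subject, predicate, obj = line.rstrip("\n").split("\t")[:3]
    s = mid_to_uri(subject)
    p = mid_to_uri(predicate)
    if obj.startswith("/"):               # object is another graph node
        o = mid_to_uri(obj)
    else:                                 # object is a literal value
        o = '"%s"' % obj.replace('"', '\\"')
    return "%s %s %s ." % (s, p, o)

print(quad_to_ntriple("/m/02mjmr\t/type/object/name\tBarack Obama"))
```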
Although it was a diversion from my emphasis on fact-based knowledge bases, my work at TextWise got me involved at the forefront of machine learning and, coincidentally, software engineering. TextWise had merged with IP.com and was developing a search engine for patents, a particularly good application for a state-of-the-art search engine. In many cases, search engine users don't have the patience to look through more than a handful of results, limiting the potential of a search engine to expose the value in a document collection. Patent searchers, however, are exhaustively looking for prior art, and that is where the ability of a neural network to identify the "gist" of a document is invaluable.
When I got involved in the project, the team had made great progress in bringing academic breakthroughs into a powerful search engine, but there was still a large amount of work to do to get the project ready for customers. From figuring out how to build the software reliably, debugging C++ and Java code simultaneously, and running TREC evaluations on the engine to managing management's expectations and getting a large number of components working together, I finished a state-of-the-art product and got it in front of customers -- a search engine so good, and so much of an improvement over the competition, that the U.S. Patent and Trademark Office called us days after the new search engine went live to buy a license.
In the big picture, however, my passion is for systems based on rules and facts, and it seemed to me that neural networks were becoming an overcrowded field. As early adopters at TextWise, for instance, we worked hard to optimize the error-prone C++ code for training neural networks. Today, however, that work doesn't contribute to a unique selling proposition when vendors such as Google and NVIDIA are coming out with products like TensorFlow and cuDNN. After speaking with many data professionals who told me they spent 80% or more of their time cleaning data, and witnessing the difficulties that data experts have collaborating with software engineers to create products, I devoted myself to identifying and filling the gaps organizations face in coupling conventional software to new "data-rich" technologies.
To that end, I began working with partners to understand the data analysis issues, performing a gap analysis around management practices, programming techniques, data modeling, machine learning, subject matter experts, and business rules. I took advantage of my proximity to New York City to become familiar with the financial services sector, which faces fierce competition in a changing and uncertain regulatory environment. During this time I developed connections in the industry, intensely studied reference data and financial derivatives, and mastered the art of running both large batch jobs and web services on Amazon Web Services, as well as packaging and delivering products for sale in the AWS Marketplace.
By early 2016 this work came full circle with the development of a comprehensive theory of how to get data in and out of legacy systems in RDF format. Although RDF standards such as RDFS, OWL, SHACL, and SPARQL have many strengths, none of them address the semantic "gaps" between legacy data (e.g., relational, XML, JSON, vCard, HL7, or EDI) and a flexible graph representation that lets us look at data from multiple sources side by side with a single set of tools. That was the beginning of Real Semantics, the Ontology2 framework, and of RDF/K, the Ontology2 extension to RDF. My experience, tools, and network of partners are available now to add to your top line by satisfying customers, connecting you to opportunity, dodging risks, and complying with regulation.
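To make the semantic "gap" described above concrete, here is a minimal sketch using the rdflib library: two legacy records, a relational-style row and a JSON document, are mapped into a single RDF graph so they can be examined side by side. The namespace, the property names, and the sameEntityAs link are purely illustrative assumptions and are not part of RDF/K or Real Semantics.

```python
# Minimal sketch of bridging legacy data into one RDF graph with rdflib.
# All URIs and property names are hypothetical, chosen for illustration.
from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.com/schema/")
g = Graph()

# record from a relational system
row = {"customer_id": 17, "name": "Acme Corp"}
crm = URIRef("http://example.com/crm/customer/%d" % row["customer_id"])
g.add((crm, EX.name, Literal(row["name"])))

# record from a JSON API describing (we assume) the same entity
doc = {"id": "acme-corp", "legalName": "Acme Corporation"}
api = URIRef("http://example.com/api/org/%s" % doc["id"])
g.add((api, EX.legalName, Literal(doc["legalName"])))

# the semantic "glue" a framework has to supply
g.add((crm, EX.sameEntityAs, api))

for s, p, o in g:
    print(s, p, o)
```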
Subscribe to my weekly newsletter that uncovers the art and science of intelligent applications.