Handling Complex Error Messages

Spring Error Messages

Here is a typical example of what software developers and operations people deal with all day.

The Spring Framework is one of the most popular Java frameworks in existence. It solves the problem of configuration for software projects: any given piece of software has a number of things that need to be configured, such as where to find files and how to connect to a database. It's important, for security purposes, that this information not be baked into the source code. There is a practical reason as well: software developers working on the code need to connect to a database on their own computer, and when the software is deployed to a production server, it needs to connect to a database on the production server.

Spring is intelligent in a number of ways. In particular, Spring starts with a model of the application it is embedded in and, based on that model, decides the order in which parts need to be initialized; when the application shuts down, it automatically shuts things down in the right order. As applications get larger, it is maddeningly hard to get things like that right. Because Spring assembles your application by configuring Java objects directly, you get a powerful configuration system for your application "out of the box", without writing any code yourself, work that would be time-consuming, error-prone, and idiosyncratic to every application you build.
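The ordering idea can be sketched in a few lines of Java. This is only an illustration of dependency-ordered startup and shutdown, not Spring's actual implementation; the class StartupOrder and the component names (webController, orderService, orderRepository, dataSource) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; this is not how Spring is implemented internally.
final class StartupOrder {

    // Returns component names so that every component appears after its dependencies.
    static List<String> initOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String name : dependsOn.keySet()) {
            visit(name, dependsOn, done, inProgress, order);
        }
        return order;
    }

    // Depth-first visit: initialize dependencies first, then the component itself.
    private static void visit(String name, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(name)) {
            return;
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("Circular dependency involving " + name);
        }
        for (String dependency : dependsOn.getOrDefault(name, Collections.emptyList())) {
            visit(dependency, dependsOn, done, inProgress, order);
        }
        inProgress.remove(name);
        done.add(name);
        order.add(name);
    }

    // Shutdown is simply the reverse of the initialization order.
    static List<String> shutdownOrder(Map<String, List<String>> dependsOn) {
        List<String> order = initOrder(dependsOn);
        Collections.reverse(order);
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("webController", Arrays.asList("orderService"));
        dependsOn.put("orderService", Arrays.asList("orderRepository"));
        dependsOn.put("orderRepository", Arrays.asList("dataSource"));
        dependsOn.put("dataSource", Collections.emptyList());
        System.out.println("init:     " + initOrder(dependsOn));
        System.out.println("shutdown: " + shutdownOrder(dependsOn));
    }
}
```

In this sketch the dataSource comes up first and goes down last, which is also why a broken database connection takes everything that depends on it down with it.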

On the left is a miniaturized screenshot of the log output when a Spring application failed to boot; we're showing just the first 1000 lines of output out of 4775, but the key thing is that the real error is described on lines 467 and 468, which are marked in red. The real error, it turns out, is that the application is trying to connect to a database server that doesn't exist, which may well be one of the most likely reasons why Spring applications fail to boot.

This error message is especially painful because starting an application that is not correctly configured for a database server is one of the most common things that can go wrong. It's likely to go wrong if a person has just downloaded the application and is trying to install it. It could go wrong at 3AM on a Saturday morning. It's obnoxious enough if it is seen by someone who is familiar with the architecture and code, but often it is seen by a newer developer, a tester, an ops person, or an end user, who has little idea of where to look for the 0.04% of this error message that is relevant. They're also likely to be seeing it when they are facing a deadline, juggling a number of tasks, or fielding complaints from someone who can't get work done because the server is down. This kind of error message causes stress, hurts the quality of life in IT, and adds to a feeling that technology is out of control.

This type of "needle in a haystack" situation is common in intelligent systems work. For instance, somewhere in the nearly 2 gigabyte Enron Email corpus there is the evidence of a major crime, but it is buried in the middle of notifications about company software games, failing servers, seminars on how to use Microsoft Outlook, and guys chatting about what they did back in the Army.

In the Enron case, there is the difficulty of understanding not just the text, but the business domain, including topics such as commodity trading. The case of this error message is simpler, because the long error message comes out of the very intelligence of the Spring framework. Spring is not primarily trying to create the database connection; it is trying to create some other object, which requires it to create some other object, which requires still other objects, one of which ultimately requires the database connection. Spring knows the complete series of events, including the downstream failures caused by the initial failure (think, for example, of a nuclear meltdown that is initiated by some simple event such as the failure of a valve or an emergency generator). The model of the application is there, and Java returns a detailed explanation of the failure in the form of a stack trace, but because Spring is incapable of looking at the application model and the stack trace together, it spews a 4775-line error message and lets the operator sort it out.
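To make the chain of failures concrete, here is a minimal Java sketch of walking a nested exception back to its root cause using the standard Throwable.getCause() method. The class RootCauseFinder, the bean names, and the host name in the example are hypothetical, and the exceptions are hand-built stand-ins for what a real failed startup would produce.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Spring or Real Semantics.
final class RootCauseFinder {

    private static final int MAX_DEPTH = 1000; // guard against pathological cause cycles

    // Follows getCause() links down to the innermost exception.
    static Throwable rootCause(Throwable failure) {
        Throwable current = failure;
        int depth = 0;
        while (current.getCause() != null && depth++ < MAX_DEPTH) {
            current = current.getCause();
        }
        return current;
    }

    // One line per exception in the chain, outermost first.
    static List<String> summarize(Throwable failure) {
        List<String> lines = new ArrayList<>();
        Throwable current = failure;
        int depth = 0;
        while (current != null && depth++ < MAX_DEPTH) {
            lines.add(current.getClass().getSimpleName() + ": " + current.getMessage());
            current = current.getCause();
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for a failed startup: each layer wraps the one below it.
        Throwable failure = new IllegalStateException("Error creating bean 'orderService'",
                new IllegalStateException("Error creating bean 'orderRepository'",
                        new ConnectException("Connection refused: db.example.invalid:5432")));
        for (String line : summarize(failure)) {
            System.out.println(line);
        }
        System.out.println("Root cause: " + rootCause(failure).getMessage());
    }
}
```

The innermost exception is usually the line the operator actually needs; everything wrapped around it describes the downstream failures.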

Not just Spring

It's not fair to single out Spring for this, since the same problem arises for many common ops and programming tasks.

The other day I was compiling a C program that uses GNU autoconf, a remarkable tool that inspects the environment it is running in, and automatically finds the libraries and header files to build an application. On Linux, the main effort involved is in adding any packages that are missing. In one case, I found the error message printed by the ./configure script for one missing package was cryptic, causing me to look in the log file, which contained, among other things, the C source code for a large number of programs that configure had tried, accompanied by compiler output. Once more, debugging the problem was a matter of finding a few relevant lines, then searching on the web to find out which package supplies the needed libraries. The ./configure script displays a complete trace of what it did, which aids in debugging, yet, in 2016, we are still bombarding users with entirely too much information.

How this problem affects Real Semantics™

Real Semantics is a framework for building data-rich applications, and that means it interacts with many kinds of applications in its work. In a typical case, it creates a cloud server and installs a number of software applications on that server, together with packaged data, and then configures the applications to work together. Real Semantics uses Spring internally and frequently installs applications that use Spring, as well as other configuration frameworks.

Real Semantics greatly simplifies the construction of complex systems, because, working from a model, it automates many of the decisions involved in setting up and configuring software. It works quickly and repeatably, but errors still happen.

When you greatly expand the speed and scale at which people can work, you encounter new problems. Back in 2000, assembling all of the JAR files for a complex Java application was a difficult manual task -- you weren't going to include any dependencies you didn't need. Now with tools such as Maven and Gradle, it's easy to create a project that has thousands of dependencies, but then you have a whole new problem, managing the complexity of and interactions between thousands of dependencies.

The same is true for error handling. Applications managed by Real Semantics can fail in the build process or thereafter, and understanding the failure requires knowledge about Real Semantics, the application, and the environment in which it runs. Fortunately, error handling is a first-class concern for us, and we provide multiple facilities to improve error handling.

Mitigation Strategies for complex errors:

Universal Data Model: Because Real Semantics uses a single data model to represent all information, information anywhere inside RS or the applications it builds can be brought together into one place and worked on with the same suite of tools. For example, it is difficult to reason about Exceptions with conventional programming tools because they have no mechanism for doing queries or reasoning against Exceptions. Once we convert Exceptions to RDF/K data, we have access to the same facilities we have for other kinds of data and we can easily join them with domain maps, as well as application data and metadata.
Automatic fault detection and recording: When something goes wrong with an application managed by Real Semantics, it automatically captures any log files that are likely to contain error messages and brings the evidence into one place. In seconds, it accomplishes what would take minutes if you found and logged into the relevant cloud servers.
Multiple-model architecture: One reason that error handling is difficult in conventional software is that error handling is a cross-cutting concern; conventional practices that make software easier to develop also make it harder to deliver good error messages. If a compiler, for instance, is neatly layered, a problem could occur several layers after parsing and it might be unclear which exact code caused the error. Similarly, the InputStream is a useful abstraction inside Java, because it could hold data stored in a file, in memory, transferred over the network, or compressed, but a function parsing an InputStream has no idea where it came from.

It is impossible to build a complete and consistent logical model of everything in the world, but it is very possible to build models that represent different viewpoints of situations out of theories and tropes, and it's practical to combine these models to cover a domain, eliminating the conceptual gaps that arise with conventional methods such as relational databases and object-oriented development.
Heuristic approach to fault analysis: Common problems in algorithms are objective, in the sense of there being a specific correct answer for a problem. For instance, if you add two integers, there is one correct answer. "Artificial Intelligence" problems, such as understanding a text, or making the best move in a chess game, are subjective in the sense that there isn't necessarily a correct answer.

Fault analysis is a subjective problem by its very nature, for quite a few reasons. For one thing, it is an undecidable problem in a computer science sense; no computer program can completely understand the behavior of an arbitrary computer program. Secondly, fixing a problem is a matter of understanding the intentions of the user, which are not completely stated. We understand, then, that an error diagnostic capability doesn't need to be perfect; it just needs to be useful and customizable to address the problems that actually occur in practice. In the example above, the root problem is a misconfiguration of a database connection, an error that happens all of the time.
Case management: In the big picture, solving problems is a team function. It's best if we can do it right the first time. It's good if a person who experiences a problem can fix it themselves, and it's even better when we can help them. However, some problems need to get referred to a higher authority, which could be somebody technical, a subject matter expert, somebody oriented around the business, or some combination of these people. No matter what, Real Semantics closes the quality loop, and makes it possible for intelligent systems to function "off the leash" by incorporating errors and other feedback into a closed-loop management system.
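
The heuristic approach described above can be illustrated with a small rule table in Java. This is a sketch under stated assumptions, not how Real Semantics is implemented: the class LogTriage, its Rule and Finding types, and the advice strings are all hypothetical, and the patterns are just examples of message fragments that commonly accompany database connection failures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of heuristic fault analysis; not the Real Semantics implementation.
final class LogTriage {

    // A rule pairs a pattern seen in practice with advice for the operator.
    static final class Rule {
        final Pattern pattern;
        final String advice;

        Rule(String regex, String advice) {
            this.pattern = Pattern.compile(regex);
            this.advice = advice;
        }
    }

    // A matched log line, with its 1-based line number and the advice that applies.
    static final class Finding {
        final int lineNumber;
        final String line;
        final String advice;

        Finding(int lineNumber, String line, String advice) {
            this.lineNumber = lineNumber;
            this.line = line;
            this.advice = advice;
        }
    }

    // Scans every line; the first rule that matches a line wins for that line.
    static List<Finding> diagnose(List<String> logLines, List<Rule> rules) {
        List<Finding> findings = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            String line = logLines.get(i);
            for (Rule rule : rules) {
                if (rule.pattern.matcher(line).find()) {
                    findings.add(new Finding(i + 1, line, rule.advice));
                    break;
                }
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Illustrative rules only; a real rule set would grow from problems seen in practice.
        List<Rule> rules = Arrays.asList(
                new Rule("Connection refused|UnknownHostException",
                        "Check the database host and port in the configuration."),
                new Rule("Access denied for user",
                        "Check the database user name and password."));
        List<String> log = Arrays.asList(
                "INFO  Starting application",
                "ERROR Error creating bean 'orderService'",
                "Caused by: java.net.ConnectException: Connection refused");
        for (Finding f : diagnose(log, rules)) {
            System.out.println("line " + f.lineNumber + ": " + f.advice);
        }
    }
}
```

A rule table like this is deliberately imperfect. It only needs to catch the failures that actually recur, and anything it misses can be handed to a person through case management and later turned into a new rule.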