IBM, Jeff Jonas and A Better Data Model
It now seems a long time ago that I discussed data models with Jeff Jonas at an IBM event in Portugal. I remember having a couple of conversations with him, because he had sparked my curiosity. In a presentation at the event, Jeff had proposed the radical idea that “a query is itself data”.
This proposition is not really up for debate. The fact that someone poses a query, or that some application generates one, is clearly information about an event of some kind, and the query itself may be highly relevant information about that event. A simple example: if I pose a query to a security system about an individual, it means that I needed to check their security credentials.
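To make the idea concrete, here is a minimal sketch of a lookup service that treats each query as data in its own right. Everything in it (the `SecurityDirectory` class, the `QueryEvent` record, the field names) is hypothetical and invented for illustration; it is not a description of how IBM's software works.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class QueryEvent:
    """A record of the fact that someone asked a question (hypothetical shape)."""
    asked_by: str
    subject: str
    asked_at: datetime


class SecurityDirectory:
    """Toy clearance lookup that stores every query alongside the data it serves."""

    def __init__(self, clearances: dict) -> None:
        self._clearances = dict(clearances)
        self.query_log: list = []

    def check_clearance(self, asked_by: str, subject: str) -> Optional[str]:
        # Record the query before answering it: the question is itself an event.
        self.query_log.append(
            QueryEvent(asked_by, subject, datetime.now(timezone.utc))
        )
        # Returns None when nothing is known about the subject.
        return self._clearances.get(subject)

    def who_asked_about(self, subject: str) -> list:
        # The stored queries can be queried like any other data.
        return [e.asked_by for e in self.query_log if e.subject == subject]


directory = SecurityDirectory({"alice": "secret"})
directory.check_clearance("guard-7", "alice")
directory.check_clearance("guard-7", "bob")   # no record, but the question is kept
```

Note that the second lookup returns nothing useful, yet the fact that someone asked about an unknown individual is retained and can be examined later.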
Jeff was the founder and chief scientist of Systems Research & Development (SRD) before IBM acquired the company and turned it into the IBM Entity Analytic Solutions (EAS) group. I’ve written two white papers on the technology and I’ve followed its evolution, precisely because Jeff (and the technology he is evolving) takes a different view of data than is standard.
For example, most people would expect that if two data records about the same thing vary, then one is probably inaccurate. This is certainly a possibility. However, another possibility is that neither is inaccurate; they simply belong to different contexts. As a simple example, Jane Smith marries John Davis and her name changes to Jane Davis. If one record shows the old name and another the new, there is no error, just a difference of context.
Even if data is in error, the fact of the error may be significant. This is certainly the case when criminals invent aliases. The data that bears the alias may be in error, but it is a deliberate error invented to hide rather than reveal an identity. IBM’s EAS software is designed to expose such information, and is used by casinos to try to detect fraudsters and by security organizations to try to identify terrorists and other bad guys.
The Error of the Update
Something that I’ve pointed out once or twice in various presentations I’ve given on the topic of data is that the IT industry made an error in its handling of data almost from the moment it was born. A completely false mechanism was introduced into software: the update. It is easy to understand why it was introduced. In the early days each byte of data cost money to store. The idea of storing older values of data items as their value changed was simply not affordable. But it is affordable now, and yet we continue to update data.
No data item should ever be updated. Its new value should be recorded and its old value retained. Then we have a full audit trail of everything. We have a history of values and even a knowledge of when they changed.
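As a rough sketch of what “record, don’t update” could look like in practice, the following keeps every value ever recorded for a key together with the time it was recorded. The `AppendOnlyStore` class and its method names are my own invention for illustration, and the dates are made up; a real system would also have to deal with persistence and concurrency.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Version:
    value: Any
    recorded_at: datetime


class AppendOnlyStore:
    """Keeps every value recorded for a key; nothing is ever overwritten."""

    def __init__(self) -> None:
        self._history: dict = {}

    def record(self, key: str, value: Any, at: Optional[datetime] = None) -> None:
        # A "change" is a new version appended to the history, not an update.
        at = at or datetime.now(timezone.utc)
        self._history.setdefault(key, []).append(Version(value, at))

    def current(self, key: str) -> Any:
        # The most recently recorded value.
        return self._history[key][-1].value

    def as_of(self, key: str, when: datetime) -> Any:
        # The last value that had been recorded at or before `when`.
        known = [v for v in self._history[key] if v.recorded_at <= when]
        if not known:
            raise KeyError(f"no value recorded for {key!r} by {when}")
        return known[-1].value

    def history(self, key: str) -> list:
        # The full audit trail, oldest first.
        return list(self._history[key])


store = AppendOnlyStore()
store.record("jane/surname", "Smith", datetime(2001, 1, 1, tzinfo=timezone.utc))
store.record("jane/surname", "Davis", datetime(2006, 6, 1, tzinfo=timezone.utc))
```

With this in place, asking for Jane’s surname today and asking for it as of 2003 give different answers, and both are correct in their own context.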
This is not the only issue I have with data, but it is a fundamental one that is worth airing here – as I intend to discuss a few other problems that I have with data, in the hope of accumulating a useful resource.
Something quite limiting happened to the world of data when Ted Codd proposed his less than wonderful “relational theory of data”. Some people went as far as to suggest that it was a provably correct theory (mathematically) when it was no such thing. It was an attempt to represent the universe of data via a mathematical model, one that quickly proved to be incomplete and semantically bankrupt. Poor Ted believed that you could apply the mathematics of set theory and predicate logic to data in a useful way, and it only turned out to be half useful.
In his later years, he started inventing rules for OLAP, in the hope of repeating an earlier victory in his life, when people took his rules for relational databases seriously. OLAP, by the way, was just one illustration of why the Relational Theory of Data was inadequate.
Where am I going with this? Well, it’s about time we took a fresh look at the field of data; one that is practical and which might yield something more meaningful than a 5th Normal Form data model.
With any luck, I’ll get a dialog going with Jeff.