Sunday, October 23, 2005

Data Provenance

In my previous post on Information Sharing, I discussed some of the problems of information sharing in a loosely-coupled world, with special reference to the security services. There is further discussion of the social and political aspects of this at IntoTheMachine and TrustBlog. In this blog, I am continuing to focus on the information management aspects.

On Friday, I had a briefing about an EU research project on data provenance, led by IBM. This project is currently focused on the creation and storage of provenance data (in other words data that describe the provenance of other data - what we used to call an audit trail). The initial problem they are addressing is the fact that provenance data (audit trails and logs) are all over the place - incomplete and application-specific - and this makes it extremely hard to trace provenance across complex enterprise systems. The proposed solution is the integration of provenance data from heterogeneous sources to create one or more provenance stores. These provenance stores may then be made available for interrogation and analysis in a federated network or grid, subject to important questions of trust and security.
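To make the integration idea concrete, here is a minimal sketch of my own, not anything taken from the project itself. It assumes two invented application log formats (a pipe-delimited billing log and a dictionary-style CRM event) and maps both onto one common record type held in a simple in-memory store. All the names here (ProvenanceRecord, ProvenanceStore, from_billing_log, from_crm_event) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    item_id: str         # the data item whose history is being described
    actor: str           # who or what acted on the item
    action: str          # what was done
    timestamp: datetime  # when it happened (UTC)
    source: str          # which application log the record came from


def from_billing_log(line: str) -> ProvenanceRecord:
    # Hypothetical format: "2005-10-21T09:15:00|alice|UPDATE|invoice-42"
    ts, user, op, item = line.split("|")
    when = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return ProvenanceRecord(item, user, op.lower(), when, "billing")


def from_crm_event(event: dict) -> ProvenanceRecord:
    # Hypothetical format: epoch seconds and different field names
    when = datetime.fromtimestamp(event["at"], tz=timezone.utc)
    return ProvenanceRecord(event["object"], event["by"], event["verb"], when, "crm")


class ProvenanceStore:
    """A toy in-memory store; a real one would be persistent and access-controlled."""

    def __init__(self):
        self._records = []

    def add(self, record: ProvenanceRecord) -> None:
        self._records.append(record)

    def history(self, item_id: str) -> list:
        """Return every known record for one item, oldest first."""
        matching = [r for r in self._records if r.item_id == item_id]
        return sorted(matching, key=lambda r: r.timestamp)


store = ProvenanceStore()
store.add(from_billing_log("2005-10-21T09:15:00|alice|UPDATE|invoice-42"))
store.add(from_crm_event(
    {"object": "invoice-42", "by": "bob", "verb": "create", "at": 1129880000}
))

for record in store.history("invoice-42"):
    print(record.timestamp.isoformat(), record.source, record.actor, record.action)
```

The point of the sketch is only that once the application-specific formats are normalized, a single question ("what happened to invoice-42, and in what order?") can be answered across systems. Federation, trust and security are deliberately left out.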

In art history, provenance means being able to trace a continuous history of a painting back to the original artist - for example, proving that the version of the Mona Lisa currently in the Louvre is the authentic work of Leonardo da Vinci. As it happens, we don't have a completely watertight provenance for the Mona Lisa, as it was stolen by an Italian artist in 1911, and was not recovered until 1913. Most art lovers assume that the painting that was returned to the Louvre is genuine, but there is a gap in the audit trail in which an excellent forgery might possibly have been substituted. [See The Day The Mona Lisa was Stolen. I learned about this story from Darian Leader's book Stealing the Mona Lisa.] Provenance may also involve an audit trail of other events in the painting's history, such as details of any restoration or repair.

In information systems, provenance can be understood as a form of instrumentation of the business process - instrumentation and context that allows the source and reliability of information to be validated and verified. (Quality wonks will know that there is a subtle distinction between validation and verification: both are potentially important for data provenance; and I may come back to this point at a later date.) Context data are used for many purposes besides provenance, and so provenance may involve a repurposing of instrumentation (data collection) that is already carried out for some other purpose, such as business activity monitoring (BAM). Interrogation of provenance is at the level of the business process, and IBM is talking about possible future standards for provenance-aware business processes.

Provenance-awareness is an important enabler for compliance with various regulations and practices, including Basel II, Sarbanes-Oxley, and HIPAA. If a person or organization is to be held accountable for something, then this typically includes being accountable for the source and reliability of relevant information. Thus provenance must be seen as an aspect of governance.

Provenance is also an important issue in complex B2B environments, where organizations are collaborating under imperfect trust. From a service-oriented point of view, I think what is most interesting about data provenance is not the collection and storage of provenance data, but the interrogation and use. This means we don't just want provenance-aware business processes (supported by provenance-aware application systems) but also provenance-aware objects and services. Some objects (especially documents, but possibly also physical objects with suitable RFID encoding) may contain and deliver their own provenance data, in some untamperable form. Web services may carry some security coding that allows the consumer to trust the declared provenance of the service and its data content. There are some important questions about composition and decomposition - how do we reason architecturally about the relationship between provenance at the process/application level and provenance at the service/object level?
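As an illustration of an object that carries its own provenance, here is a small sketch of a hash chain, one common way of making a history tamper-evident. This is my own example and not the project's design; the function names (append_entry, verify) and the entry fields are assumptions. Note that a bare hash chain only detects accidental or naive alteration: anyone who can rewrite the whole chain can recompute the hashes, so a real scheme would also need the final hash to be digitally signed or lodged with a trusted third party.

```python
import hashlib
import json


def _digest(entry: dict, prev_hash: str) -> str:
    # Hash the entry together with the previous link's hash, so that
    # changing any earlier entry invalidates every later link.
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain: list, entry: dict) -> None:
    """Add one provenance entry to the end of the chain."""
    prev_hash = chain[-1]["hash"] if chain else ""
    chain.append({"entry": entry, "hash": _digest(entry, prev_hash)})


def verify(chain: list) -> bool:
    """Recompute every link and report whether the chain is intact."""
    prev_hash = ""
    for link in chain:
        if link["hash"] != _digest(link["entry"], prev_hash):
            return False
        prev_hash = link["hash"]
    return True


# A document travelling with its own history (hypothetical entries).
chain = []
append_entry(chain, {"actor": "supplier", "action": "created"})
append_entry(chain, {"actor": "auditor", "action": "reviewed"})
intact_before = verify(chain)

# Quietly rewriting an earlier entry breaks the chain.
chain[0]["entry"]["actor"] = "someone-else"
intact_after = verify(chain)
```

A consumer who receives the document and its chain can run verify before deciding how far to trust the declared history, which is the kind of interrogation at the object level that interests me here.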

We also need an analysis and design methodology for provenance. How do you determine how much provenance data will be adequate to satisfy a given compliance requirement in a given context (surely standards cannot be expected to provide a complete answer to this), and how do you compose a sufficiently provenance-aware business process, either from provenance-aware services or from normal services plus some additional provenancing services? Conversely, in the design of services to be consumed outside the enterprise, there are important analysis and design questions about the amount of provenance data you are prepared to expose to your service consumers. The EU project includes methodology as one of its deliverables, due sometime next year. In the meantime, IBM hopes that people will start to implement the provenance architecture, as described on the project website, and provide practical input.