What is Data Provenance & Why is it Important in the Clinical Diagnostics Space?

In clinical diagnostics labs, there’s a pressing need for data provenance—tracing the origin and changes over time of critical data such as electronic health records, analytical results, and workflow records.

“Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness.” (W3C, PROV-Overview)

Data in laboratories is often produced by separate information systems, which can make it difficult to trace. When the systems aren’t connected, it can be even more challenging to maintain the data’s chain of custody and the metadata records.

The result? Stakeholders can’t generate a report that describes exactly what happened to a sample, such as which agents [1] and activities [2] were involved, and at what times. This information about the data (the metadata) provides granular details that previously might have been collected in a notebook or file.

For example, in order for a lab to produce a full provenance record on a measured volume, they need to prove how they know that a certain volume was measured. They also need to be able to answer questions such as:

  • Who recorded that volume?
  • When did they record it?
  • Where was the measurement performed (at which facility)?
  • What system and instrument(s) were used?
  • Why were they measuring it? Was it part of a standard operating procedure (SOP)? What was the name of the SOP step that they were performing?
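
As a rough illustration, the questions above can be mirrored in a single structured record. The sketch below is only one possible shape: the class name, field names, and sample values are all hypothetical and are not taken from any particular LIMS or standard.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class MeasurementProvenance:
    """One measured volume plus the metadata needed to explain it.

    Every field name here is hypothetical; each one mirrors a question
    from the list above.
    """
    value_ul: float        # the measured volume, in microliters
    recorded_by: str       # who recorded the volume
    recorded_at: datetime  # when it was recorded (timezone-aware)
    facility: str          # where the measurement was performed
    system: str            # what software system captured it
    instruments: tuple     # what instrument(s) were used
    sop_id: str            # why: the SOP being followed
    sop_step: str          # why: the named SOP step being performed

    def missing_fields(self):
        """Return the names of fields that were left empty."""
        return [name for name, value in asdict(self).items()
                if value in ("", None, ())]


# Illustrative values only.
record = MeasurementProvenance(
    value_ul=50.0,
    recorded_by="tech_a",
    recorded_at=datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc),
    facility="Main Lab",
    system="example-lims",
    instruments=("pipette-3",),
    sop_id="SOP-12",
    sop_step="",  # deliberately left blank to show the completeness check
)
print(record.missing_fields())
```

A completeness check like `missing_fields` is one simple way to flag records that cannot yet answer every question; a real system would likely enforce this at the point of data entry instead.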

Answering these questions efficiently requires that metadata for each specimen be recorded and stored in a standardized way so that it can be easily reviewed. While no regulator or accreditation program (such as the FDA, CAP, or CLIA) currently dictates how clinical laboratories capture this metadata, we predict that detailed traceability will become a requirement for clinical software once modern laboratory software supports data provenance better. We recommend preparing for this sooner rather than later.

Why is data provenance so important in clinical diagnostics?

Labs should place a high priority on data provenance for several reasons:

  1. Auditability, transparency, and trust in the software are critically important for labs dealing with sensitive personal data.
  2. Regulations require that private patient data and laboratory records be handled securely and tracked in case follow-up is required. For next-generation sequencing (NGS) analyses, the College of American Pathologists (CAP) requires that all information used to process a patient sample (such as reagents, primers, sequencing chemistries, and platforms) be documented so that details can be extracted. In provenance terms, these can be thought of as the “agents and activities acting upon” the patient data.
  3. Tracking provenance helps labs interpret data and establish its trustworthiness.
  4. Provenance can be used to help labs analyze whether processes were performed efficiently.
  5. Clinical research relies on collaboration and reproducibility. Data provenance supports this by providing all the data necessary to reproduce the lab’s findings.
  6. Patient safety is critically important. If a serious adverse event occurs, involved laboratories might need to perform a post-hoc investigation to see how a result they produced could have contributed to this event. Consider how well your laboratory software stack might support this type of investigation.
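
To make the post-hoc investigation in the last item more concrete, here is a minimal sketch of walking backward from a reported result to everything that contributed to it. It assumes a hypothetical event log in which each event names an activity, the responsible agent, the entities it used, and the single entity it generated; the identifiers are invented for illustration.

```python
from collections import deque

# Hypothetical event log; every identifier below is made up.
EVENTS = [
    {"activity": "accession", "agent": "tech_a",
     "used": [], "generated": "sample-1"},
    {"activity": "extraction", "agent": "tech_b",
     "used": ["sample-1", "reagent-lot-7"], "generated": "dna-1"},
    {"activity": "sequencing", "agent": "sequencer-2",
     "used": ["dna-1"], "generated": "reads-1"},
    {"activity": "variant_calling", "agent": "pipeline-v3",
     "used": ["reads-1"], "generated": "report-1"},
    {"activity": "accession", "agent": "tech_a",
     "used": [], "generated": "sample-2"},
]


def trace_back(entity_id, events):
    """Return every event that contributed to entity_id, nearest first."""
    by_output = {event["generated"]: event for event in events}
    seen, lineage = set(), []
    queue = deque([entity_id])
    while queue:
        current = queue.popleft()
        event = by_output.get(current)
        # Skip inputs with no recorded origin and anything already visited.
        if event is None or current in seen:
            continue
        seen.add(current)
        lineage.append(event)
        queue.extend(event["used"])
    return lineage


for event in trace_back("report-1", EVENTS):
    print(event["activity"], "by", event["agent"])
```

Note that `reagent-lot-7` has no recorded origin in this log, so the trace stops there. Gaps like that are exactly what an investigation would need to surface.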

Looking ahead

In an ideal world, laboratory informatics systems would be able to generate and interact with data that adheres to provenance standards, such as W3C PROV [3]. What we’d like to see, eventually, is the ability for labs to immediately access all metadata records linked to a sample directly from within the laboratory information management system (LIMS). Unfortunately, that’s not possible yet using an off-the-shelf LIMS. There are a lot of obstacles to overcome before a universal provenance standard is adopted and all healthcare data formatting is harmonized.
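
For a sense of what provenance-aware data might look like, the sketch below arranges one measurement using the core PROV terms entity, activity, and agent, linked by the wasGeneratedBy and wasAssociatedWith relations. The function name, identifiers, and dictionary layout are assumptions made for illustration; this is not a conformant PROV serialization and has not been validated against the W3C documents.

```python
import json


def to_prov_like(measurement_id, value_ul, operator, instrument,
                 sop_step, started, ended):
    """Arrange one measurement using core PROV terms (illustrative layout)."""
    activity_id = f"activity:{sop_step}"
    return {
        # The thing that was produced.
        "entity": {measurement_id: {"value_ul": value_ul}},
        # What happened, and over what period of time.
        "activity": {activity_id: {"startTime": started, "endTime": ended}},
        # Who or what bears responsibility for the activity.
        "agent": {
            operator: {"type": "person"},
            instrument: {"type": "instrument"},
        },
        # Relations linking the three.
        "wasGeneratedBy": [
            {"entity": measurement_id, "activity": activity_id},
        ],
        "wasAssociatedWith": [
            {"activity": activity_id, "agent": operator},
            {"activity": activity_id, "agent": instrument},
        ],
    }


# Illustrative values only.
doc = to_prov_like(
    measurement_id="entity:volume-001",
    value_ul=50.0,
    operator="agent:tech_a",
    instrument="agent:pipette-3",
    sop_step="measure_volume",
    started="2024-01-01T09:30:00Z",
    ended="2024-01-01T09:31:00Z",
)
print(json.dumps(doc, indent=2))
```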

In the meantime, labs can work with a software consultant to integrate the various components of their informatics systems and provide more robust data provenance. When you’re selecting a new vendor or consultant, confirm that they treat data provenance as a functional requirement for clinical diagnostics software. Custom clinical software should always be built with data provenance in mind.

Ontology [4] is a related concept, which we’ll explore in our next post.

[1] An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent’s activity.
[2] An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.
[3] The World Wide Web Consortium (W3C) has created PROV, a set of recommended standards, to support the interchange of provenance information on the Web.
[4] An ontology is a formal naming of a set of concepts (similar to a dictionary) and the relationships between them that helps provide context to the data. It is closely tied to provenance, that is, where data comes from and what happened to it. Published ontologies help structure data by connecting the individual pieces.