Data Refuge Rests on a Clear Chain of Custody

The documentation of a clear “chain of custody” is the cornerstone of Data Refuge. Without it, trust in data collapses; without it, trustworthy, research-quality copies of digital datasets cannot be created.

 

Libraries always say: “Lots of Copies Keep Stuff Safe” (LOCKSS). That’s very true. But consider what happens if a faulty copy is made, whether by accident, technical error, or deliberate action, and then proliferates. Especially in a digital world, the result can be an epidemic. Instead of keeping “stuff safe,” we have spread lots of bad copies. Factual-looking data can easily be fake data.

 

So how can we safeguard data and ensure that a copy is true to the original? Especially if the original is no longer available, we must find another way to verify the copy’s accuracy. This is where a clear, well-documented “chain of custody” comes in. The chain documents where the data come from originally, who copied them and how, and who redistributes them and how. Along every link in that chain, the Data Refuge project relies on multiple checks by trained librarians and archivists to provide quality assurance. Consider this extreme case: What happens if an original dataset disappears, and the only copy has passed through unverified hands and processes? Even a system that relies on multiple unverified copies can be gamed if many copies of bad data proliferate.
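
As a rough illustration of what such a record might hold, here is a minimal Python sketch. It is not Data Refuge’s actual tooling: the class names, field names, roles, and the example URL are all hypothetical, chosen only to show the idea that every link in the chain names an actor and carries a sign-off.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CustodyEvent:
    """One link in the chain (hypothetical structure, for illustration only)."""
    actor: str      # who handled the data
    role: str       # e.g. "harvester", "checker", "describer"
    action: str     # what was done, and how
    verified: bool  # did this actor sign off on the copy?


@dataclass
class CustodyChain:
    """The documented history of one copied dataset."""
    source_url: str  # where the data originally came from
    events: List[CustodyEvent] = field(default_factory=list)

    def record(self, actor: str, role: str, action: str, verified: bool) -> None:
        """Append a new link to the end of the chain."""
        self.events.append(CustodyEvent(actor, role, action, verified))

    def unverified_links(self) -> List[CustodyEvent]:
        """Return every link whose actor did not sign off."""
        return [event for event in self.events if not event.verified]

    def is_trustworthy(self) -> bool:
        """True only if the chain is non-empty and every link is signed off."""
        return bool(self.events) and not self.unverified_links()


# Example with made-up actors: one unverified link is enough to break trust.
chain = CustodyChain(source_url="https://example.org/dataset")
chain.record("Harvester A", "harvester", "downloaded files from the site", True)
chain.record("Unknown", "redistributor", "re-uploaded a copy", False)
print(chain.is_trustworthy())
```

In this sketch a chain with no recorded links is treated as untrustworthy too, which mirrors the extreme case above: a copy with no documented history cannot be vouched for.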

 

This practice of documenting whose hands have been on information goes back hundreds, even thousands, of years. Instilling trust in information is a universal human concern. Unfortunately, the practice is imperfect. The workflow devised for Data Refuge is similarly not 100% foolproof. But we can increase our trust in the copies by including librarians trained in digital archiving and metadata as the final instance of quality control before we make anything public. At this end link in the chain, we verify the quality with the Data Refuge stamp of approval.

 

Briefly, here’s how we verify the data:

  • After the data are harvested, they are checked against the original website copy of the data by an expert who can answer: “Will this data make sense to a scientist or other researcher who might want to use it?” This check helps ensure the data are usable.

  • Then, digital preservation experts check the data again, make sure that the metadata reflect the right information, and create a manifest of technical checksums to enclose with the data in a BagIt file so that any future changes to the data will be easily recognizable.

  • The BagIt files move to the describers, who open them, spot check for errors, and create records in the datarefuge.org catalog, adding still more metadata.

  • Each actor in this chain is recorded. Each actor in effect signs off, saying yes, this data matches the original. And each actor also checks the work of the previous actor and signs off on it. This is the best way we have to ensure this copy is the same as the original, even if the original goes away.
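
The checksum manifest mentioned in the steps above can be sketched in a few lines of Python. This is a simplified illustration using only the standard library, not the BagIt format itself and not the tooling Data Refuge uses; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> Dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        path.relative_to(data_dir).as_posix(): sha256_of(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """List files that were altered, added, or removed since the manifest was made."""
    current = build_manifest(data_dir)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

A preservation expert would build the manifest once, when the copy is known to be good, and store it with the data. Anyone receiving the copy later can call `changed_files` with the stored manifest: an empty list means every file still matches its recorded checksum, and any name in the list points to a file that no longer matches.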