Matthew Addis

Wash Day for Your Data

How often do you scrub your data? Scrubbing means checking that the data you are storing is in good condition and fixing any problems you find. Just like my kids, data doesn't seem to stay clean for very long, and it needs regular cleaning to make it all shiny again. Scrubbing is wash day for data!

As I've blogged before, no storage technology is perfect: there is always something that can go wrong. And if it can go wrong, then it will go wrong. Bit rot, media failures, software bugs, human errors - they all cause data corruption or loss if left unchecked. Scrubbing is one answer to this problem. There's a lot of terminology to get to grips with when it comes to scrubbing, but it all amounts to the same thing:

1) You need a way to measure whether your data has changed in any way. Some people refer to this as “fixity” and others as “data integrity”. Checksums, hashes and fingerprints are terms used to describe ways to measure fixity. There's a great explanation on the Library of Congress Signal blog.

2) You need to actively check the data to pick up any problems. Some people call this a fixity audit and others call it integrity checking. It's important to do scheduled checks, as you can't always rely on storage systems to do this for you, or to get it right all the time.

3) You need some way of fixing any problems that you find.  Here the most common approach is to have some form of redundancy in the data. This means that if one part of the data is damaged then it can be corrected using the redundancy in the rest of the data. 

Like learning your ABCs, scrubbing is as easy as 1, 2, 3 (I'm sure there's a song in there somewhere). You checksum the data when you first store it so you have a reference point for what “good data” looks like. Months or even years later you check that the checksum hasn't changed by retrieving the data and making a new checksum to compare against the reference one. If there is a difference, then you know you have a problem that needs fixing.
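To make this concrete, here's a minimal sketch in Python of that checksum-then-audit routine. The manifest file, its layout and the function names are illustrative choices of my own, not something prescribed by any particular tool:

    # Minimal fixity-check sketch (assumed filenames and manifest layout):
    # checksum a file at ingest, then re-check it later against the stored
    # reference value.
    import hashlib
    import json
    from pathlib import Path

    def checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
        """Return the hex digest of a file, read in chunks to handle large files."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def record_reference(path, manifest="manifest.json"):
        """At ingest: store the 'good data' reference checksum in a manifest."""
        manifest_path = Path(manifest)
        entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
        entries[str(path)] = checksum(path)
        manifest_path.write_text(json.dumps(entries, indent=2))

    def audit(manifest="manifest.json"):
        """Months or years later: recompute checksums and report any mismatch."""
        entries = json.loads(Path(manifest).read_text())
        for path, reference in entries.items():
            status = "OK" if checksum(path) == reference else "MISMATCH - needs fixing"
            print(f"{path}: {status}")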

There's nothing new in this sort of approach. Whenever you use a hard drive, DVD, data tape, RAID array or nearly any other form of IT storage, there is some form of data checking and error recovery happening behind the scenes every time you ask for your data back. All these techniques work by adding some form of redundancy into the data, which means storing a little more than the bare minimum bits and bytes so that if there are errors then the problem can be corrected using the “spare” data. This is the world of error correction codes.
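As a rough illustration of the idea (a toy sketch of my own, not how any real drive or RAID controller implements its error correction codes), here is XOR parity in Python: one extra “spare” block is stored alongside the data blocks, and any single lost block can be rebuilt from the rest:

    # Toy illustration of "spare" data: store one extra XOR parity block,
    # and any single lost block can be rebuilt from the survivors.
    # (Real error correction codes are far more sophisticated than this.)
    from functools import reduce

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data_blocks = [b"HELLOWORLD", b"0123456789", b"ABCDEFGHIJ"]
    parity = xor_blocks(data_blocks)  # the little bit extra that gets stored

    # Simulate losing block 1, then rebuild it from the survivors plus parity.
    surviving = [data_blocks[0], data_blocks[2], parity]
    recovered = xor_blocks(surviving)
    assert recovered == data_blocks[1]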

The problem is that no matter how good the intentions of storage system designers and manufacturers are, no storage technology is ever perfect - and neither are the people who install and run it. So what do you do?

You either accept the risk, or you add more redundancy and do some scrubbing! The simplest way to add more redundancy is to make more than one copy of the data. You store these copies in other locations, ideally on other technologies, and in an ideal world you even get different sets of people to manage the different copies. This is the “not all eggs in one basket” approach that uses replication and diversity to minimise the overall chances of data loss. There's another good post from the folks at The Signal on why this is a good idea.
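Here's a minimal sketch (with illustrative paths and helper names of my own choosing) of how extra copies make repair possible: if one copy fails its fixity check, you overwrite it from a copy that still matches the reference checksum:

    # Sketch of repair from replicated copies (paths and helpers are
    # illustrative assumptions, not a real system's layout).
    import hashlib
    import shutil

    def checksum(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def repair_copies(copy_paths, reference_checksum):
        """Check every copy; overwrite any bad copy from the first good one."""
        good = [p for p in copy_paths if checksum(p) == reference_checksum]
        bad = [p for p in copy_paths if p not in good]
        if not good:
            raise RuntimeError("No intact copy left - repair is not possible")
        for p in bad:
            shutil.copyfile(good[0], p)
            print(f"Repaired {p} from {good[0]}")
        return bad

    # Example use: three copies held in different locations.
    # repair_copies(["/archive/a/file.mxf", "/archive/b/file.mxf",
    #                "/archive/c/file.mxf"], reference_checksum)

Note that if every copy is damaged there is nothing left to repair from, which is exactly why the number of copies, and how often you check them, matters.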

So, how many copies do you make?  How often do you check them?  How much does it all cost?  What are the chances of data loss? 

As a simple example, imagine your archive keeps data tapes on shelves. Never ever checking the tapes is a bad idea. But checking them too often can be a bad idea too. Each time you check a tape it takes time and money and there's a risk of damaging or losing it (you drop it, the tape drive eats it, you put it back in the wrong place). So where's the happy middle ground? 
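To show the shape of that middle ground, here is a deliberately crude toy model with made-up numbers (my own back-of-the-envelope illustration, not the PrestoPRIME model): more frequent checks cost more and add handling risk, while longer gaps between checks widen the window in which both copies of a tape could fail before anyone notices:

    # Toy cost/risk model with invented numbers - assumes two copies of each
    # tape, and that a bad copy can be repaired from the other one at check time.
    def toy_model(check_interval_years, horizon_years=20,
                  cost_per_check=5.0,          # staff time, drive wear, etc.
                  p_handling_damage=0.0005,    # chance a check itself damages the tape
                  annual_failure_rate=0.01):   # chance per year that a copy goes bad
        checks = horizon_years / check_interval_years
        cost = checks * cost_per_check
        # Cumulative chance that handling during a check damages or loses the tape.
        handling_risk = 1 - (1 - p_handling_damage) ** checks
        # Rough chance that both copies fail inside the same check window,
        # leaving nothing good to repair from.
        dual_failure_risk = checks * (annual_failure_rate * check_interval_years) ** 2
        return cost, handling_risk + dual_failure_risk

    for interval in (0.5, 1, 2, 5, 10, 20):
        cost, risk = toy_model(interval)
        print(f"check every {interval:>4} years: cost ~{cost:6.1f}, risk of loss ~{risk:.4f}")

With these (entirely invented) numbers the risk of loss is lowest somewhere around a check every couple of years: more often and the handling risk and cost climb, less often and the chance of losing both copies between checks takes over.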

Nearly five years ago, Richard Wright from the BBC set a challenge in terms of what an archive might want. He asked for a cost curve for the percentage of archive content lost over 20 years, including the probability of loss, the size of the uncertainties, and a cost function showing how the probability of loss varies as investment increases or decreases.

Understanding the trade-off between cost and risk of loss when making copies of data and scrubbing them is one small part of answering this question. We've been using our PrestoPRIME modelling tools to look at how to create and present cost/risk diagrams that 'map out' the choices that can be made. If you want to know more about the diagram below and how we have approached the problem of providing quantified costs, risks and uncertainties for long-term storage of file-based assets using IT storage technology, then please head over to our modelling website and have a look at this article.

Cost vs. risk heatmap created using the PrestoPRIME modelling tools