For several years now I’ve been following strong practices for the safe and secure backup of my family’s digital photos and video. Much of what I do comes from The DAM Book by Peter Krogh and the good people of The DAM Forum. DAM in this case stands for digital asset management. After each session of cataloging my digital assets I take a backup to an external drive, which is periodically swapped with an offsite duplicate.

This was all working fine until one day when I decided to create MD5 hashes of my original files for comparison to the backups. Although my backup software had never reported any errors, the files were different.

In short, MD5 software reads every bit of a file and generates a short signature string (a digest) from its contents. If the signature is recreated at a later date and turns out to be different, the file has changed. Two different files can, in principle, generate the same MD5 result, but for checking whether a file has changed during a copy it is fine. Windows uses a basic check called CRC, which is faster but not as robust. When I checked the backup MD5 digests against those of the original files there were differences: approximately 10 of my 16,000+ files did not match. Zero is the only acceptable number. Each of those files showed the same size, date and time as the original, which is why the backup program thought everything was fine, but the MD5 signatures proved otherwise.
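To make the idea concrete, here is a minimal Python sketch of the comparison. It is not the tool I used; the function names `md5_of_file` and `files_match` are made up for illustration, and it relies only on Python's standard `hashlib` module.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large photo and video files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(original, backup):
    """True when the original and the backup produce the same MD5 digest."""
    return md5_of_file(original) == md5_of_file(backup)
```

Because the digest depends only on the bytes in the file, two copies with the same size, date and time will still fail `files_match` if even one byte differs.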

The Problem

Calculating MD5 digests adds overhead to the copy process. A one-step copy blows out to three steps: calculate the signature, copy, and recalculate the signature to check. This takes time. Most MD5 programs can store the digests as one digest file per source file, one digest file per directory, or a single digest file covering multiple directories. My preference is one per directory as it provides the best balance. It also means that if I copy a directory to DVD for backup I can easily copy the MD5 file with it.
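The three steps can be sketched in a few lines of Python. This is an illustration under assumptions, not what ExactFile or SyncBack do internally: the digest file name `checksums.md5`, the `digest *filename` line layout, and the function names are all my own choices for the example.

```python
import hashlib
import shutil
from pathlib import Path

DIGEST_NAME = "checksums.md5"  # hypothetical name: one digest file per directory


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(source, dest_dir):
    """Hash, copy, re-hash; raise if the copy does not match the original."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    before = md5_of_file(source)        # step 1: calculate the signature
    target = dest_dir / source.name
    shutil.copy2(source, target)        # step 2: copy (keeps date and time)
    after = md5_of_file(target)         # step 3: recalculate and compare
    if before != after:
        raise OSError(f"checksum mismatch after copying {source}")

    # Record the digest next to the copy so the MD5 file travels with the directory.
    with open(dest_dir / DIGEST_NAME, "a", encoding="utf-8") as out:
        out.write(f"{before} *{source.name}\n")
    return before
```

One caveat: reading a file back straight after writing it may be served from the operating system's cache rather than the disk, so a later re-check against the stored digest is still worthwhile.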

On my dual-core PC, running through 16,000 files, often 10+ MB each, takes hours. More so as the PC will often crash before finishing (another problem for another day). I don’t want to waste time recreating what I’ve already done.

There are, then, two criteria to be met when using checksums for backup validation: calculate checksums fast, and only when needed.

The Solution

My solution is a combination of checksum creation and validation software, managed by scripts to provide fine-grained control and efficiency. “If it’s done, don’t do it again” was the motto for this project.
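As a rough illustration of that motto, the Python sketch below lists only the files in a directory that do not yet appear in the directory's digest file, so nothing is hashed twice. It is not my actual script. It assumes a hypothetical per-directory digest file named `checksums.md5` with one `digest *filename` entry per line, and the function names are invented for the example.

```python
from pathlib import Path

DIGEST_NAME = "checksums.md5"  # hypothetical name: one digest file per directory


def listed_names(digest_file):
    """Return the set of file names that already have an entry in the digest file."""
    names = set()
    with open(digest_file, encoding="utf-8") as handle:
        for line in handle:
            # Each entry is assumed to look like "<md5 digest> *<file name>".
            _digest, separator, name = line.rstrip("\r\n").partition(" *")
            if separator and name:
                names.add(name)
    return names


def files_missing_checksums(directory):
    """Return the names of files in a directory that still need a checksum."""
    directory = Path(directory)
    digest_file = directory / DIGEST_NAME
    done = listed_names(digest_file) if digest_file.exists() else set()
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file()
        and entry.name != DIGEST_NAME
        and entry.name not in done
    )
```

A script built on this would hand only the returned names to the checksum program, then append the new digests to the same file. If the PC crashes part way through, the next run picks up where the last one stopped.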

This is a Windows PC-based solution, but the logic will apply to any operating system.

To create the MD5 signatures I use ExactFile, specifically exf, the command line version. MD5 programs abound, but ExactFile provides additional speed by making proper use of the multiple cores in modern CPUs. On a beefy test sample ExactFile ran four times quicker. The application version is the best way to see this in action.

For syncing files between drives I use the freeware version of SyncBack. It helps maintain an identical list of files across two drives (though, as we know, the files themselves may not be identical). SyncBack does provide an option to generate an MD5 signature for each file for copy validation, but it lacks the control to make this efficient: it runs over every file, every time.