Useful Numbers
We define useful numbers as those whose properties make them valuable in practice. We're all familiar with π and e because their values describe things in the real world and are used daily in many professions. In fact, they are so useful that you likely know both their meaning and approximate values. The same goes for c, the speed of light, in the famous equation E = mc^2.
Recently, I've been considering the usefulness of certain calculated values. While not constants, these relatively small values prove extremely useful. Perhaps the most famous to computer science professionals is the ubiquitous checksum. A checksum's purpose is to represent the state of a set of data succinctly and efficiently, such that it is unlikely in practice for the data or the checksum to change without the change being detectable. A well-designed checksum detects data corruption with high reliability and low cost in both size and computation. Note that a checksum cannot guarantee that it detects every possible error, yet in practice it proves useful.
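A minimal sketch of the idea, using a toy sum-of-bytes scheme (the helper name is my own, not any standard):

```python
# A toy checksum: sum all bytes modulo 256. It is cheap to compute and
# small to store, and it detects many common corruptions, though by no
# means all of them.
def checksum(data: bytes) -> int:
    return sum(data) % 256

original = b"hello, world"
corrupted = b"hellp, world"  # a single byte changed

print(checksum(original) == checksum(corrupted))  # False: the change is detected
```

Real checksums (CRCs, for instance) are cleverer about which errors they catch, but the shape is the same: a small value standing in for a larger block of data.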
Checksums over the years have ranged from the lowly parity bit to the sophisticated error-correcting codes of modern memory and disk drives, which support recovery from many common errors. In fact, without error-correcting codes, large memory systems (pretty much any current PC) would be unreliable, as the probability of a cosmic ray affecting a memory cell is surprisingly high.
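The parity bit is the simplest of the lot and easy to demonstrate (a sketch; the helper is mine):

```python
# Even parity: the extra bit makes the total count of 1-bits even.
# Any single flipped bit changes that count's parity and is detected,
# though two flipped bits would cancel out and slip through.
def parity(bits: int) -> int:
    return bin(bits).count("1") % 2

data = 0b1011001                     # four 1-bits, so the parity bit is 0
stored = (data << 1) | parity(data)  # append the parity bit

flipped = stored ^ 0b100             # simulate a single-bit error
print(parity(stored))                # 0: parity holds
print(parity(flipped))               # 1: parity violated, error detected
```

Error-correcting codes extend this idea with enough redundancy to locate, and therefore repair, the flipped bit rather than merely notice it.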
Another kind of checksum algorithm is making a big splash. It claims to create a unique, fixed-size value for any input you or anyone could or would generate in a lifetime. In short, it claims that different inputs always generate different outputs. When I was first introduced to this idea, I was skeptical. It seemed impossible. Indeed, it is easy to see that the guarantee cannot be absolute: an algorithm mapping arbitrarily large inputs to fixed-size outputs must map some distinct inputs to the same output. But the claim isn't that the algorithm does the impossible, only that the unique mapping holds in practice. I can't explain the details, but such algorithms have been created, and the best of them are very, very good at keeping that promise. So good that the best mathematicians have been unable to construct examples disproving the claim.
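The flavor of that guarantee is easy to see with a standard cryptographic hash such as SHA-256, using Python's hashlib:

```python
import hashlib

# Two inputs differing in a single character produce completely
# unrelated digests, and the digest is the same fixed size no matter
# how large the input is.
a = hashlib.sha256(b"useful numbers").hexdigest()
b = hashlib.sha256(b"useful numberZ").hexdigest()

print(a == b)  # False: the digests differ
print(len(a))  # 64 hex characters (256 bits), regardless of input size
```

No two distinct inputs are publicly known that share a SHA-256 digest, which is the practical sense in which the mapping is "unique."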
Algorithms such as this are known as cryptographic hashes. A primary use is to validate that a block of data has not been modified: any change, whether intentional or accidental, results in a different hash. Many products are signed with a cryptographic hash, and download sites often publish MD5 or SHA-1 hashes so the download's integrity can be verified before unpacking. Another use is in hashing passwords. By storing only the hash, the password itself is never stored and thus can't be read from storage, yet it can be validated simply by comparing the hash of the entered password with the stored hash. A third use is identifying a file or any arbitrary data. The idea is that if every file has a different hash, the hash can stand in for the file. Therefore, if a duplicate hash is seen, we know the contents must also be the same. In this application, the hash serves as an efficient means of identifying content and comparing for equality.
SnapshotCM makes use of a cryptographic hash to identify whether file content is modified locally. This makes validation more efficient than checking out and comparing each file in the background. SnapshotCM product images on Windows are also signed, another application of cryptographic hashing that verifies product image integrity.
For further reading on cryptographic hashes, I recommend An Illustrated Guide to Cryptographic Hashes, which also contains further references for anyone who wants to dive deeper.
SnapshotCM uses two other useful-number-generating algorithms, which I plan to discuss in coming months.
