Principles of Content Addressing
I’ve been working on a content addressable storage system for around two years now. I’ve come up with several prototypes over that period, but most of that time has been spent trying to get the basic design right. I finally nailed down the last remaining pieces and for the past two months I’ve been hard at work on a practical implementation. However, at the moment I’m hung up on some performance problems and I’m not going to do a release until the system is more than just “usable.”
In the meanwhile, I’d like to put forth some guidelines that other projects in this space may benefit from, so that we can stand on each other’s shoulders instead of on each other’s feet. These ideas are intended to be common sense, down-to-earth directions to help define the problem space and perhaps even allow these projects to interoperate, without trying to subsume each other as “modular back-ends” (which every project seems to dream of).
The first and most important rule:
Don’t namespace your hashes!
Content addressing is cool because it lets you turn any amount of data into a fixed-length opaque name. Cryptographic hashes are also extremely fast, and on modern hardware you can usually hash data faster than you can transmit or store it. So designers of content addressing systems tend to think that there’s no harm in inserting their own meta-data or system-specific information into the hash along with the original user data.
Although this data becomes “invisible” inside the resulting hash, doing this causes all hashes from your project to be incompatible with hashes from everywhere else. I call this “namespacing” and it works the same as salting password hashes. A historical example of this is BitTorrent. The hash BitTorrent uses for looking up a file is not the same as the hash of the file itself, which makes BitTorrent’s magnet links pretty useless for other projects.
This is important not just for compatibility between different content addressing systems, but also for existing users of well-known cryptographic hash algorithms, and for convenience whenever someone might want to compute a hash for your system “manually.” It should be possible to generate your hashes independently using existing tools, e.g. using
sha1sum(1) on a file from the command line.
In order to track meta-data without introducing namespace issues, a system should use “sidecar” objects or files. These files might be associated with the primary file via its hash, and the system would track those reverse relations.
Content addresses should be URIs
Allow me to quote a post I saw on Hacker News some time ago:
It seems natural in retrospect, but it actually took a surprisingly long time to invent. The URL didn’t exist until 1990; people had been putting files on FTP sites since 1973 or so, 27 years at that point, and there wasn’t even a standard piece of software which, given “wuarchive.wustl.edu:/foo/bar/baz.txt”, would give you the contents of baz.txt. (ncftp supported that by 1994, IIRC, but no vendor shipped ncftp as standard.)
Obviously it wasn’t that people couldn’t figure out how to do that. It’s that nobody understood that it was an important thing to be able to do. It seems really obvious in retrospect, but it wasn’t.
Today, there still are a surprising number of people who create network-accessible persistent objects without making them URL-addressable. Apparently it’s still not obvious to everyone that it’s an important thing to be able to do.
Unfortunately, URNs and magnet links have sort of poisoned the well for this idea. So allow me to dispatch both ideas with prejudice:
- URNs were designed around the initial idea of turning ISBNs into hyperlinks. However, ISBNs are dynamically assigned by a central authority. Content hashes are statically assigned by a decentralized authority (a fixed algorithm that anyone can run). It turns out that dynamic, central assignment is exactly what regular URLs are good at, so the original use case of URNs was invalid. No cryptographic hash, not even ancient and broken MD5, was ever registered as a URN namespace. Anyone using URNs for content hashing is breaking the spec, and likewise, a new content addressing system would be unable to support any of the existing URNs.
However, those failings don’t mean the core idea is bad. Content addresses as URIs would make our web less fragile as servers come and go over time.
This also defines the scope of any content addressing proposal. We need content addressing at the user interface layer (or just below), not at the storage layer. Block deduplication should be left for the low-level file system or the hard drive controller, in the cases where it is even worth doing. Take advantage of the tower of abstraction.
Support arbitrary, pluggable hash algorithms
Over time, new cryptographic hashes are developed and old ones become weaker as flaws are found. A content addressing system that wants to promise long-term storage shouldn’t be tied to the obsolescence of one algorithm.
A content addressing system may use a particular algorithm internally as its on-disk lookup scheme. These days, this ought to be SHA-256 or better. However, this internal hash should be wrapped in an indirection layer that can support other hashes from arbitrary algorithms. This way, new algorithms can be added and old links keep working even if the algorithm is obsolete and deprecated for new use.
The only requirement for a hash algorithm is that it be a pure function, taking the input data (and possibly datatype, such as a MIME type), and returning a fixed list of zero or more hashes for that data. No other inputs can be used and the resulting hashes for a given input must not change over time. Regardless of algorithm, hash resolution is done using either strict string equality or else prefix comparison (either option has some tradeoffs).
Sometimes hashing less is more
Content addressing is all about data identity, or asking the question, “what makes two files ‘equal?’” The answer depends on the intended purpose of the hash: for legal work or banking, a cryptographic hash that checks bit-for-bit equality is the safest bet. For an image, we can come up with several definitions:
- Is the image file bit-for-bit identical?
- Is the rendered image pixel-for-pixel identical (but ignoring embedded Exif meta-data or even file formats)?
- Is the image visually identical (taking into account human perceptual limits, e.g. lossy compression)?
- Is it the same basic image (ignoring scaling, white balance, etc.)?
A content addressing system should allow users to choose the appropriate algorithm for each of their links, accepting image and audio fingerprints, fuzzy hashes, and semantic hashes. A semantic hash is just an algorithm that normalizes, quantizes or discards part of its input before passing it onto a cryptographic hash. For example, you could link to a particular version of a blog post using a cryptographic hash, or you could link to any version of that blog post, allowing for small changes like typo corrections, using a semantic hash.
Common content address URI format
The “hash” URI scheme is neutral (not project name-specific) and communicates the requirements of an algorithm to be eligible (there is no acceptable content address that isn’t some sort of hash). Double-slash syntax is used for two reasons: one, because it is more commonly accepted and easier to recognize by parsers, and two, because the “algo” fills the role of the URI authority, parallel to DNS for naming web servers. The URI path is just the hash string as generated by the algorithm (case-sensitive to support encodings like base-64). No extra subpath components should be added by the system, because the path is solely controlled by the algorithm (as the path of
http: URIs is controlled by the web server). Query parameters might be accepted depending on the system, and would be resolved by the system, not the algorithm.
Shorter URIs can be supported either dynamically by prefix comparison or statically by having algorithms generate alternate short forms of each result. To avoid accidental collisions, about 12 bytes of entropy is the minimum. That’s 24 characters in hex or 16 characters in base-64. Full hashes are recommended when length isn’t an issue or in cases where it’s important to protect against intentional (malicious) collisions.
Good luck designing and developing content addressing systems. Thank you for your time.