Here are some of my recent posts related to content addressing:
Lots of comments are asking what problem this solves.
First, content addressing is the best general solution to exactly-once message delivery in a distributed system. Examples of distributed message protocols include (ahem) email, HTTP and RSS. The alternative, UUIDs, is strictly worse in almost every way (which is to say, if you’re currently using UUIDs for anything, consider switching to hashes of the content).
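To make the exactly-once claim concrete, here is a minimal sketch of a receiver that treats the content hash as the message ID. SHA-256, the `Inbox` class, and the `deliver` method are all assumptions made for illustration, not part of any real protocol. The sender can retry as often as it likes, and the receiver keeps one copy.

```python
import hashlib


def content_id(body: bytes) -> str:
    """Address a message by the SHA-256 hash of its bytes."""
    return hashlib.sha256(body).hexdigest()


class Inbox:
    """Hypothetical receiver that stores each distinct message once."""

    def __init__(self):
        self.messages = {}  # content hash -> message bytes

    def deliver(self, body: bytes) -> bool:
        """Return True if the message was new, False if it was a duplicate."""
        key = content_id(body)
        if key in self.messages:
            return False
        self.messages[key] = body
        return True


inbox = Inbox()
first = inbox.deliver(b"hello")
retry = inbox.deliver(b"hello")  # sender never saw the ack and resent
```

The sender needs no coordination with the receiver to get this behavior: at-least-once sending plus hash-keyed storage behaves like exactly-once delivery.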
Second, content addressing guarantees the integrity of messages, meaning as long as you have the hash, you can get the data from anywhere. This is highly useful for mirroring and failover. Basically, we might be able to solve linkrot to a large degree, and make it easier to mirror sites (especially locally, which is useful on high-latency connections and offline).
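As a rough sketch of why the hash makes the source irrelevant: the function below tries a list of mirrors and accepts the first response whose hash matches. The mirrors are plain dictionaries standing in for real servers, and `fetch_verified` is a hypothetical name. SHA-256 is an assumption too.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def fetch_verified(expected_hash, mirrors):
    """Try each mirror in order; return the first body whose hash matches."""
    for mirror in mirrors:
        body = mirror.get(expected_hash)
        if body is not None and sha256_hex(body) == expected_hash:
            return body
    raise LookupError(f"no mirror returned valid content for {expected_hash}")


page = b"<h1>Hello</h1>"
address = sha256_hex(page)

empty_mirror = {}                                # does not have the page
corrupt_mirror = {address: b"<h1>Hxllo</h1>"}    # serves damaged bytes
local_mirror = {address: page}                   # a good local copy

result = fetch_verified(address, [empty_mirror, corrupt_mirror, local_mirror])
```

None of the mirrors has to be trusted, because a wrong answer is detectable and simply skipped.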
You need to think about what makes messages distinct. Content addressing will help you as long as you have a rigorous definition of identity, which you might otherwise be doing ad-hoc when you assign UUIDs.
If the same message is meaningful at different points in time, your hashed content should include a timestamp (and of course you need to choose an appropriate level of precision). If your message is meaningful in different contexts, it needs to reference that context somehow (ideally by hash).
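Here is one way that definition of identity might look in code. Everything in it is an assumption for illustration: the field names, the one-minute default precision, the JSON canonical form, and the `message_id` function itself. The point is only that the timestamp is truncated to the precision you chose, and the context is pulled in by its hash.

```python
import hashlib
import json
from datetime import datetime, timezone


def message_id(text, sent_at, context_hash=None, precision_seconds=60):
    """Hash only the fields that define identity for this application."""
    # Round the timestamp down to the chosen precision so that sends
    # inside the same window produce the same hash on purpose.
    epoch = int(sent_at.timestamp())
    bucket = epoch - (epoch % precision_seconds)
    identity = {"text": text, "time_bucket": bucket, "context": context_hash}
    # Sorted keys and fixed separators give a stable byte representation.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


thread = hashlib.sha256(b"thread: lunch plans").hexdigest()
t0 = datetime(2015, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
t1 = datetime(2015, 1, 1, 12, 0, 40, tzinfo=timezone.utc)  # same minute
t2 = datetime(2015, 1, 1, 12, 5, 0, tzinfo=timezone.utc)   # five minutes later

same_window = message_id("yes", t0, thread) == message_id("yes", t1, thread)
later = message_id("yes", t0, thread) == message_id("yes", t2, thread)
other_context = message_id("yes", t0, thread) == message_id("yes", t0, None)
```

One caveat of this particular bucketing scheme: two sends a second apart can still land in different buckets if they straddle a boundary, so the precision is a tuning knob and not a guarantee.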
The payoff for doing this well is 1. an elimination of double-sent emails, double-posts, etc. (when the software might not know if it went through the first time), and 2. the ability to deduplicate across a network partition, for example if a user makes the same change on their computer and on their phone and then syncs them.
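The second payoff can be sketched as a plain dictionary union. This assumes each device keeps its changes in a map keyed by content hash; `record` and `sync` are hypothetical helpers, and the change strings are made up.

```python
import hashlib


def content_id(change: bytes) -> str:
    return hashlib.sha256(change).hexdigest()


def record(replica: dict, change: bytes) -> None:
    """Store a change under its content hash."""
    replica[content_id(change)] = change


def sync(a: dict, b: dict) -> dict:
    """Union of two replicas; identical changes share a key and merge."""
    merged = dict(a)
    merged.update(b)
    return merged


laptop, phone = {}, {}
record(laptop, b"rename note 7 to 'Groceries'")
record(phone, b"rename note 7 to 'Groceries'")  # same change, made offline
record(phone, b"delete note 3")

merged = sync(laptop, phone)
```

With UUIDs the two renames would have arrived as two separate events, and the application would have had to notice on its own that they were the same change.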
Unfortunately at this point I made a horrible mistake. I mentioned timestamps and distributed systems in the same context, which prompted millions of people who have read Aphyr’s blog to leap in to correct me. After all, wall-clock timestamps are terrible for ordering events!
Summing up a long and painfully boring debate:
They’re events, but no ordering is assumed, guaranteed, promised, implied, or enforced. The timestamps, if you choose to use them, are only for determining message identity (equality) and thus can be as accurate or inaccurate as relevant for your application. […] If you do want ordering, BYO. Content addressing and UUIDs are equivalent in that regard.
[…] Content has to be defined at the application level, based on what the application is trying to accomplish. If absolutely every message is logically unique, then under a content addressing system each message needs to include some random data, which will result in random hashes equivalent to UUIDs. That is the worst case scenario for content addressing.
Even so, it’s still “as good or better” (aside from performance concerns, which I don’t mean to dismiss).
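The worst case is easy to picture in code. In this sketch (the 16-byte nonce size and the function names are assumptions), a message that must always be unique carries random bytes as part of its content, so its hash is as unpredictable as a random UUID, while anyone holding the message can still verify it.

```python
import hashlib
import os


def make_unique(body: bytes):
    """Prepend 16 random bytes so the message itself is unique, then hash it."""
    message = os.urandom(16) + body
    return message, hashlib.sha256(message).hexdigest()


def make_shared(body: bytes):
    """No nonce: identical bodies are the same message with the same hash."""
    return body, hashlib.sha256(body).hexdigest()


msg_a, id_a = make_unique(b"ping")
msg_b, id_b = make_unique(b"ping")   # different nonce, different hash
msg_c, id_c = make_shared(b"ping")
msg_d, id_d = make_shared(b"ping")   # same content, same hash
```

The nonce travels with the message, so the hash still checks integrity. That is the part a bare UUID cannot do.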
And to sum up:
Content hashes are just like UUIDs except you can get useful collisions if you want them.
As far as I know, there are two novel ideas in this set of posts:
- Content addressing (and to a lesser extent UUIDs) is the best general solution to the exactly-once message delivery problem
- Content hashes are strictly more powerful than UUIDs in almost all cases (whenever the content is available to be hashed)
Not novel but worth repeating is the idea that basically every internet-based technology, even things with a centralized server, has parts that act as distributed systems. And yes, most of these things are subtly broken in ways that we put up with and ignore. (Which, of course, is fine–if it ain’t broke, don’t fix it–but maybe we could do better.)
I’ve been thinking about writing up a “scholarly paper” expounding and expanding on these ideas, but that sounds like a lot of work. Plus, now that I’ve put them out there, maybe it’s too late. Oh well!
Keywords: distributed systems, content addressing