2017-11-18

Content-Addressable Storage versus Eventually Consistent Databases

Content addressing is, as far as I know, the best way to build an eventually consistent database. But it’s become apparent to me that there is actually not a lot of overlap between the two concepts beyond that.

My project StrongLink tries to be both. It provides content-addressable storage for files, and lets you find files by hash URI. It also tries to track file meta-data using what I call “meta-files”, which are files that store meta-data about other files.
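To make the two roles concrete, here is a minimal sketch of a content-addressed store in which a meta-file is just another stored file that points at its target. This is not StrongLink's actual API: the `ContentStore` class, the `make_meta_file` helper, the JSON layout, and the exact `hash://sha256/...` URI form are all assumptions made for illustration.

```python
import hashlib
import json


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, not StrongLink's API)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived purely from the content.
        uri = "hash://sha256/" + hashlib.sha256(data).hexdigest()
        # Storing identical bytes twice lands on the same key, so it is a no-op.
        self._blobs[uri] = data
        return uri

    def get(self, uri: str) -> bytes:
        return self._blobs[uri]


def make_meta_file(target_uri: str, fields: dict) -> bytes:
    # A "meta-file" is an ordinary file whose content describes another file.
    # The JSON layout here is an assumption, not a real format.
    record = {"target": target_uri, "fields": fields}
    return json.dumps(record, sort_keys=True).encode("utf-8")


store = ContentStore()
file_uri = store.put(b"hello world\n")
meta_uri = store.put(make_meta_file(file_uri, {"title": "Greeting"}))
```

Note that the store itself knows nothing about meta-data: interpreting the meta-file and indexing its fields is a separate, database-like job layered on top.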

This split personality ended up majorly over-complicating everything. Per-file meta-data ended up being a per-file eventually consistent database, with more complexity and less generality than a single large database would have had. Syncing was especially confusing, because the meta-data can be used to decide which files to sync, so you basically end up needing two separate sync algorithms.
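One way to picture the two-algorithm problem is a sync that runs in two passes: first replicate the meta-data, then query it to choose which files to fetch. The sketch below is a simplified assumption about the shape of the problem, not how StrongLink actually syncs. The function `two_pass_sync`, the dictionary layout, the `wanted` predicate, and the placeholder URIs are all hypothetical.

```python
def two_pass_sync(remote_meta, remote_files, local_meta, local_files, wanted):
    """Sync meta-data first, then only the files that the meta-data selects.

    remote_meta / local_meta map a file URI to a dict of meta-data fields.
    remote_files / local_files map a file URI to the file's bytes.
    wanted is a predicate over a meta-data dict.
    """
    # Pass 1: a database-style sync. Copy every meta-data record we lack.
    for uri, fields in remote_meta.items():
        local_meta.setdefault(uri, fields)

    # Pass 2: a storage-style sync. Fetch only the files the meta-data selects.
    fetched = []
    for uri, fields in local_meta.items():
        if wanted(fields) and uri not in local_files and uri in remote_files:
            local_files[uri] = remote_files[uri]
            fetched.append(uri)
    return fetched


# Placeholder URIs, shortened for readability.
remote_meta = {
    "hash://sha256/aaa": {"tag": "photos"},
    "hash://sha256/bbb": {"tag": "notes"},
}
remote_files = {
    "hash://sha256/aaa": b"jpeg bytes",
    "hash://sha256/bbb": b"text bytes",
}
local_meta, local_files = {}, {}
fetched = two_pass_sync(
    remote_meta, remote_files, local_meta, local_files,
    wanted=lambda fields: fields["tag"] == "notes",
)
```

Even in this toy form the two passes have different shapes: the first must replicate everything so queries work, while the second is selective and moves bulk data.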

There are other differences and tradeoffs between content-addressable storage and eventually consistent databases. Content-addressable storage deals in files, which are likely to be large and mostly or entirely redundant, even during normal use. An eventually consistent database deals in commits or transactions, which are more likely to be small and rarely if ever redundant, except when healing after a network partition. A database needs to parse transactions into indexes; a storage system may need to break files into chunks (although I still maintain this is bad if your users rely on your hashes as part of your public interface, especially if they must be compatible between different storage systems).
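The chunking concern can be shown with a small sketch. If a storage system hashes chunks and then derives the file's identifier from the chunk hashes, the result differs from a plain hash of the whole file, and it also changes with the chunk size. The `chunked_root_hash` scheme below is a made-up, simplified stand-in for chunk-based addressing, not the scheme of any particular system.

```python
import hashlib


def whole_file_hash(data: bytes) -> str:
    # The identifier any tool can reproduce: a SHA-256 of the file's bytes.
    return hashlib.sha256(data).hexdigest()


def chunked_root_hash(data: bytes, chunk_size: int) -> str:
    # Hypothetical chunked scheme: hash each fixed-size chunk,
    # then hash the concatenation of the chunk digests.
    digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(digests)).hexdigest()


data = b"x" * 10_000
flat = whole_file_hash(data)
root_4k = chunked_root_hash(data, 4096)
root_8k = chunked_root_hash(data, 8192)

# Same bytes, three different identifiers.
print(flat != root_4k, root_4k != root_8k)
```

Two systems that disagree on chunk size, or on whether to chunk at all, will assign different addresses to the same file, which is the compatibility problem if those addresses are part of the public interface.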

If you want an eventually consistent database, building it on top of a general-purpose content-addressing system might not be the best fit, unfortunately. And if you want a pure content-addressing system, especially for performance, an eventually consistent database will be both overkill and slow.

It’s quite possible that this split simply mirrors the traditional dichotomy between file systems and databases. One is fast, dumb, and wide; the other is slow, clever, and deep. Don’t mix them up like WinFS did.

Keywords: tradeoffs