For a long time I have put off thinking about mutability in content addressing systems. Thankfully, Jesse Weinstein pointed out that talking up flaws in other systems without presenting my own alternative was a little unfair.
So I’ve spent some time putting together an initial but hopefully comprehensive overview of mutability in a distributed/decentralized content addressing system.
My original plan was to have applications submit individual changes (commits), and then synthesize groups of changes into snapshots of files, like the working tree in Git. Unfortunately, after Jesse’s comments I quickly realized this wouldn’t actually work: StrongLink would provide addressing of the commits, but there would be no ability to resolve file hashes (without relying on some other application).
After more thought, I eventually came up with a list of possible approaches:
- Centralized, like the Web today
- Blockchains (e.g. Namecoin)
- Independent immutable files (commits, as above)
- Last write wins (including public-key addressing)
- Native diffs
I don’t think centrally controlled mutability will ever go away, but obviously I’m trying to do something different.
Blockchains and the like are quite interesting, but I want something that works offline (meaning AP in CAP-theorem terms: available and partition-tolerant).
The last three options are where it gets interesting.
As mentioned, using one file per “edit” is very easy and obvious. Each file can have arbitrary meta-data for querying. This is still the best option for things that don’t really have/need a strict identity, like a forum thread or instant messaging history (for posts and messages respectively). Each file has a useful content address (e.g. for replies).
The next option, which currently seems quite popular, is public-key addressing. However, I see this as a subset of last-write-wins. The basic idea is that you query the system for a particular identifier (for example a public key, or just a filename), and then you use the most recent matching result. To my understanding, this is not just how IPNS works currently, but how it’s intended to work even once it’s complete.
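To make the idea concrete, here is a minimal sketch of last-write-wins resolution in Python. It is not StrongLink's or IPNS's actual API: the `Record` type, its `name`, `signer`, and `timestamp` fields, and the `resolve_latest` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Record:
    content_hash: str  # address of the immutable raw file
    name: str          # hypothetical meta-data field
    signer: str        # hypothetical meta-data field
    timestamp: int     # used only to order writes


def resolve_latest(records: Iterable[Record],
                   matches: Callable[[Record], bool]) -> Optional[Record]:
    """Return the most recent record that satisfies the query, if any."""
    candidates = [r for r in records if matches(r)]
    return max(candidates, key=lambda r: r.timestamp, default=None)


records = [
    Record("hash-a", "notes.txt", "alice", 1),
    Record("hash-b", "notes.txt", "alice", 3),
    Record("hash-c", "notes.txt", "bob", 5),
]

# "The latest file named notes.txt and signed by alice"
latest = resolve_latest(
    records, lambda r: r.name == "notes.txt" and r.signer == "alice")
print(latest.content_hash)  # hash-b
```

Public-key addressing is the special case where the query is just "signed by this key".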
In theory, public-key addressing should prevent conflicting writes. However, for a user with multiple devices (e.g. laptop and phone), there is still a race condition between their updates (especially if the laptop is asleep, or the devices can’t sync for some other reason). Because the storage system has no deeper understanding of what the changes mean, there is no way for it to resolve conflicts.
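The race can be shown with a small, self-contained Python sketch. The `Write` type and `last_write_wins` function are hypothetical, and real systems may order writes by something other than a plain timestamp.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Write:
    device: str
    base: str       # version the device had seen when it made the edit
    content: str
    timestamp: int


def last_write_wins(writes: List[Write]) -> Write:
    """Pick the most recent write; everything else is simply shadowed."""
    return max(writes, key=lambda w: w.timestamp)


# Both devices start from the same version and edit while out of sync.
laptop = Write("laptop", base="v1",
               content="v1 + paragraph from laptop", timestamp=10)
phone = Write("phone", base="v1",
              content="v1 + typo fix from phone", timestamp=12)

winner = last_write_wins([laptop, phone])
print(winner.device)               # phone
print("laptop" in winner.content)  # False
```

Both writes are validly signed by the same user, yet the laptop's paragraph is missing from the "current" version. It survives only in the history.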
There’s another problem with this approach: meta-data. If meta-data is assigned to each raw file, then when a mutable file is “renamed” (pointed at a different raw file), the meta-data stays behind. In theory it’s possible to attach meta-data directly to the mutable handle, but this adds complexity and might be difficult in a distributed setting (exercise for the reader).
For the record, StrongLink has pretty good support for last-write-wins addressing right now. In fact, it’s quite powerful because it works with arbitrary queries. You can ask for “the latest file named X and signed by Y” or whatever you want. There are two limitations currently: 1. StrongLink doesn’t yet verify digital signatures itself (known issue), and 2. there is no URI format/protocol for actually addressing files in this way.
Actually, there’s a third problem: StrongLink’s queries might be too powerful, making it hard for other systems to use the same address format. But it would be straightforward to define a portable subset. Even /ipns/ address resolution might be possible (it depends on how IPFS computes them, which I haven’t looked into).
There’s one other concern with public-key addressing: it gives the person with the key a lot of power over the links. Less than the traditional Web gives to each server operator, but even so, it’s a very similar model. Think of it like a directional channel between two people. This is probably good for “websites” representing corporations or individuals, but for “documents” it doesn’t seem that great. (StrongLink obviously keeps a full history of writes, as every system supporting this model should.)
Also… when you store each write as a separate, immutable, “raw” file, you still need some kind of deduping or diffing to prevent large amounts of space being wasted. StrongLink doesn’t currently have this, and other systems which do block deduplication achieve it at great cost (IMHO).
Okay, finally, onto the last option.
I call it “native diffing” (as opposed to the application-level diffing described initially, or the implicit diffing that might be part of last-write-wins compression). This is how Git really works: it has its own diff engine(s) (and accepts plugins), and has basically a full understanding of how changes are made and how to merge them.
However, there are some problems:
- Complex file formats (mainly binaries) require complex diff plugins
- The system cannot always resolve merge conflicts
- Files must be “assembled” to be addressed or loaded
Wouldn’t it be nice if we could address all of these problems in a “progressive enhancement” way?
Well, in that case, the best way to handle fully mutable documents in StrongLink (including decentralized collaboration) is by… storing diffs as meta-data. Pretty simple, right?
StrongLink already stores meta-data as a flexible, recursive CRDT. So the application itself can represent its edits however necessary for that file format. And StrongLink doesn’t stop you from getting conflict-free merges, if possible. If there are conflicts, the application is already there to (help) resolve them.
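As a rough illustration of why this merges cleanly, here is a Python sketch that models the meta-data as a grow-only set of edits attached to a file's original hash. This is an assumption made for the example, not StrongLink's actual meta-data format; `new_store`, `add_edit`, and `merge` are hypothetical names, and the patch strings stand in for whatever representation the application chooses.

```python
from collections import defaultdict


def new_store():
    # Maps a file's original hash to the set of edits attached to it.
    return defaultdict(set)


def add_edit(store, target_hash, edit_id, patch):
    # Edits are only ever added, never changed or removed.
    store[target_hash].add((edit_id, patch))


def merge(a, b):
    # Set union is commutative, associative and idempotent, so replicas
    # converge no matter what order they sync in.
    merged = new_store()
    for store in (a, b):
        for target, edits in store.items():
            merged[target] |= edits
    return merged


laptop, phone = new_store(), new_store()
add_edit(laptop, "hash-of-original", "laptop-1", "insert 'Hello' at 0")
add_edit(phone, "hash-of-original", "phone-1", "insert 'World' at 0")

assert merge(laptop, phone) == merge(phone, laptop)
print(len(merge(laptop, phone)["hash-of-original"]))  # 2
```

The storage layer never loses an edit. Deciding what two inserts at the same position mean for the final document is left to the application, which understands the file format.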
Over time, if a particular change representation became popular (say just diff/patch for text files), support could be added directly to StrongLink as a module (or even as a reverse proxy).
The reason I find this truly compelling is the following question: what address should a mutable file be known by?
- In the case of file composition, the answer is there is no address (which makes sense because the particular set of components can be changed with queries)
- In the case of last-write-wins, the address is some query (such as public key, which is effectively random)
- In the case of meta-data-based mutability, the answer is the file’s original hash
To me it doesn’t seem like there’s a better answer. The fact is that mutable files can be changed in unpredictable ways, so it’s impossible to choose a name up front that will always reflect its current content. But at least the file’s original content is still unique and meaningful.
There’s still another question: if you start a blank new file that you intend to edit, what should it be called? I believe the answer is “it’s up to the user.” A mutable file’s initial content (and thus, address) will determine whose decentralized edits will be able to merge with it.
- If your file is conceptually unique (or private), choose a random string
- If you want to collaborate with a small group of people, you should mutually agree on a “file name”
- If you want to share changes with an unknown group or everyone, choose a short and easily guessable name
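The three cases above can be sketched in a few lines of Python. The `address_of` helper and the example names are hypothetical, and the `hash://sha256/` prefix is only illustrative of a content address. The point is simply that the address is derived from the initial content.

```python
import hashlib
import secrets


def address_of(initial_content: bytes) -> str:
    """Derive a mutable file's address from its initial content."""
    return "hash://sha256/" + hashlib.sha256(initial_content).hexdigest()


# Unique or private file: random initial content, so nobody else
# will arrive at the same address by accident.
private = address_of(secrets.token_bytes(32))

# Agreed-upon or guessable name: two disconnected nodes that start
# from the same content get the same address, so their edits can merge.
node_a = address_of(b"shared-shopping-list")
node_b = address_of(b"shared-shopping-list")

print(node_a == node_b)  # True
```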
The ultimate test of this would be a real-time collaborative editor, like EtherPad or Google Docs. I believe a fully decentralized EtherPad clone based on StrongLink is entirely possible using this approach.
Now for a disclaimer: StrongLink’s meta-data support is still incomplete. For now, mutable meta-data is basically faked client-side, meaning performance slows down the more edits a file has. Fixing this is already a high priority, and it’s straightforward enough that complications are unlikely to arise.
And to address Jesse’s implicit question: how is this different from Camlistore specifically? It differs in two ways: 1. the address of a mutable file is the original hash of the file, and 2. based on the file’s content, “useful” collisions are likely to occur (or easy to create), allowing two disconnected nodes to add a file and even begin editing it without duplication.
I think it’s clear that StrongLink’s design is fairly complex. There are many different ways to achieve any particular goal, and those differences will have ramifications for mutability, querying, and other attributes. Hopefully this article is enough to show that the design is sufficiently powerful to handle (ahem) unanticipated needs, in a way that is still possible to reason through and makes sense in retrospect.