(We now return to your regularly scheduled programming.)
If you’re going to have decentralized web archiving, a big problem is trust. Transport layer security lets you retrieve a page securely, but the response isn’t signed in a way that lets you show the page to someone else and prove you got it from the original server. That means only a reputable organization like the Internet Archive can usefully host snapshots of websites.
The goal is for archiving and mirroring to be possible by random people, without anyone needing to trust them, which I call trustless decentralized archiving. The general principle is very similar to reproducible builds: even if no one can individually prove their archive is valid, as long as anyone can independently reproduce it, it’s possible to reach a certain level of confidence.
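As a sketch of that principle: if independent archivers normalize a snapshot the same way, matching content hashes corroborate each other, much as matching binaries corroborate a reproducible build. Assuming some normalization step already exists (the hard part, discussed below), the verification side could be as simple as:

```python
import hashlib

def archive_digest(snapshot: bytes) -> str:
    """Hash a normalized snapshot so independent copies can be compared."""
    return hashlib.sha256(snapshot).hexdigest()

# Stand-ins for two independently fetched and normalized copies of a page;
# a real archiver would produce these from live fetches.
snapshot_a = b"<article>Hello, archive!</article>"
snapshot_b = b"<article>Hello, archive!</article>"

# Matching digests mean each archive corroborates the other, without
# either archiver needing to be trusted individually.
print(archive_digest(snapshot_a) == archive_digest(snapshot_b))  # -> True
```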
The problem is that modern web pages are fundamentally irreproducible: two fetches of the same URL rarely return byte-identical HTML. And we can’t expect site owners to change their pages to help us (if we could, we wouldn’t be in this mess). We need a reliable, repeatable way to strip out the random, varying, and irrelevant parts (non-content) that is simple enough for people to agree on and works on almost all sites.
The first idea was to use Readability, a set of heuristics for extracting the primary content of a web page. This is exactly what we want, but heuristics are never perfect. It has the same problem as my idea of semantic hashing[#], namely baking too much logic into fixed data: if the Readability function changes, you can no longer verify old content (unless you carefully maintain backward compatibility forever).
The second idea was to use RSS feeds. RSS typically contains just the page content, and it’s usually more static and constrained than free-form HTML. Unfortunately, feeds are time-based, and I don’t think most of them give you a way to go back and look at expired entries. It also doesn’t work for the sites, probably the majority, that don’t offer RSS at all.
The third idea, the first one I find really promising, is similar to using Readability, but it decouples the heuristics from the output data. The idea is to store whatever additional information is needed to reproduce the extraction in the output itself. In the simplest form this might be a CSS selector. The selector could be chosen however you like (by Readability or even manually), but once you have it, anyone can reproduce the relevant section of the page automatically.
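To make this concrete, here’s a minimal sketch of selector-driven extraction using only Python’s stdlib HTML parser. It handles just `#id` selectors; a real implementation would use a full CSS selector engine, but the point is that the extraction is mechanical once the selector is known:

```python
from html.parser import HTMLParser

class IdExtractor(HTMLParser):
    """Minimal stand-in for a CSS selector engine: extracts the subtree
    of the element whose id matches. This sketch only supports `#id`."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # nesting depth inside the matched element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.parts.append(f"</{tag}>")
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def extract(html, selector):
    """Reproduce the relevant section of a page from a stored selector."""
    assert selector.startswith("#"), "sketch only handles #id selectors"
    parser = IdExtractor(selector[1:])
    parser.feed(html)
    return "".join(parser.parts)

page = ('<html><body><div id="nav">menu</div>'
        '<article id="content"><p>The post.</p></article></body></html>')
print(extract(page, "#content"))
# -> <article id="content"><p>The post.</p></article>
```

Anyone re-running `extract` on a fresh fetch with the same stored selector should get the same content, so its hash can be checked against an existing archive.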
For an HTML file, you could just put the selector in a comment at the top. This would still effectively be a new file format, but it seems simple enough that people should be able to agree on it, or something similar.
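As an illustration of what that comment could look like (the `archive-selector` name here is made up for the sketch, not an agreed-on format):

```python
import re

# Hypothetical convention: the first line of an archived HTML file names
# the selector that was used to produce it.
SELECTOR_RE = re.compile(r"^<!--\s*archive-selector:\s*(.+?)\s*-->")

def wrap(extracted_html, selector):
    """Prepend the selector used, so anyone can re-run the extraction."""
    return f"<!-- archive-selector: {selector} -->\n{extracted_html}"

def read_selector(archived):
    """Recover the selector from an archived file, or None if absent."""
    match = SELECTOR_RE.match(archived)
    return match.group(1) if match else None

archived = wrap('<article id="content"><p>The post.</p></article>', "#content")
print(read_selector(archived))  # -> #content
```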
I’m going to be in San Francisco in a couple of weeks and I’m looking forward to talking to the Internet Archive about this idea.
Keywords: web archiving, trustless decentralized archiving