
2017-07-27

The difficulty of content addressing on the web

Right now when you look up a content address on the Hash Archive, it returns a web page with a bunch of links to places where that hash has been found in the past. At that point, you have to manually try the links and verify the response hashes until you get one that works. Wouldn’t it be nice if this process could be automated in the browser?

Unfortunately, the original Subresource Integrity draft got scaled way back by the time it was standardized. Basically, modern browsers can only verify the hashes of scripts and style sheets. They can’t verify the hashes of embedded images, iframes, or links, as cool as that would be.

I think my own ideal solution here would be to support “resource integrity” on HTTP 30x redirects. For example, you try to look up a hash, and the server responds with 302 Found. The headers include the standard Location field, which points to some random URL out on the web (potentially from an untrusted server), plus a new Hash field specifying the expected hash of the response.

The browser follows the redirect, but before presenting the content to the user (inline, as a download, or whatever), it verifies the hash. If the hash is different from what was expected, the redirect fails.
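
Here’s a rough sketch (in TypeScript) of the check a browser would make after reading the Location and the proposed Hash field from the 302. The field name and the hex-encoded SHA-256 format are just placeholders for illustration; nothing like this exists today.

    // Hypothetical: verify that the body fetched from a redirect target matches the
    // hash promised by the (made-up) Hash field. Works anywhere fetch and
    // SubtleCrypto are available (browsers on HTTPS, or recent Node).
    async function fetchAndVerify(location: string, expectedSha256Hex: string): Promise<ArrayBuffer> {
        const res = await fetch(location);
        if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
        const body = await res.arrayBuffer();

        const digest = await crypto.subtle.digest("SHA-256", body);
        const actualHex = Array.from(new Uint8Array(digest))
            .map((b) => b.toString(16).padStart(2, "0"))
            .join("");

        if (actualHex !== expectedSha256Hex.toLowerCase()) {
            throw new Error("integrity check failed: response does not match the Hash field");
        }
        return body; // only now would the content be presented to the user
    }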

Now if you’ll allow me to really dream, imagine if this worked with the (basically vestigial) 300 Multiple Choices response status. The server could provide a list of URLs that are all supposed to contain the same resource, and the client would try each one in turn until the hash matches. In the case of the Hash Archive, which can’t be certain about what 3rd party servers are doing, this would make resolving hashes more reliable.
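
The 300 case is then just a loop over the candidates, reusing the fetchAndVerify sketch above. How the candidate URLs would actually be encoded in the 300 response body is left open here.

    // Hypothetical: given candidate URLs from a 300 Multiple Choices response, return
    // the first body whose hash matches. Uses fetchAndVerify from the sketch above.
    async function resolveFirstMatch(candidates: string[], expectedSha256Hex: string): Promise<ArrayBuffer> {
        for (const url of candidates) {
            try {
                return await fetchAndVerify(url, expectedSha256Hex);
            } catch {
                // dead link, wrong hash, whatever -- try the next mirror
            }
        }
        throw new Error("no candidate matched the expected hash");
    }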

Okay, great. What about alternatives and workarounds that work right now?

Option 1: Server-side hash validation

With this idea, instead of a redirect, the server loads the resource itself, verifies it, and then proxies it to the client. Obviously proxying data is unappealing in and of itself, but the bigger problem is that the entire resource needs to be buffered on the server before any of it can be sent to the client. That’s because HTTP has no “undo” function. Anything that gets sent to the client is trusted, and there’s no way for a server to say, “oh shit, the last 99% of that file I just sent you was wrong.” Closing the connection doesn’t help because browsers happily render broken content. Chunked/tree hashing doesn’t help because you’ve still sent an incomplete file.

Buffering is completely nonviable for large files, because it means the server has to download the file before it can respond, during which the client will probably time out after ~60 seconds. It’s difficult (if not impossible) for the server to stall for time because HTTP wasn’t designed for that.

That said, if you limit this technique to small files (say less than 1-4MB) it should actually work pretty well. It’s also nice because you can try multiple sources on the server and pass along the first one to download and verify. For any solution that requires proxying, you will probably want to have a file size limit anyway.
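
Here’s a rough sketch of what that proxy could look like (TypeScript on Node 18+). The endpoint shape, the 4MB cap, and the error handling are all made up for illustration; this isn’t what the Hash Archive actually does.

    // Sketch of a buffering, validating proxy. The whole body is buffered and hashed
    // before a single byte goes to the client, because HTTP has no "undo".
    import { createServer } from "node:http";
    import { createHash } from "node:crypto";

    const MAX_BYTES = 4 * 1024 * 1024; // refuse to proxy anything bigger than ~4MB

    createServer(async (req, res) => {
        const { searchParams } = new URL(req.url ?? "/", "http://localhost");
        const target = searchParams.get("url");
        const expected = searchParams.get("sha256");
        if (!target || !expected) {
            res.writeHead(400).end("need url and sha256 query parameters");
            return;
        }
        try {
            const upstream = await fetch(target);
            if (!upstream.ok || !upstream.body) throw new Error(`upstream returned ${upstream.status}`);

            // Buffer the upstream body, bailing out early if it gets too big.
            const chunks: Uint8Array[] = [];
            let total = 0;
            for await (const chunk of upstream.body) {
                total += chunk.length;
                if (total > MAX_BYTES) throw new Error("too large to buffer");
                chunks.push(chunk);
            }
            const body = Buffer.concat(chunks);

            const actual = createHash("sha256").update(body).digest("hex");
            if (actual !== expected.toLowerCase()) throw new Error("hash mismatch");

            // Only now is it safe to start sending bytes to the client.
            res.writeHead(200, {
                "Content-Type": upstream.headers.get("content-type") ?? "application/octet-stream",
                "Content-Length": String(body.length),
            });
            res.end(body);
        } catch (err) {
            res.writeHead(502).end(`validation failed: ${(err as Error).message}`);
        }
    }).listen(8080);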

BTW, this approach also can’t catch any errors introduced between the server and the client after validation.

Option 2: Client-side validation with JavaScript

One version of this idea was proposed by Jesse Weinstein. Another version, using Service Workers, was proposed by Substack.

The idea: respond with a full web page that includes a hash validation script, download the file via AJAX, hash it, and write it out via a data: URI. This is pretty much how the file hosting service Mega’s client-side encryption works, except with hashing instead of encryption.
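
A sketch of that flow (TypeScript, browser side), assuming the file is fetched through a same-origin proxy path to dodge the Same Origin Policy, and that SubtleCrypto is available:

    // Hypothetical: fetch the file via a same-origin proxy, verify its hash, and only
    // then hand it to the user as a data: URI download.
    async function downloadVerified(proxyUrl: string, expectedSha256Hex: string, filename: string): Promise<void> {
        const res = await fetch(proxyUrl);
        if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
        const buf = await res.arrayBuffer(); // the whole file is buffered in memory

        const digest = await crypto.subtle.digest("SHA-256", buf);
        const hex = Array.from(new Uint8Array(digest))
            .map((b) => b.toString(16).padStart(2, "0"))
            .join("");
        if (hex !== expectedSha256Hex.toLowerCase()) {
            throw new Error("hash mismatch; refusing to present the content");
        }

        // Encode the verified bytes as a data: URI (a blob: URL would also work).
        const dataUri = await new Promise<string>((resolve, reject) => {
            const reader = new FileReader();
            reader.onload = () => resolve(reader.result as string);
            reader.onerror = () => reject(reader.error);
            reader.readAsDataURL(new Blob([buf]));
        });

        const a = document.createElement("a");
        a.href = dataUri;
        a.download = filename;
        a.click();
    }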

Mega actually works pretty well, but it’s purely for downloads. When you start thinking about embedded content, it becomes messier and messier to the point of not being worth it, IMHO. The thought of telling users to copy and paste JavaScript just to hotlink images gives me the heebie-jeebies.

This still involves buffering the data in memory (client side), which puts a cap on the file size, and proxying content through my server (to dodge the Same Origin Policy), which implies another cap and makes it less appealing.

Basically I don’t think this approach would be in good taste.

Option 3: Browser extension!

No, just no. I will write a mobile app before I pin my hopes on a browser extension. Plus, the Chrome extension API is extremely limited and probably doesn’t even let you do this.

Conclusion:

I’m ready to throw in the towel, at least for now. Server-side validation might be good for a party trick at some point. Trying to make content addressing on the web work (well) without the support of at least one major browser vendor doesn’t seem feasible.

I think problems like this have interesting implications for the web, the current approach to sandboxing, and solutionism in general.

P.S. Thanks to the great Archive Labs folks who discussed all these ideas with me at length!