notes.public

2017-05-13

I’ve been thinking about a couple of ideas for a modern filesystem.

The first idea is to use log structuring to improve application correctness and performance. Common filesystems today typically reorder writes using what is known as an elevator algorithm: instead of writing pending blocks in chronological order, the drive sweeps through them from lowest address to highest, then from highest back to lowest. On a mechanical drive that minimizes seek times, and even on SSDs it's not completely worthless.
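The sweep can be sketched in a few lines of Python. This is a toy model of one common variant (starting the sweep from the current head position), not how any real I/O scheduler is written:

```python
def elevator_order(pending, head):
    """Order pending block writes as an elevator sweep:
    first ascending from the head position, then descending
    back through the blocks behind it."""
    ahead = sorted(b for b in pending if b >= head)
    behind = sorted((b for b in pending if b < head), reverse=True)
    return ahead + behind

# Blocks 5 and 9 are serviced on the upward sweep,
# then 3 and 1 on the way back down.
print(elevator_order([5, 1, 9, 3], head=4))  # → [5, 9, 3, 1]
```

Note that chronological order is discarded entirely; the scheduler only cares about block addresses.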

The problem with write reordering is that it makes fast applications incorrect, and correct applications potentially slower than before. By reordering writes, the filesystem can unintentionally change application semantics, in particular across a kernel panic or power failure. If an application writes A and then B, you'd expect a hard crash to leave the disk in one of three states: nothing, A, or AB. With write reordering, however, the disk can end up containing B but not A. Hard crashes are rare, but when they happen this can easily cause data corruption.

In order to mitigate this problem, applications have to insert write barriers. The only easy and reliable write barrier is fsync(2) (or fdatasync(2)); you can also use hashes/checksums, but that requires changes to the file format and extra recovery logic. fsync forces a flush all the way to disk, so in trying to speed things up, you're actually doing more work and making the application slower.
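Concretely, the barrier pattern looks like this (a minimal sketch; the function name is mine):

```python
import os

def write_with_barrier(path_a, data_a, path_b, data_b):
    """Write A, force it to stable storage, then write B.
    The fsync is the write barrier: after it returns, a crash
    can never leave B on disk without A."""
    fd = os.open(path_a, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data_a)
        os.fsync(fd)  # barrier: A is durable past this point
    finally:
        os.close(fd)
    fd = os.open(path_b, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data_b)
    finally:
        os.close(fd)
```

The cost is that the application now stalls on a full flush to the disk between the two writes, even though all it actually wanted was ordering.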

Log structured filesystems can solve this problem. They can guarantee ordering of all writes without write barriers, while simultaneously being even more efficient than the elevator algorithm, because they put all writes together into a contiguous log. Then you need some sort of compaction, but that seems like a worthwhile tradeoff (for reasons I won’t bother explaining again). The side benefit of this design is that it makes LSM-trees in userspace (such as LevelDB) redundant. Any ordinary b-tree stored on this filesystem, including any off-the-shelf database, effectively becomes an LSM-tree with no modification.
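A toy model of the idea, with an in-memory list standing in for the contiguous on-disk log (all names are mine, purely illustrative):

```python
class LogFS:
    """Toy log-structured store: every write is appended in
    chronological order, so the surviving prefix after a crash is
    always a chronological prefix -- ordering comes for free."""

    def __init__(self):
        self.log = []  # stands in for the contiguous on-disk log

    def write(self, key, value):
        self.log.append((key, value))

    def read(self, key):
        # newest entry for a key wins
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

    def compact(self):
        # drop superseded entries, keeping only the newest per key
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = list(latest.items())
```

Writes never seek: they land at the tail of the log, which is exactly the access pattern the elevator algorithm is trying to approximate, and compaction reclaims the space that superseded entries leave behind.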

I see log-structured filesystems as a generational leap ahead of copy-on-write filesystems like ZFS and btrfs.

The second idea is implementing directories as recursive mounts, for simplicity, reliability and security. Microkernels have always held great promise, but one of their major limitations is that their services (such as filesystems) are still typically monolithic. For example, you have a user process that sends messages through the kernel to a filesystem process. All applications that share the same filesystem talk to the same process, and any of them can trigger bugs or exploits that affect all the others.

Also, if you want to do sandboxing right, you need to put any filesystem that contains untrusted data inside the sandbox. The main way to do that today is recursive mounts: an outer (trusted) filesystem containing images of inner (untrusted) filesystems. However, that's needlessly complicated and a hassle to deal with, since none of the standard tools understand filesystem images the way they understand regular directories.

Instead, you could build a purely flat filesystem (no hierarchy) that didn’t support directories at all, and run copies of it within its own files to act as directories. First of all, this would vastly simplify your filesystem design, making it more robust. Second, you could easily run the inner filesystems in different processes/sandboxes, giving strong isolation between different branches of the hierarchy.
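A sketch of what that might look like, with nested Python objects standing in for per-sandbox filesystem processes (all names hypothetical):

```python
class FlatFS:
    """Toy flat filesystem: one namespace of files, no directories.
    A 'directory' is just a file whose contents are another FlatFS
    instance, which could run in its own process or sandbox."""

    def __init__(self):
        self.files = {}  # name -> bytes, or a nested FlatFS

    def write(self, name, data):
        self.files[name] = data

    def mkdir(self, name):
        # a directory is a recursive mount of another flat filesystem
        self.files[name] = FlatFS()
        return self.files[name]

    def resolve(self, path):
        # walk path components, descending into nested filesystems
        node = self
        for part in path.split("/"):
            node = node.files[part]
        return node
```

Each FlatFS only ever sees a flat list of names, so the core design stays trivial, and the process boundary between instances is what provides the isolation.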

There are several finer points to get right, but no dealbreakers that I'm aware of. First, you need to design the API carefully so that deeply nested filesystem operations don't lose much efficiency. In particular, the log (as above) should be kept only once, at the top layer, instead of repeated within each filesystem. You also need to ensure that data deleted within a filesystem is reported to the parent (similar to TRIM for SSDs). These are all problems that affect loop mounts and VMs today, and are already mitigated to some extent.
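The delete-reporting point might look something like this toy sketch (a hypothetical API, loosely analogous to TRIM; none of these names come from any real system):

```python
class ParentFS:
    """Owns the single top-level log; children report freed space."""

    def __init__(self):
        self.freed = 0

    def discard(self, child_name, nbytes):
        # reclaim space in the top-level log on behalf of a child
        self.freed += nbytes


class ChildFS:
    """Nested filesystem that propagates deletions upward."""

    def __init__(self, parent, name):
        self.parent, self.name = parent, name
        self.files = {}

    def delete(self, fname):
        data = self.files.pop(fname)
        # report the freed bytes to the parent, TRIM-style, so the
        # parent's compactor knows this region is garbage
        self.parent.discard(self.name, len(data))
```

Without this upward reporting, the parent's log would keep dead data from inner filesystems alive forever, which is exactly the problem loop mounts have today.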

The other potential pitfall is opening the same directory within multiple processes/sandboxes at once. But serverless databases like LMDB and SQLite can already handle that.

As an example of the security benefit, say that the user downloads a malicious file that exploits a Unicode filename or metadata bug in the filesystem (although really, the filesystem probably shouldn’t be parsing Unicode in the first place). In this design, the damage is contained to the download directory, rather than being able to spread throughout the filesystem regardless of permissions.

Of course, in a monolithic kernel, or simply for efficiency, you can run multiple instances in the same process (or directly in the kernel), although then you lose most of the security benefits.

In conclusion, there's plenty of room for improvement.