What’s wrong with seccomp?
seccomp-bpf is basically a system call firewall, which is exactly what I called for in the aforementioned post. It’s widely used in Chrome, Firefox, Firejail (a general-purpose application sandbox, like what I wanted), and elsewhere.
I think the problem with it is that it is (so far as I know) always used with application-specific rules. Obviously its use in Chrome and Firefox is tuned for those programs, and Firejail uses application-specific profiles to decide which system calls to allow.
This has three negative effects:
- Large/complex/powerful applications that use a lot of syscalls get less protection
- The allowed syscalls are determined primarily or entirely by what the application needs or wants to use, rather than by which calls are most likely to be secure
- The syscall profile is less tested because it is unique and changes along with each app
So, what is the alternative?
Well, as hinted at before, start by locking down all of the system calls as much as possible. Then build a “syscall emulator” that reimplements all of the blocked syscalls in terms of the few allowed ones.
So, for example, open(2) might be blocked. Instead, the sandbox opens a single file at startup, which the emulator treats as a block device and runs its own (sandboxed) file system inside of it. This is obviously a heavy-handed approach, and it might be more convenient to allow native filesystem access, but at least it would be very secure and protect against almost all filesystem bugs (except those that could be triggered by reading or writing a single file).
Browsers sandbox network access by tunneling traffic through special protocols (WebSockets, WebRTC). However, all of these APIs have high overhead, which prevents things like the implementation of DHT in WebTorrent. A sandbox could reduce this overhead by allowing raw UDP, except with a special header. This header would identify the payload as untrusted and prevent it from interfering with applications that didn’t expect it. (Of course you’d still need other restrictions too, to prevent denial-of-service attacks.)
How do you decide which syscalls to allow? In a CPU sandbox like NaCl, there are three basic concerns:
- Turing completeness
- Efficiency
- Ease of validation
The bare minimum (for most purposes) is a Turing-complete subset. However, you will probably need to choose/add some instructions for efficiency, too (MOV-only programs are very slow). At that point, the overriding concern becomes how easy it is to create a correct validator.
For other types of sandboxes, like system calls, the concept of Turing completeness doesn’t apply. I like to generalize it to what I call “hardware completeness,” for lack of a better term. That simply means that all of the features of the hardware (disk storage, networking, camera, mic, USB) should be possible to expose to sandboxed applications.
If any bugs in the underlying platform are discovered which affect the security of the sandbox, the sandbox can (hopefully) be changed to a different or more restrictive feature subset, without impacting any (correct, non-malicious) sandboxed programs. Of course if the platform is too broken then eventually sandboxing simply becomes impossible.
In the concrete case of emulating system calls on Linux, there is a problem for software like Go and libuv, which perform system calls directly. These calls might be hard to intercept efficiently without slowing down the whole program. Perhaps the simplest approach is to treat the sandbox as its own non-Linux platform, and require that applications for it follow its own syscall ABI. In other words, add special back-ends for libc, Go and libuv. It might even be possible to support applications targeting a different platform, like Windows, as long as there’s a suitable place to add a shim.
In conclusion, I think this is the first “complete” sketch of a sandbox that is efficient, feasible to build, maximally secure, and potentially able to run unmodified applications.