
notes.public


2017-01-28

Database servers versus embedded databases

To most programmers today, a database is something you connect to. I want to try to explain why that kind of sucks, and what the alternative is.

Common databases like PostgreSQL, MySQL, MongoDB, and others are all “database servers.” That means the database is controlled by some daemon process that you talk to indirectly, usually over Unix domain sockets or TCP.

There is an alternate world of databases, known as embedded databases (or database libraries), which run in your application process and which you talk to using plain old function calls. The best known of these is of course SQLite, but there are also key-value stores like BerkeleyDB, LMDB and LevelDB, another SQL DB called Firebird, and a smattering of others (WiredTiger, Sophia, etc.).

The most important advantage of embedded databases is that they are application-local. By that I mean they don’t need to be installed or configured separately from the application. Often they are statically linked (for SQLite, you drop two C files into your build system), and they are configured directly from your code. Users don’t have to worry about database management: if your application is running, the DB is running. Multiple applications that want to use the same database system don’t conflict, because they don’t use any global resources (e.g. ports or configuration). Installing your application doesn’t permanently tie up memory with an always-running background DB.
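To make "application-local" concrete, here is a minimal sketch in Python, whose standard library bundles SQLite as the `sqlite3` module. Nothing is installed, configured, or started separately; the database is just a file (the `app.db` filename is my own placeholder):

```python
import sqlite3

# The whole "database setup" is opening a file. No daemon, no port,
# no credentials -- if this process is running, the DB is running.
conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()
print(conn.execute("SELECT body FROM notes").fetchone()[0])  # hello
conn.close()
```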

These advantages are most critical for native and mobile apps, but I think they’re compelling even for server software. I think the problem of DB server configuration and conflicts is part of the reason behind the push for things like Docker. I have nothing against Docker, but in a sense it’s just a static linker for processes, and now you have the additional challenge of maintaining persistent state inside of containers. If you want to use Docker that’s fine, but you shouldn’t be forced to in order to have isolation between DBs.

Embedded databases are also typically more secure, by virtue of having less attack surface. Just in the past couple weeks there has been a rash of “hacks” of thousands of misconfigured (by default) MongoDB, CouchDB and other servers. If your DB doesn’t even bind to TCP, it’s much harder to accidentally leave it exposed. Not to mention the inherent paradox (and hassle) of storing an unencrypted password with your application in order to connect to the DB.

What about performance? Well, remember the ongoing bus-wreck that is kdbus (now bus1)? The complaint is that D-Bus is slow because it’s not in the kernel. Your application (in userspace) has to switch to the kernel, then the kernel has to switch to the DB process. Then the same sequence in reverse to get the response back. If you are reading from the DB, it hopefully has its data cached, meaning these context switches are the biggest overhead involved. If you are writing, nothing should actually touch disk until the transaction commits, and each transaction probably has several statements. If you remember Jeff Dean’s “list of latency numbers every programmer should know,” context switches are ‘not good.’ An embedded DB is even faster than kdbus: it’s in user-space, in your process. No context switch, no syscall (cf. LMDB).

The cost of context switches is more insidious than it might seem. If you have a high fixed cost per round trip, the way to increase performance is bigger batching. However, that means you are forced to use larger and more complex queries. Whether you think pushing more logic into the DB is a good idea or not, being forced to do it for performance reasons is unpleasant. One example of when it can be annoying is when you want to filter a list of DB results client-side while also limiting how many you get back. Because some rows are discarded after the query runs, a plain LIMIT clause would return too few results. Depending on the database server and client library, that may be more painful or less. (A simple key-value store API becomes atrociously slow over a connection, because each operation is so small.)
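With an embedded DB this problem mostly disappears, because each row fetch is a plain function call rather than a round trip. A sketch of the pattern (the table, predicate, and cutoff of 10 are all invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items (payload) VALUES (?)",
                 [(f"item-{i}",) for i in range(1000)])

def too_complex_for_sql(payload):
    # Stand-in for a predicate you can't (or don't want to) express in SQL.
    return int(payload.split("-")[1]) % 3 == 0

# No LIMIT in the SQL: pull rows one at a time, filter in the application,
# and stop as soon as we have 10 matches. Each iteration is an in-process
# function call, so stopping early costs almost nothing.
matches = []
for (payload,) in conn.execute("SELECT payload FROM items ORDER BY id"):
    if too_complex_for_sql(payload):
        matches.append(payload)
        if len(matches) == 10:
            break
```

Over a network connection, the same incremental pattern would pay a round trip per row (or force you into LIMIT/OFFSET batching loops); in-process, the naive loop is the fast path.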

The traditional complaint about SQLite is “bad concurrency,” but I think that’s mostly due to confusion and bad defaults. For the sake of backwards compatibility, SQLite3 ships with WAL mode disabled. You should turn it on, always. You should also set a “busy timeout,” so that SQLite will transparently retry on the SQLITE_BUSY errors that make people think it doesn’t scale. (Also check the SQLite pragma docs, because there are a lot of neat settings in there.)
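Both settings are one-line pragmas. In Python this looks like the following (the `app.db` filename is a placeholder; note WAL requires an on-disk database, not `:memory:`):

```python
import sqlite3

conn = sqlite3.connect("app.db")
# WAL is a property of the database file itself and persists once set,
# so this only needs to succeed once per database.
conn.execute("PRAGMA journal_mode=WAL")
# busy_timeout is per-connection: on SQLITE_BUSY, retry for up to
# 5000 ms instead of failing immediately.
conn.execute("PRAGMA busy_timeout=5000")
```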

Even in WAL mode, SQLite can only handle a single concurrent writer. However, in my experience, the sheer overhead of SQL and the bad write performance of b-trees are much more important bottlenecks. Only once you’ve made your transactions as fast as possible should you start to worry about overlapping them, because the gains of overlapping them are fundamentally limited. These are problems that all popular databases have, whether embedded or server. There are embedded solutions for each of them (SQL performance: embedded key-value stores; b-tree write overhead: LSM-trees or fractal trees; concurrent writes: non-transactional write batches).
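“Make your transactions as fast as possible” usually starts with batching writes: one commit (and one durable sync) for many rows, instead of one per row. A sketch of the contrast, with an invented `events` table (absolute timings will vary by machine and filesystem, so none are asserted here):

```python
import sqlite3
import time

conn = sqlite3.connect("batch.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")

# Slow: one transaction -- and one durable commit -- per row.
start = time.perf_counter()
for i in range(100):
    with conn:  # opens a transaction, commits on exit
        conn.execute("INSERT INTO events (payload) VALUES (?)", (f"slow-{i}",))
per_row = time.perf_counter() - start

# Fast: the same 100 rows in a single transaction, one commit at the end.
start = time.perf_counter()
with conn:
    conn.executemany("INSERT INTO events (payload) VALUES (?)",
                     [(f"fast-{i}",) for i in range(100)])
batched = time.perf_counter() - start

print(f"100 commits: {per_row:.3f}s, 1 commit: {batched:.3f}s")
```

Only after this kind of batching stops helping does it make sense to reach for the more exotic fixes listed above.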

What if you are running a large server farm? Well, I admit that’s not something I’ve done. But I think there would be a performance advantage to co-locating the app and DB on the same servers. If you need horizontal scaling, then you might want an embedded, distributed database, which is something I’m working on. (I don’t know of any existing ones, probably because the advantages of embedded databases are not widely recognized.)

I’ll grant that if you want to put a security boundary between your application and DB (in other words, if you don’t trust your application), then a database server is the best way to do that. But most DBs aren’t locked down against malicious applications, and it can be difficult to implement sufficiently fine-grained permissions without custom code. (If you’re worried about the attack surface of your application, you should also be worried about the attack surface of your DB, which probably isn’t very well hardened either.)

What are the other advantages of DB servers? Well, your DB can be written in any language, like Golang (surprisingly common, e.g. CockroachDB or etcd). But even embedding Go is supposed to be possible, since Go can build C shared libraries (-buildmode=c-shared).

Are there other reasons to prefer DB servers? Maybe, but I can’t think of them. This may have implications for the current trend of microservices, but I’ll save that for another time.

Even if you’re still convinced DB servers are nicer to use, embedded DBs still have one final advantage: an embedded DB can always be put in a server wrapper. If the DB is developed as a server from the start, it’s a lot more work to turn it into a library.
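To make that last point concrete, here is a deliberately naive, hypothetical sketch of putting an embedded DB behind a server, using only Python's standard library. It accepts a SQL statement as the POST body and returns rows as JSON; everything here (the `DB_PATH`, the wire format, the lack of auth) is my own invention for illustration, not something you'd deploy:

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_PATH = "app.db"  # hypothetical path to the wrapped embedded database

class QueryHandler(BaseHTTPRequestHandler):
    """POST a SQL statement, get the result rows back as JSON.

    Deliberately naive: no auth, no statement whitelist. The point is
    only that the server layer is a thin, optional wrapper around the
    embedded library.
    """

    def do_POST(self):
        sql = self.rfile.read(int(self.headers["Content-Length"])).decode()
        conn = sqlite3.connect(DB_PATH)  # per-request for simplicity
        try:
            body = json.dumps(conn.execute(sql).fetchall()).encode()
            status = 200
        except sqlite3.Error as exc:
            body, status = str(exc).encode(), 400
        finally:
            conn.close()
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), QueryHandler).serve_forever()
```

Going the other direction, from a server's internal storage engine to a reusable library, means untangling it from the daemon's process model, configuration, and networking code, which is why it so rarely happens.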