So I started writing the header file last night, which some might argue was a violation of my rule of sleeping on new ideas[#]. But the goal of the rule is to spend time thinking about them before digging in, and laying out the header file is a way of doing that.
Anyway, I started working on the implementation this morning and the first 80% came together really fast. I have all the basic (non-cursor) ops written, a good start on the cursor ops, and manual compaction done.
The remaining cursor ops are hung up on the API for jumping to the first/last item and initializing the cursor if necessary when doing next/previous.
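Roughly the shape I'm considering (just a sketch, not the real header; all the names are placeholders):

    /* Sketch of the open question above. The idea: first()/last() jump to an
     * end of the key space, and next()/previous() lazily initialize an
     * unpositioned cursor to behave like first()/last(). */
    typedef struct LSMCursor LSMCursor;

    int lsm_cursor_first(LSMCursor *cursor);     /* position at the smallest key */
    int lsm_cursor_last(LSMCursor *cursor);      /* position at the largest key */
    int lsm_cursor_next(LSMCursor *cursor);      /* unpositioned? acts like first() */
    int lsm_cursor_previous(LSMCursor *cursor);  /* unpositioned? acts like last() */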
The other big thing left is auto-compaction. We have to track meta-data on each level of the tree (mainly just the number of items, not sure what else), and we have to come up with a way of deciding how much to compact on each commit.
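Something like this is what I have in mind for the per-level metadata (a sketch; only the item count is clearly needed, the other fields are guesses):

    #include <stdint.h>

    /* Per-level bookkeeping for auto-compaction. Only nitems is clearly
     * needed; the rest are guesses at what might help decide how much to
     * compact on each commit. */
    typedef struct {
        uint64_t nitems;   /* live items currently in this level */
        uint64_t nbytes;   /* approximate size, if it turns out to matter */
        uint64_t nwrites;  /* writes absorbed since this level was last compacted */
    } LSMLevelMeta;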
Scaling MySQL and MariaDB to TBs: Interview with Martín Farach-Colton.
TokuDB has much higher insertion rates, and in particular, very high single-threaded insertion rates. Thus, the slave can keep up with a much more demanding workload.
Oh, I get it. “Slave lag” is the same problem we’re having with doing fast syncs in EarthFS, even though it isn’t a master/slave system.
The Log-Structured Merge Tree (LSM) also achieves very high indexing rates, but at the cost of so-called read amplification, which means that although indexing is much faster than for B-trees, queries are slower.
Oh, I get it. When you walk down the fractal tree, you pick up any intermediate messages along the way, since they’re in the same place.
However, that’s why LSM-trees usually support Bloom filters for lookups. So the complaint isn’t really valid, and in fact Bloom filters might make lookups faster in practice. Fractal trees have to pay for their message system with smaller branches.
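For reference, the LSM lookup with per-level Bloom filters looks roughly like this (a sketch with hypothetical helper functions, not any particular engine's API):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct LSMLevel LSMLevel;

    /* Hypothetical helpers: a per-level Bloom filter check and a real search. */
    bool level_bloom_may_contain(LSMLevel *level, void const *key, size_t len);
    int level_get(LSMLevel *level, void const *key, size_t len,
                  void *valbuf, size_t *vallen);

    /* Sketch of an LSM point lookup: newest level first, skipping any level
     * whose Bloom filter says the key definitely isn't there. False positives
     * just cost one wasted search; false negatives can't happen. */
    int lsm_get(LSMLevel *const *levels, size_t nlevels,
                void const *key, size_t len, void *valbuf, size_t *vallen)
    {
        for(size_t i = 0; i < nlevels; i++) {
            if(!level_bloom_may_contain(levels[i], key, len)) continue;
            int rc = level_get(levels[i], key, len, valbuf, vallen);
            if(rc >= 0) return rc; /* found */
        }
        return -1; /* not found in any level */
    }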
Q9. How is TokuDB different with respect to SchoonerSQL, NuoDB and VoltDB?
You have to be pretty badass to handle questions like this in an interview. I’ve never even heard of SchoonerSQL.
LevelDB file layout and compactions
Reading over this again. AFAICT, it’s just so needlessly complicated. Their 2MB files are basically pages in a regular b-tree.
Compactions for a particular level rotate through the key space. In more detail, for each level L, we remember the ending key of the last compaction at level L. The next compaction for level L will pick the first file that starts after this key (wrapping around to the beginning of the key space if there is no such file).
They have to do this because files never get rewritten once they’re stored. However, we’re using a b-tree, so it automatically handles dividing things up so branches stay relatively balanced. That means we don’t have to rotate, and in fact we might see better performance by not rotating.
By always compacting from the head or tail of each level, we can intentionally skew data across the levels so that cursors will perhaps be more efficient. Instead of hitting the cursor for each level randomly, we’d mostly hit the cursor for one level over and over, then switch to the next level, etc.
Worth trying at least.
Also, I think that if we compact the last level from the tail, and/or the first level from the head, we can save the underlying b-tree a lot of work because the items don’t actually have to move at all. I’m not sure which one we should choose, or if we should do both.
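To make the contrast with LevelDB concrete, here's a sketch with hypothetical names (the real code would just be walking the underlying b-tree):

    #include <stddef.h>

    typedef struct LSMLevel LSMLevel;

    typedef enum {
        COMPACT_ROTATE, /* LevelDB-style: resume after the last compacted key */
        COMPACT_HEAD,   /* always take the smallest keys in the level */
        COMPACT_TAIL,   /* always take the largest keys in the level */
    } CompactionPick;

    /* Hypothetical: choose the range of items to merge into the next level.
     * With COMPACT_HEAD/COMPACT_TAIL there is no per-level "last key" to
     * remember, and (per the note above) taking the head of the first level or
     * the tail of the last level may let the b-tree avoid moving items at all. */
    size_t level_pick_range(LSMLevel *level, CompactionPick pick, size_t maxItems);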
Right now we always fully compact level zero at the end of each transaction. You can even do compaction in the middle of a transaction, and open cursors should (hopefully) keep working.
Oh whoops, no they won’t. When we move an item from one level to another, the cursor at the new level has to pick up on it. Fixing that would take a lot of extra work too.
Well, the cursors can lose their positions, but at least they won’t completely break.
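One cheap way to get that behavior, sketched with hypothetical functions: have the cursor remember its current key, and re-seek to it after compaction rearranges the levels.

    #include <stddef.h>

    typedef struct LSMCursor LSMCursor;

    /* Hypothetical API: copy out the cursor's current key, then seek back to
     * it once compaction has moved items between levels. The cursor forgets
     * its per-level positions but ends up pointing at the same logical key. */
    int lsm_cursor_current(LSMCursor *cursor, void *keybuf, size_t *keylen);
    int lsm_cursor_seek(LSMCursor *cursor, void const *key, size_t keylen);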
Figuring out how to do auto-compaction intelligently seems tough. If you have a “full” tree, one write at the top needs an additional write at every level. Perhaps the rule is number of writes * number of levels * overall fullness. A tree that’s mostly empty should be allowed to fill up, whereas a tree that’s extremely full has to be pruned back.
We have to be careful that if one level is too full and another is brand new and mostly empty, they don’t just average out. Or actually, that might be okay, as long as we prune the full level whenever we prune anything.
We should also make sure that we don’t do compactions below a certain size, because they just aren’t efficient.
It’s actually probably better not to take the current write size into account at all. We don’t want to make big transactions even slower; we want to shave the peaks.
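Putting those constraints together, the heuristic might look something like this (a sketch; the formula and every constant are guesses to be tuned): scale by the backlog of writes from previous commits rather than the current transaction, multiply by the number of levels and the overall fullness, and skip anything below a minimum size.

    #include <stdint.h>

    /* Sketch of the auto-compaction budget discussed above. `backlog` counts
     * writes from earlier commits that haven't been compacted yet --
     * deliberately NOT the size of the current transaction, so big commits
     * don't get punished further. */
    #define COMPACTION_MIN_ITEMS 64 /* below this, compaction isn't worth the overhead */

    static uint64_t compaction_budget(uint64_t backlog, unsigned nlevels, double fullness)
    {
        /* fullness: 0.0 = mostly empty (let it fill up), 1.0 = full (prune hard). */
        double work = (double)backlog * (double)nlevels * fullness;
        if(work < COMPACTION_MIN_ITEMS) return 0;
        return (uint64_t)work;
    }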
With any luck I should be able to finish this up tomorrow. Pretty excited to test it out.