For the past week or two, I’ve been working on designing a new consensus algorithm.
My original intent was to stay as far away from consensus algorithms as possible, since they’re widely known to be Really Hard. But I was working on a distributed back-end for libkvstore, and none of the existing libraries for Raft or Paxos that I found met my requirements. Plus some of them I couldn’t even figure out how to use.
So I started down the dark path of reading the Raft paper, either to learn how to use a Raft library, or to implement my own. Then I had a flash of insight that made me think it’d be easy to design a new algorithm from scratch.
Now a couple weeks later, I just encountered my second “back to the drawing board” moment. So I figured I’d write about these gotchas of consensus.
Let it be known that I am still just a beginner in this field. I only got drawn into it grudgingly.
Gotcha #1: You need two round trips to commit in the general case.
It seems logical that you can just broadcast a write to all of your peers. Once you get approval from a simple majority (
N/2+1), the write is committed and you’re free to commit it locally. And, so far as I know, that does in fact work for a single possible writer and a single possible write.
However if you allow multiple writers, which naively you might try to do, you can easily get a split vote where no quorum is possible. Even worse, if one or more peers is unresponsive, you can enter an ambiguous situation where you don’t know which write reached a majority, if any. That can produce a deadlock with just one unavailable peer, when the algorithm should be able to handle almost half.
In order to resolve that generally, you need two round trips, which is why Basic Paxos has a “prepare” phase and an “accept” phase. The prepare phase locks out previous writers so they don’t interfere, and then the accept phase either writes a single value or unambiguously fails.
Gotcha #2: Of course, it isn’t that easy.
The problem is that an “accept” can be interrupted by someone else’s “prepare,” bringing back the ambiguity. You can have a write that is partially accepted and partially rejected because some of the peers promised to only accept someone else’s write instead. In the presence of even one unavailable node, the outcome can be indeterminate.
Naturally it turns out that Basic Paxos has a solution. (To be clear, I’m not sure I totally understand it yet, which is one reason I’m writing it down.) Basically, when you go to write, you are told if there is a write that is potentially ambiguous, and you’re obliged to “force it through” before doing your own write. It’s like if a door closes on you while you’re walking through it, so the person behind you gives you a kick to get through. (Or, you know, reopens it for you.)
A weird thing about this is that even a write that you thought failed (say it was rejected by all peers but one) can end up succeeding anyway. It might succeed or it might not, depending on which peers are up and which are down, or the order they respond in. That violates my understanding of “reached a quorum == committed, otherwise not.”
Anyway, that’s my understanding so far. We’ll see if this new algorithm thing pans out, or whether this was just a time-consuming learning experience.