association-list a veritable mint for dunning-kruggerands

The road from 2pc to Paxos

Attention conservation notice: This is a non-rigorous, under-referenced, and probably subtly (maybe baldly) wrong description of how strong distributed consensus mechanisms evolved in reaction to issues with prior replication protocols. It is written by a non-expert who only has one, under-tested consensus system to his name. I start with a deliberately bad version of 2pc that no one would ever write in the real world. There may not be much here for the advanced reader.


While it’s a common refrain among more experienced engineers that most people will never need more than a single beefy database machine, an increasing number of workloads now require better fault tolerance than a single machine can provide, lower latencies than an RDBMS can manage, or both. While a number of these systems are eventually consistent, more and more are adding stronger consistency models. Thus, it’s becoming more and more important for engineers to understand how these systems work, and how they can fail. While you may never need to implement Paxos, you’re actually quite likely to have to write an application that works with primitives provided by ZooKeeper, and it pays to understand how these systems function, fail, and limp.

Our first steps

Two phase commit is a value replication protocol. This means that it’s concerned with keeping a coherent view of data across some number of machines, so that if one fails, the data is safe, and still recoverable. In action, it looks something like this:

                 A         B         C
      set k1=3   |         |         |
client+--------> |         |         |
                 | k1=3    |         |
                 +-------> |         |
   propose       | k1=3    |         |
                 +-----------------> |
                 |         |         |
                 |         | ok k1=3 |
                 | <-----------------+
                 |  ok k1=3|         | accept
                 | <-------+         |
                 |         |         |
                 | do k1=3 |         |
                 +-----------------> |
     commit      | do k1=3 |         |
                 +-------> |         |
         ok k1=3 |         |         |
client <---------+         |         |

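The exchange above can be sketched in Python. Everything here is hypothetical and synchronous: `Node` and the direct method calls stand in for real RPC over a network, which is exactly the simplification that hides the failure modes discussed next.

```python
class Node:
    """A replica that stages proposed writes, then commits them."""
    def __init__(self):
        self.proposed = {}   # staged (uncommitted) writes
        self.data = {}       # committed state

    def propose(self, key, value):
        self.proposed[key] = value
        return "ok"          # a real node could also reject here

    def commit(self, key):
        self.data[key] = self.proposed.pop(key)

def two_phase_set(leader, followers, key, value):
    """Naive 2pc: phase one proposes to every node, phase two commits.
    If any node never answers, a real version of this blocks forever."""
    nodes = [leader] + followers
    # Phase 1: propose. If anyone rejects, we would abort.
    if all(n.propose(key, value) == "ok" for n in nodes):
        # Phase 2: commit everywhere, then ack the client.
        for n in nodes:
            n.commit(key)
        return "ok"
    return "aborted"

a, b, c = Node(), Node(), Node()
assert two_phase_set(a, [b, c], "k1", 3) == "ok"
assert c.data["k1"] == 3
```

Note that the `all(...)` call is where the trouble lives: it assumes every node answers, which a real network does not guarantee.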
It works pretty well, but degrades less than gracefully when there are lost or out-of-order messages. While you can mitigate those problems by using TCP and adding retries, where this protocol falls down badly is node failures. Everything is working great, all of the nodes in your system agree, and everything is happy, until one day your leader node dies at an inopportune time and everything is deadlocked forever. This isn’t great, but since we’re not the kind of people to give up just because something doesn’t work at first, let’s start tackling 2pc’s problems and see how that leads us to failure-tolerant strong consistency systems like Viewstamped Replication and Paxos.

What about 3pc?

The three phase commit protocol is another attempt at fixing these problems, but I tend to think of it as a parallel line of research. While it fixes deadlock problems, it still has other issues. In the end, most successful consensus algorithms use 2pc, so 3pc seems less important to cover here in detail.


Since in a real network, nodes can fail and messages can be lost, we first work on leadership failure. All of our follower nodes keep a timer which is reset each time a message is received from the leader. We use timers because perfect failure detection is impossible; we simply cannot tell whether the leader is gone, slow, or on a bad network link, so we use communication gap lengths as a heuristic for failure. When a failure timer fires, the node on which it fires declares that the leader has failed and proposes itself as the next leader of the cluster. Since simultaneous candidacies would cause a great deal of conflict and contention, the timers are usually staggered. This leadership proposal is done in much the same way as a leader would propose a client-written value. The other nodes then either accept or reject the proposal, and the system retries, typically with backoff, until we have an unambiguous winner.
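The failure timer can be sketched like this; the class, the simulated clock, and the timeout values are all mine, not from any particular system:

```python
import random

class Follower:
    """Follower that suspects the leader after a quiet period.
    Timeouts are staggered (jittered) to reduce duelling candidates."""
    def __init__(self, name, now=0.0):
        self.name = name
        self.timeout = 1.0 + random.random()  # staggered failure timer
        self.deadline = now + self.timeout

    def on_leader_message(self, now):
        # Any traffic from the leader resets the timer.
        self.deadline = now + self.timeout

    def should_start_election(self, now):
        # Silence past the deadline is only a heuristic for failure:
        # the leader may be slow or partitioned, not dead.
        return now > self.deadline

f = Follower("b", now=0.0)
f.on_leader_message(now=0.5)
assert not f.should_start_election(now=1.0)  # heard from leader recently
assert f.should_start_election(now=5.0)      # long silence: propose ourselves
```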

Elections are not the only option here. For example, Viewstamped Replication has a failure protocol where all nodes know of all the nodes in the cluster and sort them all using the same algorithm (VR uses a lexical sort of IP addresses). The first node in this sorted list is the leader. When it fails, the next working node on the list becomes the leader, and so on.
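VR’s deterministic choice is easy to sketch, since every node runs the same pure function over the same membership list. The addresses here are made up:

```python
def vr_leader(all_nodes, live_nodes):
    """Viewstamped-Replication-style leader choice: every node sorts the
    full membership identically and takes the first live entry."""
    for addr in sorted(all_nodes):       # VR sorts lexically, e.g. by IP
        if addr in live_nodes:
            return addr
    raise RuntimeError("no live nodes")

cluster = {"10.0.0.3", "10.0.0.1", "10.0.0.2"}
assert vr_leader(cluster, cluster) == "10.0.0.1"
# When the first node fails, everyone independently agrees on the next:
assert vr_leader(cluster, {"10.0.0.2", "10.0.0.3"}) == "10.0.0.2"
```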

Regardless of how we came to the decision, we have a leader now, right?


Maybe not. In the case where we have a network partition, the initial leader may not have failed! Take the following network: {A B C} | {D E F}, where A is the initial leader. D, E, and F cannot send messages to anything on the left side of the partition, and so eventually they elect E as their leader. Both A and E now think that they’re the leader and proceed to scribble all over the cluster’s shared state by processing client requests. This is a condition known as ‘split-brain’. We solve the split-brain problem by saying that to be elected leader, a candidate must have the votes of a strict majority of all the nodes in the cluster, not just the healthy ones. In our network with six nodes, this means that neither A nor E would have the 4 votes (including their own) that they would need to guarantee that the whole cluster is acting together. More formally, we need 2f + 1 nodes to endure f failures, and f + 1 votes for a successful leadership election. To make progress through a three-node partition you’d need seven nodes in total, with at least four of them on one side of the split; our six-node cluster, partitioned three and three, would need to fail to a stop.
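The arithmetic is worth writing down:

```python
def cluster_size(f):
    """Nodes needed to tolerate f failures."""
    return 2 * f + 1

def quorum(n):
    """Votes needed for a strict majority of n nodes."""
    return n // 2 + 1

assert cluster_size(1) == 3 and quorum(3) == 2
assert cluster_size(3) == 7 and quorum(7) == 4
# The six-node split {A B C} | {D E F}: quorum is 4, and neither
# three-node side can reach it, so the whole cluster fails to a stop.
assert quorum(6) == 4
assert 3 < quorum(6)
```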

People often complain, when looking at replicated systems, that they can’t have just two, that the systems are designed for three, five, or more nodes. This is why. This is also why six node clusters like the above are rare; you don’t get anything but overhead from node six. You cannot tolerate three failures until you get to seven.

The temporal ratchet

Sometimes, though, leaders are not actually down. They may be in garbage collection, or perhaps they’ve just been bogged down by too many messages, or something else happening on the machine where they’re running. Suppose, after a leader election, a leader that all nodes had decided had failed comes back to life and starts taking client requests, trying to propose new values. Who is the right leader? How are followers to know whose proposals to accept? Let’s take a quick side trip into abstraction, then we’ll come back to some concrete solutions.

The illusion of linear time

In order to provide the intuitive behavior that most people expect from replicated storage systems (i.e. that they behave like single-node systems), the system must provide a property called linearizability. The linearity here is that of logical actions in a system upon a timeline. While we have a system of perhaps many parts talking to each other on a fallible network, we want to present, externally, the appearance that we’re a single system, processing all requests in a strictly sequential fashion. Consider a history that looks like this (note that A here stands for the leader and elides communication between nodes):

    B k=1 ok      B k=5 ok           B k?     k=8
     \   /         \   /              \      /
      \ /           \ /                \    /
  -----A-------------A-------------------A-----
            /  \          /  \        /  \
           /    \        /    \      /    \
          C k?   k=1    C k?   k=5  C k=8  ok

And we need to be able to do this even when A fails and subsequent requests are sent to node D. All operations that have been acknowledged to the client must remain true across all of the nodes in the system.

    B k=1 ok      B k=5  ok
     \   /         \    /
      \ /           \  /
  -----A-------------A----X  (A fails)
            /  \
           /    \
          C k?   k=1

                                     B k?     k=8
                                      \      /
                                       \    /
  ------------------------D-------------D-----
                          /  \        /  \
                         /    \      /    \
                        C k?   k=5  C k=8  ok

In order to do this, we need for all of the parts in our system to share an ordering of events. We could do this with timestamps, but the wall clock is not your friend. So we use logical timestamps, typically integer-based epochs, which are simply numbers that get incremented every time something happens. In our most recent case, the something is a leadership election. In a production system, you’d likely have two epochs in a tuple: {leader_epoch, request_epoch}, where you compare first on the leader epoch and then the request epoch, with any message with a higher leader epoch coming logically later than any message with a lower leader epoch, regardless of the request epoch.
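Python’s lexicographic tuple comparison happens to capture this ordering directly; `later` is a hypothetical helper, not a standard name:

```python
def later(stamp_a, stamp_b):
    """Order (leader_epoch, request_epoch) stamps: the leader epoch
    dominates, and the request epoch breaks ties within one term."""
    return stamp_a > stamp_b   # tuples compare lexicographically

# A message from a newer leader is later, regardless of request epoch:
assert later((3, 0), (2, 999))
# Within a single leadership term, the request epoch orders messages:
assert later((3, 7), (3, 6))
```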

A pawl for our ratchet

So that nothing gets confused when old messages arrive, or stale, confused nodes return to the network, every actor must keep a record of the newest epoch that it has seen, and discard any message older than that epoch. Additionally, all messages must carry the epoch of their origin node when they are sent. Implemented correctly, this guarantees that time in the system can only march forward and that any message can be ordered with respect to all of the other messages in the history of the system.
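A sketch of such a ratchet (the class and method names are mine):

```python
class Ratchet:
    """Drops anything older than the newest epoch seen, so logical
    time can only move forward on this node."""
    def __init__(self):
        self.epoch = (0, 0)   # (leader_epoch, request_epoch)

    def admit(self, msg_epoch):
        if msg_epoch < self.epoch:
            return False      # stale: a delayed message or deposed leader
        self.epoch = msg_epoch
        return True

r = Ratchet()
assert r.admit((2, 5))        # moves our clock forward
assert not r.admit((1, 9))    # from a deposed leader: dropped
assert r.admit((3, 0))        # a newer leadership term: admitted
```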

Let’s take a moment to rehearse our progress so far. We have a solution for failed leaders, a solution for split-brained clusters, and a solution to misordered messages and confused nodes from the deep past. Is this enough to ensure linearizable (often called “strong”) consistency?

Reads are no longer simple

Kind of! 2PC is mostly concerned with writes; it doesn’t require that reads do anything particularly special as long as they’re done against the leader. But in our new, fault-tolerant world, simply reading from the leader and assuming that anything committed there is current won’t work. Why? Because the leader we’re talking to might be lagged out of the network, partitioned from its followers, who are now busy electing a new leader and carrying the state of the cluster away from the state the old leader knew of when it received our read. Thus, we must thread our reads through our consensus machinery, so that they enjoy the same linearizability guarantees as writes. This means that we need to hear from a majority of nodes that the read value is still correct.
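A toy version of a quorum-confirmed read, assuming the confirmations have already been gathered from followers elsewhere:

```python
def linearizable_read(leader_value, confirmations, n):
    """Serve a read only once a majority of the n nodes agrees with
    the leader's value; otherwise the leader may be deposed and must
    refuse to answer rather than return stale data."""
    votes = 1 + sum(1 for v in confirmations if v == leader_value)
    if votes >= n // 2 + 1:
        return leader_value
    raise RuntimeError("cannot reach a majority; refusing the read")

# Healthy: two of three nodes agree, so the read is served.
assert linearizable_read(3, [3], n=3) == 3

# A partitioned leader with no confirmations must not answer:
try:
    linearizable_read(3, [], n=3)
    assert False, "should have refused"
except RuntimeError:
    pass
```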

OK, it works, now how can we make it fast?

This … this is a big, big question. There are any number of approaches here to all aspects of cluster performance. Since this is such a broad topic, I’ll just touch briefly on two aspects, reducing network round-trips and key partitioning.

Network round-trips

One key aspect of research in bringing consensus to production has been the reduction of round trips between the various parts of the system. Largely this has been done by coalescing messages (e.g. VRR’s commit ack is carried on the next proposal unless a timeout fires, see VRR section 4.1, step 6) and by using time-based leases to allow reads from the leader to skip the protocol within certain tight bounds (less than the timeout after which other nodes will elect a new leader). Note that this reliance on wall-clock time (it isn’t your friend, remember) makes things less safe, and imposes additional requirements on clock quality in the system. We need to be sure that clocks are steady and ticking at approximately the same rate, or broken leases can allow bad reads.
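The lease check itself is tiny; the hard part, which this sketch waves away entirely, is bounding the clock drift between machines:

```python
def can_serve_local_read(now, lease_expiry, max_clock_drift):
    """A leader may answer a read locally only while its lease (granted
    for less than the election timeout) is still valid even under the
    worst assumed clock drift between machines. If drift exceeds the
    budget, the lease may really have expired and the read is unsafe."""
    return now + max_clock_drift < lease_expiry

# Safe: the lease runs well past now, even allowing for drift.
assert can_serve_local_read(now=10.0, lease_expiry=12.0, max_clock_drift=0.1)
# Unsafe: drift could mean the lease has already expired elsewhere,
# so this read must go through the full quorum machinery instead.
assert not can_serve_local_read(now=11.95, lease_expiry=12.0, max_clock_drift=0.1)
```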

Key partitioning

An early insight into how to optimize these systems was that unrelated operations did not need to run through the same consensus machinery to be properly ordered (since they have no ordering with relation to each other, any possible ordering is correct). Thus, if you can partition your writes into different keyspaces, you can run multiple leaders (often on the same group of nodes), in order to increase throughput. Some variants (notably Egalitarian Paxos) take this even farther, by noting that individual writes will often have dependencies at an even narrower granularity than a keyspace, i.e. if a read, write, or transaction involving this key is not underway on 2f + 1 nodes, then the operation can proceed safely down the fast path on the node that accepted it (rather than a specific leader). This allows for simpler cluster construction and operation at the cost of some protocol complexity.
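A minimal sketch of routing keys to independent consensus groups; the hashing scheme here is illustrative, not taken from any particular system:

```python
import hashlib

def group_for_key(key, num_groups):
    """Route each key to one of several independent consensus groups.
    Keys in different groups have no ordering relative to each other,
    so each group can run its own leader and proceed in parallel."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

# Every group index is valid, and routing is deterministic: the same
# key always lands in the same group, preserving its per-key ordering.
g = group_for_key("user:42", 3)
assert 0 <= g < 3
assert group_for_key("user:42", 3) == g
```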

A few words on Byzantine failure

Byzantine faults are the really bad ones. Misbehaving or buggy nodes, malicious nodes attempting to corrupt the cluster state, and incomplete partitions, where some nodes have radically different views of the cluster. They are, generically, quite hard to deal with. The best that most protocols can do (without massive complication) is to fail safely to a stop. Other protocols try to do more, but my read of the current academic consensus is that while some subsections of the problem are solvable, and most protocols can be cheaply hardened against a number of Byzantine faults, that complete tolerance of Byzantine failures cannot be achieved, only approached. This is why all open, distributed ledgers are doomed to failure.


The following are a selection of papers and sites that I’ve found useful when learning about this topic.


I’d like to thank Andrew Stone and Fred Hebert for reviewing this piece and suggesting substantial improvements.
