$title="The two-thirds consensus protocol"; $id="EventMLTutorial"; include_once "../../../prlheader.php"; ?>
Consider the following problem: A system has been replicated for fault tolerance. It responds to commands issued to any of the replicas, which must come to consensus on the order in which those commands are to be performed, so that all replicas process commands in the same order. Replicas may fail. We assume that all failures are crash failures: that is, a failed replica ceases all communication with its surroundings. The two-thirds consensus protocol is a simple protocol for coming to consensus, in a manner that tolerates n failures, by using (precisely) 3n + 1 replicas.
Input events communicate proposals, which consist of integer/command pairs: ⟨n,c⟩ proposes that command c be the nth one performed. The protocol is intended to obtain agreement, for each n, on which command will be the nth to be performed, and to broadcast a notify message with those decisions (which are also integer/command pairs) to a list of clients.
Each copy of the replicated system will contain a module that carries out the consensus negotiations. In this specification we describe only those modules (which we continue to call Replicas). To specify the full system we would have to include a description of how those decisions are used. That is done in the description of the Paxos protocol (section 6).
Browsing notes:Readers may find it helpful to display the full code alongside this page.
![]()
For convenient display, we split the full specification into smaller chunks: figure 4 contains the prefatory information (parameters, imports, type definitions, message declarations, variables, auxiliaries) and figures 5 through 8 define the classes. Section 5.1 walks through code, redisplaying fragments of the text as they are discussed. A reader may find it helpful first to concentrate on the informal description of each class provided and then, before studying details, turn to section 5.2 to see some scenarios showing the protocol in action. Section 5.3 explains why the protocol satisfies the basic safety property of consistency—it will not send contradictory notifications. That section also defines the precise sense in which the protocol can “tolerate” up to flrs “failures,” but does not provide a proof of that.
This section comments on the preliminary definitions given figure 4, and also introduces the library combinator until.
Parameters
The parameters of the protocol are
We make no assumptions about who submits inputs or constraints on how they are submitted.
The declaration of the Cmd parameter also introduces a parameter for an equality operator:
When we instantiate the type Cmd, we must also instantiate cmdeq with an operation that decides equality for members of that type. The keyword Deq denotes a type constructor: (Cmd Deq) is the type of all equality deciders for Cmd. We need cmdeq because we want to apply deq-member to compute membership in a list of commands; as noted in section 3.2, we must therefore supply an equality decider.
Variables
One reason for the variable declarations, such as
is to introduce notational conventions that make the specification easier to read. Type checking will object if the notations are misused.
A second reason is to help the type inference algorithm, which sometimes requires hints about the types of the arguments to functions being defined. An expression or pattern may be labeled with a type, which will be checked statically, and may also constrain polymorphism that might otherwise arise. E.g., after
foo is the polymorphic identity function on every type; bar is the identity function on integers; and baz is the identity function on pairs whose second coordinate is boolean. Typically, we want library functions to be highly polymorphic and widely applicable, but the functions defined in EventML programs to be much more constrained. By and large, the polymorphism of an EventML program is expressed in its parameters.
Without variable declarations for ni and sender the definition of the newvote operation would have to be expressed as
but with those declarations, we may simply write
As a practical matter, there’s not much point in trying to anticipate where type inference needs hints. Most commonly, help may be needed when the right hand side of the definition calls on a polymorphic function such as deq-member, which operates on lists of any type that has a decidable equality operator.
The balance between introducing variable declarations and adding type labels to patterns and expressions is a matter of taste.
Auxiliaries
We introduce a convenient notation for specifying the “init” parameter of SM*-class or Memory* (section 4):
Used in that context, (init x) is the function that assigns the initial state x to every location.
Class combinators
The specification uses one new library combinator:
Replica is the event class characterizing the actions of a decider. As noted in figure 8, the main program
installs a decider at each location in locs.
For each n, a Replica will spawn (at most) one instance Voter to communicate with other instances of Voter and come to consensus on a single proposal of the form (n,_).
For each n, NewVoters spawns a Voter in response to the first proposal or vote it receives concerning command n.
We define consensus on proposal ⟨n,c⟩ to mean that 2/3 (plus one) of the replicas vote for it. On any particular poll of the voters that degree of consensus cannot be guaranteed—so we allow do-over polls, for which we adopt the following terminology. Successive polls for each command number are assigned consecutive integers called innings; the pair ⟨command_number,inning⟩ is called the polling or voting round.
Votes are of type Vote. Each contains:
This section refers to figure 5.
A Replica acts when NewVoters does, in response to propose and vote inputs. These are recognized by the class Proposal:
Proposal observes the value of type Proposal input in its input.
ReplicaState maintains the state of a Replica, enough information to recognize the first time it sees a Proposal-event about command n (meaning a value of the form ⟨n,c⟩ for some command c). Its state has type Int * (Int List). The Int component is the greatest n for which it has seen such an event; and the (Int List) component is the list of all natural numbers less than that maximum for which it has not yet seen a proposal event.
The initial state of a ReplicaState is ⟨0,nil⟩. The infix operator ++ is the append operator and the imported Nuprl operations from-upto and list-diff have the following meanings:
![from- upto i j = [i;i+ 1;i+ 2;...;j - 1]
list- diff (op =)[a;b;...][m;n;...] = the result of deleting all occurrences of m, n,... from [a;b;...]](tutorial11x.png)
Every event is a ReplicaState-event, and observes the state of the state machine when the event occurs (before any processing).
NewVoters-events are Proposal-events. NewVoters compares the data observed by Proposal with the state of the replica when the message arrives, in order to decide whether it is the first proposal about some n.
A Voter is a parallel composition of two classes:
where:
This section refers to figure 6.
Round ((n,i),c)
Round ((n,i),c), running at location loc, broadcasts a vote from loc for command c in round (n,i) and runs an instance of Quorum. Quorum (n,i) keeps a tally of votes received at l in round (n,i) and uses that tally to determine either that consensus has been reached (in which case it notifies the clients and sends every Replica, including itself (i.e., the replica that spawned it), a decided message) or that consensus might not be possible in inning i (in which case it sends to itself a suitable retry message).
(Quorum (n,i)) is a state machine that responds to vote messages. Intuitively, its state consists of a pair ⟨cmds,locs⟩. Each time it receives a new vote for proposal ⟨n,c⟩ in inning i, it prepends c to the list cmds. locs is the list of the locations that sent those commands. (We keep the list of senders so that, if a vote from any sender is delivered multiple times, it will only be counted once.) The initial state is a pair of empty lists. (QuorumState (n,i)) is the “pre” Moore machine that observes this state when a vote arrives.
The transition function for (QuorumState (n,i)) is (add_to_quorum (n,i)). A vote message is a no-op unless it’s a vote in round (n,i) that comes from a new location. If it’s both, then the vote is tallied by prepending to it state components the command it votes for and the location of its sender.
Quorum (n,i) is a Mealy machine defined from QuorumState. It produces an output once it has received votes from 2 flrs + 1 distinct locations. If all of them are votes for the same command d, it broadcasts notify and decided messages. If not, then it is possible that on this round no proposal will ever receive 2 flrs + 1 votes; so it sends itself a retry message to trigger initiation of inning i + 1. (Once it has sent the retry message it will ignore any votes it subsequently receives in round ⟨n,i⟩, even if they would result in some proposal’s receiving 2 flrs + 1.)
Consider first the outer conditional. The (cmds,_) argument matches the value observed by QuorumState, so (length cmds) is the number of votes tallied before the input arrives. If this test fails then, even with the new input, the state machine will not yet have received 2 flrs + 1 votes, so the input is ignored.
Consider the inner conditional. The imported operation poss-maj implements the Boyer-Moore majority algorithm. Thus, the locally defined constants k and x have the following meaning: If some element of the list c.comds appears in a majority of its entries, x is that element and k is the number of times it occurs. Thus, the inner conditional tests for unanimity.
The data of a retry message consists of the new round to be initiated and, in addition, the name of a command to propose in this new round. The definition of roundout attempts to choose that command in a reasonable way: So, if the votes are not unanimous, but some command receives a majority, that majority-receiving command will be proposed in the retry message.8This is crucial to the correctness of the protocol.
This section refers to figure 7.
Halt n
Halt n recognizes the arrival of decided message with body n. We make it a class of type Unit, since the only information conveyed is the fact that the message has arrived.
NewRounds n
Recall that (NewRounds n) decides when to initiate a new round of voting about the nth command and, when necessary, spawns an instance of Round, supplying it with a new round number of the form ⟨n,_⟩ and a command to vote for in that round.
(NewRoundsState n) is a “pre” Moore machine. It’s state is an integer, initially 0. At any location it keeps track of the greatest inning i for which it has “participated” in a round of the form ⟨n,i⟩. A location has “participated” in such a round if it has received a retry message with data ⟨⟨n,i⟩,_⟩, or a vote message with data ⟨⟨⟨n,i⟩,_⟩,_⟩. So its input events are recognized by RoundInfo, which observes the round/command pair embedded in its input.
The transition function, update_round, updates the state whenever its input constitutes participation in an inning greater than the current state value:
There’s some redundancy in defining the Mealy machine NewRounds from NewRoundsState.9The next version of the library will contain a different set of combinators that avoids that. The condition in when_new_round is the same as that in the transition function updated_round: when the transition is a no-op, NewRounds ignores the input; when it’s not, NewRounds passes along the input that caused the update.
This section contains message sequence charts that describe some possible runs of the 2/3-consensus protocol. To make the charts easier to read, all message arrows are drawn horizontally (except for self-messages).10A horizontal arrow does not imply instantaneous communication. That requires a small, but semantically inessential, deviation from the official semantics of EventML. Actions that are atomic in EventML may be shown as nonatomic. Consider figure 9. The top diagram shows A broadcasting message x to B, C, and D as a single event. At C, the act of receiving message x and replying with y is atomic. The second diagram teases everything apart.
We can represent delay in message transit, in part, as a delay in sending the message. Since only message arrivals are observable, no distinction between the picture and the official semantics will be observable.
A detailed look at retry Figure 10 shows (part of) one possible run of the consensus protocol, in which a round ends not in consensus but in a retry that starts a new round. We assume that flrs = 1, so there are four instances of Replica and a proposal will be accepted if it gets three votes. The diagram does not depict all the classes—in particular, we show only three of the replicas—and does not display all the messages sent. It contains abbreviations, which are defined in the following table:
| vote1x | = | [vote : ((2, 0), x, l1 ) ] |
| vote2 | = | [ vote : ((2, 0), x, l2 ) ] |
| vote4y | = | [ vote : ((2, 0), y, l4 ) ] |
| retryx | = | [ retry : ((2, 1), x) ] |
| vote′2x | = | [ vote : ((2, 1), x, l2 ) ] |
| α | : | start Round ((2,0),x); Quorum state = ([x], [l1 ]) |
| β | : | Quorum state = ([x; x], [l1 ; l2 ]) |
| γ | : | Quorum state = ([x; x; y], [l1 ; l2 ; l4 ]) |
| δ | : | start Round ((2,1),x); Quorum state = ([x], [l2 ]) |
Note that votes not marked with a “′” are cast in inning 0 (i.e., in this case, round (2,0)) and votes marked with “′” are cast in inning 1.
This run begins when the Replica at location l1 receives a proposal (2,x) from the environment. We assume that location l1 has not previously received a vote or proposal for command 2; accordingly, it responds by spawning an instance of Voter (2,x) at l1. Only one component of this Voter will play a role: Round ((2,0),x). This class broadcasts vote1x, a vote for the proposal it received—though the diagram shows only two of those messages. Its Quorum component plays no role in this part of the run.
A Replica can respond to either a vote or a proposal. When the Replica at location l2 receives vote1x (also assumed to be new), it spawns an instance of Voter (2,x) at l2. This initiates an instance of Round ((2,0),x) at location l2, which will broadcast vote2 and spawn an instance of Quorum (2,0) at l2. Of this broadcast we show only the message it sends to itself.11The Replica at location l2 sees this vote but, as it has already seen a vote for command 2, the self message does not cause it to spawn a new Voter. Comment α says that the vote that spawned the Round updates the internal state of Quorum to ([x],[l1]), recording the fact that a vote for command x came from l1. As β indicates, the self message updates the state of this Quorum to ([x;x],[l1;l2]).
Meanwhile, the Replica at location l4 has received a competing proposal: that command 2 be y, not x. It spawns Voter (2,y), which broadcasts vote4y; we show only the message received by the Voter at l2. This updates the state of Quorum at l2 to ([x;x;y],[l1;l2;l4]). Once it has received votes from three distinct locations Quorum makes a decision: in this case, because the votes are not unanimous, it must start a new round by sending itself a retry message.12It is possible that the fourth Replica would cast a vote for proposal (2,x), providing the three votes, but that would come too late. As δ indicates, this retry starts Round ((2,1),x). So the Voter at l2 begins by broadcasing vote′2x.
Notification and retry in the same round Figure 11 shows part of a run in which the Voter at l1 broadcasts a notification that the second command will be x, but the Voter at l2 sends a retry that launches a new round. As before, the diagram does not depict all the classes or all the messages sent. Instead of walking through the successive states of the Quorum classes, we only note their states when they reach a decision. The abbreviations are as follows:
| vote1x | = | [ vote : ((2, 0), x, l1 ) ] |
| vote2x | = | [ vote : ((2, 0), x, l2 ) ] |
| vote3x | = | [ vote : ((2, 0), x, l3 ) ] |
| vote4y | = | [ vote : ((2, 0), y, l4 ) ] |
| decidedx | = | [ decided : (2, x)) ] |
| notifyx | = | [ notify : (2, x) ] is broadcast to all clients |
| retryx | = | [ retry : ((2, 1), x) ] |
| α | : | Quorum state = ([x; x; x], [l1 ; l2 ; l3 ]) |
| β | : | Quorum state = ([x; x; y], [l1 ; l2 ; l3 ]) |
| γ | : | start Round ((2,1),x); Quorum state = ([x], [l2 ]) |
The first three votes seen by the Voter at location l1 are votes for x, so it notifies all clients that agreement has been reached—command 2 is x—and sends a decided message to stop all the Voters working on command 2. The Voter at location l2 sees two votes for x and one for y and it launches a new round before it receives the decided message that stops it. The crucial point is that, on launching this round it casts its vote for x. If the retry proposed y, it might be possible that the remaining voters in some later round would come to consensus on command y; clients would then receive a contradictory notification saying that command 2 is y. Section 5.3 explains why this calamity cannot occur.
Failure to achieve consensus Figure 12 illustrates a run in which this protocol fails to achieve consensus, a possibility that, according to the FLP theorem FLP85] is inevitable. The abbreviations are as follows:
| vote1x | = | [ vote : ((2, 0), x, l1 ]) |
| vote2x | = | [ vote : ((2, 0), x, l2 ]) |
| vote3y | = | [ vote : ((2, 0), y, l3 ]) |
| vote4y | = | [ vote : ((2, 0), y, l4 ]) |
| retryx | = | [ retry : ((2, 1), x ]) |
| retryy | = | [ retry : ((2, 1), y ]) |
| α1 | : | Quorum state = ([x; x], [l1 ; l2 ]) |
| α2 | : | Quorum state = ([x; x; y], [l1 ; l2 ; l4 ]) |
| α3 | : | start Round ((2,1),x); Quorum state = ([x], [l1 ]) |
| β1 | : | Quorum state = ([y; y], [l4 ; l3 ]) |
| β2 | : | Quorum state = ([y; y; x], [l4 ; l3 ; l1 ]) |
| β3 | : | start Round ((2,1),y); Quorum state = ([y], [l4 ]) |
We omit the Voter classes spawned at locations l2 and l3, depicting their messages as coming directly from the Replica classes themselves. In round (2,0), the Replica at l1 votes for x and the Replica voting for y. This exchange of messages results in abandoning round (2,0). But round (2,1) begins in exactly the same way: with Replica at l1 voting for x and at l2 voting for y. The pattern can in principle repeat endlessly.
If, in any round, some Voter finds a quorum for command x, then, in that round, x is the only command that can be proposed by a retry message.
PROOF: Suppose that one Voter sees 2f + 1 votes for command x in a given round. Since each Voter votes for only one command in any round, that round can contain no more than f votes for any command other than x. Now consider the situation of any other Voter making a decision in that round: It will have received 2f + 1 votes, and at most f of them can be for a command other than x. Therefore, at least f + 1 of the votes it sees must be for x; so if it sends a retry message, that retry proposes x.
The argument is not quite done. Suppose one Voter finds a quorum for x in round ⟨n,i⟩ but other Voters do not, and will therefore participate in subsequent rounds. Is it possible that one of those later rounds contains a vote for some other command y (possibly as the result of a new proposal received from some external source), and that, as a result, some later round ⟨n,j⟩ finds a quorum for y? No, because a stronger property holds.
If some Voter finds a quorum for command x in round ⟨n,i⟩ then in any round ⟨n,j⟩ with j > i all votes cast are votes for x.
PROOF: Every vote can ultimately be traced either to a retry message or to a proposal message received by some Replica from an external source. However, a Replica will ignore a proposal with body ⟨n,c⟩ unless it has never before received either a proposal or a vote for something of the form ⟨n,_⟩. Thus, votes that arise from external proposals can be cast only in rounds of the form (_,0). That is to say that all votes in round ⟨n,i⟩ with i > 0 arise from retry messages sent in round ⟨n,i - 1⟩. So, by induction, once we encounter any round in which all retry messages are for command x, all subsequent rounds can only contain votes for x.
Fault tolerance When a process suffers a crash failure it stops sending messages. (It does not perform erratically by, e.g., violating the requirements of the protocol.) The 2/3-consensus protocol will tolerate up to flr crash failures, in the following sense:
All executions of the protocol that suffer only crash failures, and no more than flr of those, are non-blocking—that is, execution never reaches a state from which consensus is impossible.
By the FLP theorem, this is the strongest fault tolerance guarantee that a consistent consensus protocol can provide.