Raft is a consensus algorithm used by many modern distributed systems that need strong consistency. A cluster of N nodes elects a leader via randomized election timeouts and majority votes; the leader serializes all writes through a replicated log; an entry is committed once a majority of the cluster has persisted it. When the leader fails, a new election typically picks a replacement within seconds, depending on the configured timeouts. The core protocol is short enough to summarize in a few pages and is now a common industry default: etcd, Consul, TiKV, and CockroachDB all use it, and MongoDB's replication protocol is based on it.
Use Raft any time you need a strongly consistent replicated state machine: distributed databases, configuration stores, lock services, leader-elected schedulers, and durable queues. Kubernetes' backing store, etcd, is built on Raft. The main alternative is Paxos, which is famously hard to implement; Raft is intentionally simpler and has become a common choice for new systems.
Watch an election happen
The visualization runs a real Raft state machine — randomized election timeouts, vote requests, majority commit. Hit “Kill leader” and watch the next election fire within a few seconds.
Why Raft exists
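For decades, the gold standard for distributed consensus was Paxos, which Leslie Lamport first described around 1989 and published in 1998. It works, but it’s notoriously difficult to understand and even harder to implement correctly. Lamport later restated the algorithm in “Paxos Made Simple” (2001) because readers found the original presentation hard to follow. Production implementations have to fill in many details the papers leave open, and that is where subtle bugs tend to creep in.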
Raft (Ongaro and Ousterhout, 2014) was designed to be easier to understand. It decomposes consensus into three relatively independent sub-problems:
- Leader election — who’s in charge?
- Log replication — leader appends entries; followers acknowledge.
- Safety — committed entries are never lost.
The three-state machine
Every node is always in exactly one of three states:
stateDiagram-v2
[*] --> Follower
Follower --> Candidate: election timeout<br/>(no heartbeat)
Candidate --> Leader: wins majority votes
Candidate --> Follower: higher term seen<br/>OR another wins
Leader --> Follower: higher term seen
Candidate --> Candidate: split vote<br/>→ new election
note right of Follower
Passive. Listens for
heartbeats from a leader.
Resets election timer
on each heartbeat.
end note
note right of Candidate
Increments term, votes for self,
sends RequestVote RPCs.
Randomized timeouts make
split votes rare.
end note
note right of Leader
Accepts client writes.
Replicates via AppendEntries.
Sends heartbeats to keep
followers from timing out.
end note
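
The transitions in the diagram can be sketched as a small state machine. This is a minimal illustration, not a full Raft node: the `Node` class, its method names, and the 150 to 300 ms timeout range are assumptions made for the example, and there is no log or networking.

```python
import random
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class Node:
    """Hypothetical node that tracks role transitions only (no log, no network)."""
    node_id: int
    cluster_size: int
    role: Role = Role.FOLLOWER
    term: int = 0
    voted_for: Optional[int] = None
    votes: Set[int] = field(default_factory=set)

    def new_election_timeout(self, low_ms: float = 150, high_ms: float = 300) -> float:
        # Assumed range; randomizing it makes simultaneous candidacies unlikely.
        return random.uniform(low_ms, high_ms)

    def on_election_timeout(self) -> None:
        # Follower -> Candidate, or Candidate -> Candidate after a split vote.
        if self.role is Role.LEADER:
            return  # leaders do not run an election timer
        self.role = Role.CANDIDATE
        self.term += 1
        self.voted_for = self.node_id
        self.votes = {self.node_id}

    def on_vote_granted(self, voter_id: int, term: int) -> None:
        # Candidate -> Leader once a strict majority has voted in this term.
        if self.role is not Role.CANDIDATE or term != self.term:
            return
        self.votes.add(voter_id)
        if len(self.votes) > self.cluster_size // 2:
            self.role = Role.LEADER

    def on_message_term(self, term: int, from_leader: bool = False) -> None:
        # Any role -> Follower on a higher term; a candidate also yields to a
        # leader that has already won the candidate's own term.
        if term > self.term:
            self.term = term
            self.voted_for = None
            self.votes = set()
            self.role = Role.FOLLOWER
        elif term == self.term and from_leader and self.role is Role.CANDIDATE:
            self.role = Role.FOLLOWER


node = Node(node_id=1, cluster_size=5)
node.on_election_timeout()        # candidate in term 1, holding its own vote
node.on_vote_granted(2, term=1)
node.on_vote_granted(3, term=1)   # third vote of five: majority reached
print(node.role.value, node.term)  # leader 1
node.on_message_term(2)           # a higher term appears
print(node.role.value, node.term)  # follower 2
```

Keeping every transition in one of three handlers mirrors the diagram: each arrow corresponds to exactly one branch.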
Why the term number is the heart of it
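Every Raft message includes the sender’s term. The core rule: a node that sees a higher term adopts it and steps down to follower, and a node that receives a message with a lower term rejects it.
This single rule handles:
- Stale leaders that recover from a partition (their old term is now smaller, so they step down)
- Split brain (each node grants at most one vote per term, so at most one leader can win a majority in any term)
- Vote requests from stale candidates (rejected because the term is too low; candidates whose logs are behind are rejected by a separate comparison of the last log entry’s term and index)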
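
A minimal sketch of how a node might apply term comparison to each incoming message. The names `TermState`, `observe_term`, and `candidate_log_is_up_to_date` are hypothetical. The second function shows the log comparison Raft applies to vote requests, which is a separate check from the term comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermState:
    current_term: int = 0
    is_leader: bool = False
    voted_for: Optional[int] = None


def observe_term(state: TermState, msg_term: int) -> bool:
    """Apply the term comparison to an incoming message.

    Returns True if the message should be processed further,
    False if the sender is stale and the message must be rejected.
    """
    if msg_term < state.current_term:
        return False  # stale sender; the reply carries our newer term
    if msg_term > state.current_term:
        # Adopt the newer term and step down; the vote belongs to the old term.
        state.current_term = msg_term
        state.is_leader = False
        state.voted_for = None
    return True


def candidate_log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                                my_last_term: int, my_last_index: int) -> bool:
    """Vote-time log check: later last term wins; equal terms compare length."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


# A leader from term 3 comes back from a partition and hears about term 5.
old_leader = TermState(current_term=3, is_leader=True, voted_for=1)
print(observe_term(old_leader, 5), old_leader.is_leader)  # True False
```

Running the term check first means every message handler can assume the sender is at least as current as the receiver.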
Log replication — the boring but critical part
The leader serializes all writes:
sequenceDiagram
autonumber
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
participant F3 as Follower 3
participant F4 as Follower 4
C->>L: set x = 5
Note over L: Append to local log<br/>(uncommitted)
par AppendEntries to all followers
L->>F1: AppendEntries
L->>F2: AppendEntries
L->>F3: AppendEntries
L->>F4: AppendEntries
end
F1-->>L: ACK persisted
F2-->>L: ACK persisted
Note over L: Majority (3/5) ACK ✓<br/>mark COMMITTED
L->>L: Apply to state machine
L-->>C: success
Note over F1,F4: Learn commit on<br/>next heartbeat,<br/>then apply locally
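
The commit decision in the diagram can be sketched as follows, assuming the leader tracks, for each follower, the highest log index that follower has acknowledged as persisted. The names `commit_index` and `match_index` and the index values are illustrative. A real implementation also requires the entry at that index to belong to the leader's current term before committing, which this sketch omits.

```python
from typing import Dict


def majority(cluster_size: int) -> int:
    # Smallest number of nodes that is more than half the cluster.
    return cluster_size // 2 + 1


def commit_index(match_index: Dict[str, int], leader_last_index: int) -> int:
    """Highest log index stored on a majority of the cluster.

    match_index maps follower id -> highest index acknowledged as persisted.
    The leader counts itself, since it appended the entry locally first.
    """
    indexes = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    # After sorting in descending order, the value at position (majority - 1)
    # is held by at least `majority` nodes.
    return indexes[majority(len(indexes)) - 1]


# Five-node cluster: the leader has appended index 7 ("set x = 5").
before_acks = {"F1": 6, "F2": 6, "F3": 6, "F4": 0}
after_two_acks = {"F1": 7, "F2": 7, "F3": 6, "F4": 0}
print(commit_index(before_acks, 7))     # 6
print(commit_index(after_two_acks, 7))  # 7
```

Two follower acknowledgments plus the leader's own copy make three of five, so the entry commits without waiting for Followers 3 and 4.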
Why odd numbers — 3 or 5, not 4
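
The arithmetic behind this heading can be sketched in a few lines, assuming a majority quorum of floor(N/2) + 1. The function names are illustrative.

```python
def quorum(n: int) -> int:
    # Votes or acknowledgments needed: strictly more than half the cluster.
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    # Nodes that can be lost while a quorum is still reachable.
    return n - quorum(n)


for n in (3, 4, 5, 6, 7):
    print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
# nodes=3 quorum=2 tolerated_failures=1
# nodes=4 quorum=3 tolerated_failures=1
# nodes=5 quorum=3 tolerated_failures=2
# nodes=6 quorum=4 tolerated_failures=2
# nodes=7 quorum=4 tolerated_failures=3
```

A fourth node raises the quorum from 2 to 3 but still tolerates only one failure, so it adds replication cost without adding fault tolerance. The next real improvement comes at five nodes.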
Where Raft falls short
Why this knowledge matters
Even if you never implement Raft, you’ll use systems built on it every day. Understanding leader election, terms, and quorum tells you immediately:
- Why your etcd cluster is unhealthy after a region failure
- Why your CockroachDB latency depends on majority placement
- Why MongoDB’s primary changes after a network blip
- Why Consul stops accepting writes during a partition
- Why Kubernetes stops scheduling pods when its etcd backend loses quorum