Raft Consensus

How a cluster of nodes elects a leader and replicates a log so they stay in lockstep — without a coordinator.

TL;DR

Raft is the consensus algorithm behind many modern distributed systems that need strong consistency. A cluster of N nodes elects a leader via randomized election timeouts and majority votes; the leader serializes all writes through a replicated log; an entry is committed once a majority of nodes have persisted it. When the leader fails, a new election picks a replacement, usually within seconds. The core protocol is compact enough to summarize on a single page and is now a common default: etcd, Consul, TiKV, and CockroachDB use it, and MongoDB's replication protocol is Raft-like.

When to use

Any time you need a strongly consistent replicated state machine: distributed databases, configuration stores, lock services, leader-elected schedulers, durable queues. Kubernetes, for example, keeps its cluster state in etcd, which uses Raft. The main alternative is Paxos, which is famously hard to implement; Raft is intentionally simpler and has become a default choice for new systems.

Why Raft exists

For decades, the gold standard for distributed consensus was Paxos, which Lamport first described in 1989 and published in 1998. It works, but it’s notoriously difficult to understand and even harder to implement correctly. Lamport re-explained the algorithm in “Paxos Made Simple” (2001) because the original presentation was widely found hard to follow, and production implementations still had to work out many details on their own.

Raft (Ongaro and Ousterhout, 2014) was designed with understandability as a primary goal. It decomposes consensus into three largely independent sub-problems:

Leader election — who’s in charge?
Log replication — leader appends entries; followers acknowledge.
Safety — committed entries are never lost.

The three-state machine

Every node is always in exactly one of three states:

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate: election timeout<br/>(no heartbeat)
    Candidate --> Leader: wins majority votes
    Candidate --> Follower: higher term seen<br/>OR another wins
    Leader --> Follower: higher term seen
    Candidate --> Candidate: split vote<br/>→ new election

    note right of Follower
        Passive. Listens for
        heartbeats from a leader.
        Resets election timer
        on each heartbeat.
    end note
    note right of Candidate
        Increments term, votes for self,
        sends RequestVote RPCs.
        Random timeout makes
        split votes unlikely.
    end note
    note right of Leader
        Accepts client writes.
        Replicates via AppendEntries.
        Sends heartbeats to keep
        followers from timing out.
    end note
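
The diagram above can be written down as a transition table. This is a minimal sketch, not taken from any particular Raft implementation: the `Role` and `Event` names and the `next_role` helper are illustrative assumptions, and any (role, event) pair the diagram does not show is treated as a no-op.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class Event(Enum):
    # Hypothetical event names chosen for this sketch.
    ELECTION_TIMEOUT = "election_timeout"
    WON_MAJORITY = "won_majority"
    HIGHER_TERM_SEEN = "higher_term_seen"
    OTHER_LEADER_ELECTED = "other_leader_elected"


# (current role, event) -> next role. Pairs not listed leave the role unchanged.
TRANSITIONS = {
    (Role.FOLLOWER, Event.ELECTION_TIMEOUT): Role.CANDIDATE,
    # Split vote: the candidate times out again and starts a new election.
    (Role.CANDIDATE, Event.ELECTION_TIMEOUT): Role.CANDIDATE,
    (Role.CANDIDATE, Event.WON_MAJORITY): Role.LEADER,
    (Role.CANDIDATE, Event.HIGHER_TERM_SEEN): Role.FOLLOWER,
    (Role.CANDIDATE, Event.OTHER_LEADER_ELECTED): Role.FOLLOWER,
    (Role.LEADER, Event.HIGHER_TERM_SEEN): Role.FOLLOWER,
}


def next_role(role: Role, event: Event) -> Role:
    """Return the role a node moves to when `event` occurs in `role`."""
    return TRANSITIONS.get((role, event), role)
```

Note that there is no direct Follower to Leader edge: a node must pass through Candidate and win a majority first.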

Why the term number is the heart of it

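Every Raft message includes the sender’s term. The rule that underpins the protocol’s safety: a node that sees a higher term than its own adopts that term and steps down to follower, and a node that receives a message with a lower term rejects it.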

This single rule handles:

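Stale leaders that recover from a partition (their old term is now smaller; they step down)
Split brain (each node votes at most once per term, so at most one leader can exist in any term, and leaders from older terms step down)
Vote requests from candidates with a stale term (rejected because the term is too low; candidates whose logs are behind are rejected by a separate log up-to-date check)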
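
The term comparison that every node performs on an incoming message can be sketched as follows. The `Node` class and the `observe_term` function are hypothetical names for illustration; a real implementation would also persist `current_term` and `voted_for` to disk before replying.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    current_term: int = 0
    role: str = "follower"
    voted_for: Optional[str] = None  # a vote is only valid within one term


def observe_term(node: Node, message_term: int) -> bool:
    """Apply the term rule; return False if the message is stale and must be rejected."""
    if message_term > node.current_term:
        # Someone is in a newer term: adopt it and step down.
        node.current_term = message_term
        node.role = "follower"
        node.voted_for = None
        return True
    if message_term < node.current_term:
        # The sender is behind (for example a leader returning from a partition).
        return False
    return True
```

For example, a leader in term 3 that receives a message from term 5 ends up as a follower in term 5, while a message from term 2 is rejected and leaves it untouched.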

Log replication — the boring but critical part

The leader serializes all writes:

sequenceDiagram
    autonumber
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant F3 as Follower 3
    participant F4 as Follower 4

    C->>L: set x = 5
    Note over L: Append to local log<br/>(uncommitted)
    par AppendEntries to all followers
        L->>F1: AppendEntries
        L->>F2: AppendEntries
        L->>F3: AppendEntries
        L->>F4: AppendEntries
    end
    F1-->>L: ACK persisted
    F2-->>L: ACK persisted
    Note over L: Majority (3/5) ACK ✓<br/>mark COMMITTED
    L->>L: Apply to state machine
    L-->>C: success
    Note over F1,F4: Learn commit on<br/>next heartbeat,<br/>then apply locally
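
The "majority ACK" step in the sequence above amounts to finding the highest log index stored on a quorum of nodes. A minimal sketch, assuming the leader tracks the highest index each follower has acknowledged; the function names are illustrative, and the sketch omits Raft's additional requirement that the entry at that index belong to the leader's current term before it is committed by counting replicas.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def majority_match_index(leader_last_index: int, follower_match_indexes: list[int]) -> int:
    """Highest log index stored on a majority of the cluster, leader included."""
    indexes = sorted([leader_last_index, *follower_match_indexes], reverse=True)
    # After sorting in descending order, the quorum-th value is held by
    # at least `quorum` nodes.
    return indexes[quorum(len(indexes)) - 1]
```

With the leader at index 7, two followers that acknowledged index 7, and two still at index 6, `majority_match_index(7, [7, 7, 6, 6])` returns 7, which matches the 3/5 majority in the diagram.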

Why odd numbers — 3 or 5, not 4
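
The arithmetic behind this heading is that a quorum is `floor(N/2) + 1` nodes, so the number of failures a cluster survives is N minus the quorum. A minimal sketch with hypothetical helper names `quorum` and `tolerated_failures`:

```python
def quorum(cluster_size: int) -> int:
    """Votes or acknowledgements needed for a majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


def describe(cluster_size: int) -> str:
    return (
        f"{cluster_size} nodes: quorum {quorum(cluster_size)}, "
        f"tolerates {tolerated_failures(cluster_size)}"
    )
```

Calling `describe` for sizes 3, 4, and 5 shows that 4 nodes need a quorum of 3 and tolerate 1 failure, the same as 3 nodes, while 5 nodes tolerate 2.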

Where Raft falls short

Why this knowledge matters

Even if you never implement Raft, you’ll use systems built on it every day. Understanding leader election, terms, and quorum tells you immediately:

Why your etcd cluster is unhealthy after a region failure
Why your CockroachDB latency depends on majority placement
Why MongoDB’s primary changes after a network blip
Why Consul stops accepting writes during a partition
Why Kubernetes stops scheduling pods when its etcd backend loses quorum

🎯 Common interview questions

Q1. Why is randomized election timeout important?

It makes split votes unlikely. If every follower's election timeout were identical, a leader failure could cause all of them to become candidates at the same instant, each voting for itself, with nobody winning a majority. Randomized timeouts (typically drawn from a range such as 150–300 ms) make it likely that one candidate times out first, sends RequestVote messages, and wins before the others time out. If a split vote still happens, the candidates time out again with fresh random values and retry in a new term.
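
A minimal sketch of the idea, assuming the 150–300 ms range mentioned above; `election_timeout_ms` and `first_to_time_out` are hypothetical helpers, and a real node would redraw its timeout every time it resets its election timer.

```python
import random


def election_timeout_ms(rng: random.Random, low: float = 150.0, high: float = 300.0) -> float:
    """Draw a fresh election timeout uniformly from [low, high] milliseconds."""
    return rng.uniform(low, high)


def first_to_time_out(timeouts: dict[str, float]) -> str:
    """The node with the smallest timeout becomes a candidate first."""
    return min(timeouts, key=timeouts.get)


# Each follower draws its own timeout; seeding is only for reproducible demos.
rng = random.Random(7)
timeouts = {name: election_timeout_ms(rng) for name in ("n1", "n2", "n3", "n4", "n5")}
candidate = first_to_time_out(timeouts)
```

Because the draws are continuous, two nodes timing out at exactly the same moment is improbable, which gives the first candidate a head start on collecting votes.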

Q2. How does Raft handle network partitions?

The minority side cannot elect a leader (no candidate can reach a majority), and a leader stranded there cannot commit anything, so writes do not succeed on that side. The majority side keeps or elects a leader, accepts writes, and replicates among the reachable replicas. When the partition heals, nodes from the minority side see the higher term, revert to follower, and have their logs reconciled with the leader's via AppendEntries; any conflicting uncommitted entries on them are overwritten. No committed data is lost: whatever the minority side discards was never committed.
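
The reconciliation step can be sketched from the follower's side. This is a simplified illustration, not a full AppendEntries handler: it assumes a log of `(term, command)` tuples with 1-based indexes, uses a hypothetical `append_entries` function, and leaves out term checks, persistence, and commit index handling.

```python
Entry = tuple[int, str]  # (term, command)


def append_entries(
    log: list[Entry], prev_index: int, prev_term: int, entries: list[Entry]
) -> tuple[bool, list[Entry]]:
    """Return (success, new_log) for a follower receiving entries after prev_index."""
    # Consistency check: the follower must already hold the entry that
    # precedes the new ones, with the same term.
    if prev_index > len(log):
        return False, log
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False, log

    new_log = list(log)
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(new_log):
            if new_log[index - 1][0] != entry[0]:
                # Conflict: drop this entry and everything after it,
                # then take the leader's version.
                del new_log[index - 1:]
                new_log.append(entry)
            # Same term at the same index means the entry is already present.
        else:
            new_log.append(entry)
    return True, new_log
```

A follower holding `[(1, "a"), (1, "b"), (2, "stale")]` that receives `prev_index=2, prev_term=1` with entries `[(3, "c"), (3, "d")]` ends up with `[(1, "a"), (1, "b"), (3, "c"), (3, "d")]`: the uncommitted term-2 entry is overwritten. When the check fails, the leader retries with an earlier `prev_index`.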

Q3. What's the role of the term number?

The term is a monotonically increasing logical clock that identifies each leader regime. Every message carries the sender's term. If a node sees a higher term than its own, it immediately steps down to follower and adopts the new term. This single mechanism handles split brain, stale leaders, and partition recovery: a leader from an older term can never override a newer one.

Q4. How is the log committed?

A leader appends an entry locally and replicates it to followers via AppendEntries. Once a majority of nodes (leader included) have persisted it, the leader marks it "committed" and applies it to the state machine. Followers learn about the commit from the commitIndex carried on the next AppendEntries or heartbeat. A committed entry is durable: even if the leader dies, any future leader will have it, because a node refuses to vote for a candidate whose log is less up to date than its own, and every majority includes at least one node that holds each committed entry.
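
The election rule referred to here can be sketched as a vote-granting check. The function and parameter names are assumptions for illustration; the comparison itself (last log term first, then last log index) follows the rule described in the Raft paper.

```python
from typing import Optional


def log_is_up_to_date(
    cand_last_term: int, cand_last_index: int, my_last_term: int, my_last_index: int
) -> bool:
    """True if the candidate's log is at least as up to date as the voter's."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


def grant_vote(
    current_term: int,
    voted_for: Optional[str],
    candidate_id: str,
    candidate_term: int,
    cand_last_term: int,
    cand_last_index: int,
    my_last_term: int,
    my_last_index: int,
) -> bool:
    if candidate_term < current_term:
        return False  # stale candidate
    # A vote already cast in this same term for someone else is binding.
    if candidate_term == current_term and voted_for not in (None, candidate_id):
        return False
    return log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index)
```

A voter whose log ends at term 3, index 7 refuses a candidate whose log ends at term 2, index 9: a longer log does not help if its last entry comes from an older term.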

Q5. Why 5 nodes (or 3)? Why odd numbers?

Quorum is `floor(N/2) + 1`. For 3 nodes you need 2 to commit (tolerates 1 failure). For 5 you need 3 (tolerates 2 failures). Going to an even size doesn't add fault tolerance: 4 nodes need 3 votes (tolerates 1 failure, same as 3 nodes) while adding a fourth replica that must be kept in sync. So odd sizes are usually the better choice unless you have a specific reason for an even count (for example, geographic distribution).
