Home / Cosmos DB / Foundations
F1
🌀
Lesson · 13 min

Replication & Consensus

How distributed databases stay in sync — Paxos, Raft, quorums, the math behind it.

TL;DR

Replication keeps multiple copies of your data so no single failure loses it; consensus protocols (Paxos, Raft) make all those copies agree on the order of writes despite network partitions and node crashes. Cosmos DB runs a Paxos-derived protocol per partition — every write hits a quorum of 4 replicas before it's acknowledged.

Key takeaways
  • Replication topologies fall into three families — leader-follower, multi-leader, leaderless. Pick based on whether you need single-region writes, multi-region writes, or maximum availability.
  • Consensus = "agree on the next operation". Paxos is the original; Raft is the modern, simpler descendant. Both require a majority quorum, both tolerate up to N/2 - 1 failures with 2N+1 nodes.
  • Quorum reads/writes (Dynamo-style) trade strict ordering for tunable availability — set R + W > N for strong consistency, or relax for speed.
  • Cosmos uses 4-replica sets per partition with majority quorum, plus an extra "tail" replica for read scaling. A write isn't durable until the quorum confirms.
  • Replication factor is the lever — more replicas = more durability + higher write latency + more storage cost. The math is identical across all distributed databases.

A database with one copy of your data has one bad day away from being a database with zero copies. Replication is the answer — keep data on multiple machines so any one of them can fail without losing anything. The complication is that “multiple copies” implies “they have to agree on which copy is correct”. That agreement problem is what consensus protocols solve.

This lesson is the foundational pair — replication topologies + consensus. Once you know these, every distributed database starts looking like a variation on the same theme.

The three replication topologies

Leader-follower (a.k.a. primary-replica, single-leader)

One node is the leader — every write goes there. Followers asynchronously (or synchronously) copy the leader’s write log. Reads can come from any replica.

        ┌──────────┐
write → │  Leader  │
        └────┬─────┘
             │ replicate log
        ┌────┴─────┬────────┐
        ▼          ▼        ▼
    Follower 1  Follower 2  Follower 3

Pros — simple, strong ordering, easy reasoning. Cons — leader is a write bottleneck and a single point of failure for writes (followers can keep serving reads). Failover requires a leader election (which is consensus).

Used by — PostgreSQL streaming replication, MySQL primary-replica, MongoDB replica sets, Kafka partitions.

Multi-leader

Multiple nodes accept writes, each replicating to the others. Useful when geographic distance makes a single leader too slow.

   write ←→ ┌──────────┐ ←→ ┌──────────┐ ←→ write
            │ Leader 1 │     │ Leader 2 │
            │  (US)    │     │  (EU)    │
            └──────────┘     └──────────┘

Pros — local-write latency, survives a single-leader failure with no election. Cons — conflict resolution. If two leaders accept conflicting writes (same key, same time), you need a strategy — last-write-wins, custom merge, CRDTs, manual reconciliation.

Used by — Cosmos DB multi-region writes, MySQL Galera, BDR, CouchDB, some Cassandra deployments.

Leaderless (Dynamo-style)

No node is special. Clients write to multiple replicas in parallel; reads also hit multiple replicas and reconcile.

                ┌──────────┐
       ┌──────→ │ Replica1 │
write  ├──────→ │ Replica2 │
       └──────→ │ Replica3 │
                └──────────┘

Pros — no leader to fail over, maximum availability, smooth handling of partitions. Cons — no canonical ordering. Conflicts are common; resolved with vector clocks, last-write-wins, or CRDTs.

Used by — Amazon DynamoDB, Cassandra, Riak, ScyllaDB.

Quorum: the unifier

All three topologies eventually use quorum to balance consistency vs availability:

  • N = total replicas
  • W = replicas that must acknowledge a write
  • R = replicas that must respond to a read

If R + W > N, you’re guaranteed strong consistency — any read intersects with at least one node from the most recent write. Below that threshold, reads can return stale data.

A common Dynamo-style setting — N=3, W=2, R=2 — gives you strong consistency while tolerating one node failure on either path. Tweak to N=3, W=1, R=1 for maximum write availability with eventual reads. Same database, different knobs per request.

Consensus: the formal version

Quorum gives you durability. Consensus gives you a totally ordered sequence of operations that every replica agrees on.

The classic consensus problem — given a cluster of N nodes that may crash, propose a value such that all non-crashed nodes eventually agree on the same chosen value, even if some messages are lost or delayed.

The two algorithms you’ll see referenced:

Paxos (Lamport, 1989)

The original, dense, hard-to-read paper. Multi-Paxos handles a sequence of values (a log) by reusing the leader between rounds. Used internally by Google (Chubby, Spanner), Microsoft (Cosmos’s protocol), and many academic systems.

Raft (Ongaro & Ousterhout, 2014)

Designed specifically to be simpler than Paxos. Three crisp roles:

  • Follower — passive, awaits messages
  • Candidate — temporarily campaigns to become leader
  • Leader — handles writes, replicates log, sends heartbeats

A leader is elected when followers stop hearing heartbeats and time out. Each term has at most one leader. Writes go through the leader, which appends to its log, replicates to a majority, then commits.

Raft is what etcd, Consul, CockroachDB, TiDB, and most newer distributed systems use. There’s an interactive Raft simulation in our visualization library — kill the leader and watch the next election.

How Cosmos DB applies all of this

Each Cosmos DB partition is a replica set of 4 — one primary, three secondaries — running a Paxos-derived consensus protocol internally. A write is acknowledged when a quorum (3 of 4) confirms it.

In a multi-region account, each region has its own 4-replica set. The replicas across regions are coordinated by a higher-level multi-master protocol that handles cross-region replication and conflict resolution. The result — local strong-consistency within a region, tunable cross-region consistency.

This is the same pattern most modern globally-distributed databases use — local quorum + global multi-master. Spanner does it with TrueTime atomic clocks. CockroachDB does it with HLCs and Raft. Cosmos does it with logical clocks and Paxos. The math underneath is the same.

What this gives you in production

  • Durability — losing one machine doesn’t lose any acknowledged write.
  • Availability — losing one machine doesn’t stop the cluster from accepting writes.
  • Ordering — every replica agrees on the order of operations within a partition.
  • Failover — when a primary dies, the cluster elects a new one in seconds, without operator intervention.

You don’t see any of this. From your application’s perspective, Cosmos is a single endpoint that accepts writes and returns reads. The four-node consensus and the multi-region coordination are the database’s problem. That’s the whole point of a managed system — you get the guarantees of a hand-tuned distributed cluster with a checkbox.

Where this connects

  • Raft Consensus — interactive viz of a Raft cluster electing a leader. Same shape as the Paxos protocol Cosmos uses internally.
  • CAP Theorem — replication forces you to choose between consistency and availability under partition. CP systems (Cosmos in strong mode) refuse some requests; AP systems (Cosmos in eventual mode) keep serving but may serve stale data.
  • F4 Cosmos Architecture — how the replica set sits inside the broader Cosmos topology (gateway, partition manager, federation).
  • F5 Consistency Models — given replication, what consistency levels can the system actually offer?

The takeaway — replication and consensus are the bedrock of every distributed database. Once you know how a 4-replica quorum works, you understand the spine of every system from PostgreSQL streaming replication to Spanner.

🎯 Common questions
Q1. What's the difference between replication and consensus?

Replication is the act of copying data to multiple nodes. Consensus is the act of agreeing on what gets copied (and in what order). You can replicate without consensus — just async ship a log to a follower — but you'll lose data on failover. Real distributed databases combine both — replicate for durability, consensus for ordering.

Q2. Why does consensus need a majority?

To prevent split-brain. If a 5-node cluster partitions into 3 + 2, only the 3-node side has a majority and can elect a new leader. The 2-node side knows it doesn't have quorum and refuses to make decisions. When the partition heals, the 2-node side catches up to the 3-node side's log. Without majority, both sides might elect competing leaders and accept conflicting writes.

Q3. How does Cosmos pick which replica to read from?

For strong consistency, the read must hit the leader (or a quorum of replicas). For session/eventual, reads can hit any replica — typically the closest one geographically. The SDK's connection-mode setting (`Direct` vs `Gateway`) determines whether the client reaches replicas directly or through a load-balancing gateway.

Q4. What's the trade-off between Raft and Paxos?

Paxos is older and more general — multi-Paxos handles a sequence of values, fast Paxos optimizes round-trips, etc. Raft was designed in 2014 specifically to be more understandable, with crisp roles (leader, follower, candidate) and clear state transitions. Most modern systems (etcd, Consul, CockroachDB) use Raft. Microsoft's internal protocols are Paxos-derivatives — established before Raft existed.

Q5. How many replicas should I have?

For each independent failure you want to tolerate, you need 2 more replicas. Tolerate 1 failure → 3 replicas. Tolerate 2 → 5 replicas. Beyond 5, write latency dominates — every additional replica is one more node a write must reach. Cosmos uses 4 replicas as a sweet spot — tolerates 1 failure with majority quorum, adds a 4th for read scaling.

📺 Video

The lesson video is on YouTube — coming once the upload goes public.

🧪 Simulator

A live simulator for this lesson's mechanic (e.g. RU calculator, partition-key picker). Coming in Phase 2.

💻 Code

A copy-paste reference snippet plus a short build challenge.

Comments 0

Discuss this page. Markdown supported. Be kind.

Loading…
Loading comments…