Consistency Models — Cosmos DB

TL;DR

A consistency model is the contract a database makes about what reads can see relative to writes. Strong (linearizable) means every read sees the latest write — slow but airtight. Eventual means replicas converge eventually — fast but reads can be stale. Cosmos exposes five named levels — Strong, Bounded Staleness, Session, Consistent Prefix, Eventual — that span the spectrum, picked per-account with per-request overrides.

Key takeaways

▸Strong consistency = linearizability. Every read returns the latest committed value, globally. The price: in a multi-region account, every write must be committed across regions before it is acknowledged, so writes pay cross-region round-trips.
▸Eventual = "replicas converge if writes stop arriving". Cheap, fast, but reads can return arbitrarily old data and you can see writes out of order.
▸Session consistency = read-your-writes within a single client session. Other sessions may see stale data. This is the default and what most apps want.
▸Bounded staleness = "stale by at most K versions or T seconds". Useful when you can tolerate some lag but need a predictable upper bound (leaderboards, dashboards).
▸Consistent prefix = "you'll see writes in commit order, even if not the latest". Good for feeds and comment threads where order matters more than freshness.
▸PACELC adds nuance to CAP — even when there's no Partition, you trade Latency for Consistency. Cosmos surfaces both axes as explicit knobs.

If replication and consensus are the mechanism, consistency is the promise. A consistency model is what the database guarantees you’ll see when you read after a write. The whole spectrum exists because there’s no free lunch — stronger guarantees cost more latency and more availability. Different applications have different tolerances. The art is matching guarantee to need.

Cosmos DB stands out partly because it exposes five named consistency levels with crisp definitions and per-request overrides — which is rarer than you’d think.

The spectrum, illustrated

Imagine a single key whose value progresses through a series of writes:

Time:    →
Writes:  W1=A    W2=B    W3=C    W4=D    W5=E

A reader hitting any replica at time T might see, depending on consistency:

Model	Reader sees
Strong	The latest write that has been committed by quorum at time T. Always.
Bounded staleness (K=2)	Anything from W3 → W5 (at most 2 versions behind the latest)
Session	The latest write this session has performed. Possibly older for writes by other sessions.
Consistent prefix	A subsequence in commit order — could be `[A, B, C]` or `[A, B]`, never `[A, C]` (skipping B).
Eventual	Anything. Could be `A` even after `E` was written. Order not guaranteed.

The five levels span the trade-off — guarantee strength on one axis, latency + availability cost on the other.
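To make the table concrete, here is a minimal Python sketch that computes which values a single read may return under each level. The names `Level`, `allowed_reads`, and `session_floor` are illustrative assumptions, not part of any Cosmos DB SDK, and the model covers one key and one read at a time.

```python
from enum import Enum


class Level(Enum):
    STRONG = "strong"
    BOUNDED_STALENESS = "bounded_staleness"
    SESSION = "session"
    CONSISTENT_PREFIX = "consistent_prefix"
    EVENTUAL = "eventual"


def allowed_reads(level, history, k=2, session_floor=0):
    """Return the values one read may legally return.

    history:       committed values, oldest first.
    k:             how many versions a bounded-staleness read may lag.
    session_floor: index of the newest write this session wrote or already read
                   (hypothetical stand-in for a session token).
    """
    latest = len(history) - 1
    if level is Level.STRONG:
        oldest = latest                  # only the latest committed write
    elif level is Level.BOUNDED_STALENESS:
        oldest = max(0, latest - k)      # at most k versions behind
    elif level is Level.SESSION:
        oldest = session_floor           # never older than the session's own writes
    else:
        # Consistent prefix and eventual: any committed version is possible
        # for a single read; they differ in ordering across reads.
        oldest = 0
    return history[oldest:]


history = ["A", "B", "C", "D", "E"]
for level in Level:
    # session_floor=3 models a session whose last write was W4=D
    print(level.value, allowed_reads(level, history, k=2, session_floor=3))
```

For a single read, consistent prefix and eventual allow the same set of values; the difference between them only shows up in what a sequence of reads may observe.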

The five named levels in detail

Strong (linearizability)

Every read returns the result of the most recently committed write, as if the entire system were a single node. Formally — a read at time T returns whatever write was committed last in real-time before T.

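What it costs you: in a multi-region account, every write has to be committed in multiple regions before it is acknowledged. That’s at least one cross-region round-trip per write, so expect 100+ ms write latency when the regions are far apart. Reads are still served from replicas in the local region.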

What it gives you — total ordering. You can use Cosmos as a coordination primitive (lock service, sequence generator, ledger).

Bounded staleness

You configure two parameters — K (number of stale versions) and T (staleness duration). Reads can lag the latest write by at most K versions or T seconds, whichever is hit first.

Use cases

Multi-region read replicas where you need low latency but can tolerate “fresh-ish” data.
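Leaderboards, dashboards, analytics, where something like K=100 versions or T=60 seconds is fine (illustrative values; Cosmos DB enforces minimum values for K and T, and the minimums are much higher for multi-region accounts).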

Note: bounded staleness reads are monotonic. Once you read version V, you’ll never see anything older than V on the same client.
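The "K versions or T seconds" rule can be sketched as a small check. This is a simplified model, not Cosmos DB's implementation: `StalenessBound`, `replica_lag`, and `within_bound` are hypothetical names, and time lag is measured here as the age of the oldest write the replica has not applied yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StalenessBound:
    max_versions: int   # K
    max_seconds: float  # T


def replica_lag(commit_times, applied, now):
    """Return (version_lag, seconds_lag) for a replica.

    commit_times[i] is the commit time of version i + 1;
    applied is how many versions the replica has applied so far.
    """
    version_lag = len(commit_times) - applied
    if version_lag == 0:
        return 0, 0.0
    # Age of the oldest write the replica is still missing.
    return version_lag, now - commit_times[applied]


def within_bound(bound, version_lag, seconds_lag):
    # The bound is broken as soon as either limit is exceeded.
    return version_lag <= bound.max_versions and seconds_lag <= bound.max_seconds


commit_times = [0.0, 10.0, 20.0, 30.0]
lag = replica_lag(commit_times, applied=2, now=35.0)
print(lag)
print(within_bound(StalenessBound(max_versions=100, max_seconds=60.0), *lag))
```

Because both limits apply at once, a replica that is only two versions behind can still be out of bounds if those two writes are older than T.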

Session

The default. Within a single client session (defined by a session token returned by the SDK), reads see all of your writes — read-your-writes consistency. Other sessions may see staler data.

Why it’s the default — it solves the most jarring user-facing problem (post-then-can’t-see-post) at near-eventual cost. The session token is small (a few bytes per partition); its overhead is invisible.

Tip: the session token is tracked per CosmosClient instance. If you serve traffic across multiple processes (typical web app), each process has its own session, so one process is not guaranteed to read another’s writes. Mitigate by routing all requests for one user through one app instance, by passing the session token along with the request so another instance can present it, or by accepting that “session” means per client instance.
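A toy model shows why a second client instance can miss a write that the first one always sees. It assumes the session token is a single integer version, which is a deliberate simplification; `Replica` and `SessionClient` are hypothetical classes, not SDK types.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Ignore anything older than what the replica already holds.
        if version > self.version:
            self.version, self.value = version, value


class SessionClient:
    def __init__(self, token=0):
        self.token = token  # highest version this session has written or read

    def write(self, primary, value):
        primary.apply(primary.version + 1, value)
        self.token = primary.version

    def read(self, replicas):
        # Accept the first replica that has caught up to the session token.
        for replica in replicas:
            if replica.version >= self.token:
                self.token = replica.version  # keeps later reads monotonic
                return replica.value
        raise RuntimeError("no replica has caught up to this session yet")


primary, lagging = Replica(), Replica()
alice = SessionClient()
alice.write(primary, "new post")

own_read = alice.read([lagging, primary])                # skips the lagging replica
other_read = SessionClient().read([lagging, primary])    # fresh session, may be stale
shared_read = SessionClient(token=alice.token).read([lagging, primary])
print(own_read, other_read, shared_read)
```

The third read models a second app instance that was handed the first instance's token: it gets the same read-your-writes behavior without sharing a process.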

Consistent prefix

Reads always see writes in commit order, but possibly not the latest. You’ll see A, B, C or A, B but never A, C (which would mean B was skipped).

Use cases

Comment threads, message feeds, audit logs — places where order matters more than freshness.
Replaying events in order.
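One way to picture the prefix guarantee is a replica that buffers writes arriving early and only exposes them once every earlier write has been applied. This is an illustrative mechanism under assumed sequence numbers, not a description of how Cosmos DB replicates; `is_consistent_prefix` and `PrefixReplica` are hypothetical names.

```python
def is_consistent_prefix(observed, commit_log):
    """True if observed is exactly the first len(observed) committed writes."""
    return list(observed) == list(commit_log[:len(observed)])


class PrefixReplica:
    def __init__(self):
        self.applied = []   # values visible to readers, in commit order
        self.pending = {}   # seq -> value for writes that arrived early

    def receive(self, seq, value):
        if seq <= len(self.applied):
            return  # duplicate delivery of an already applied write
        self.pending[seq] = value
        # Apply every buffered write whose predecessors are all applied.
        while len(self.applied) + 1 in self.pending:
            self.applied.append(self.pending.pop(len(self.applied) + 1))


commit_log = ["A", "B", "C"]
replica = PrefixReplica()
replica.receive(1, "A")
replica.receive(3, "C")            # arrives before B, so it stays hidden
after_gap = list(replica.applied)
replica.receive(2, "B")            # fills the gap, B and C become visible
print(after_gap, replica.applied)
```

Readers of this replica can lag behind the commit log, but every state they observe passes `is_consistent_prefix`.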

Eventual

The weakest guarantee. Writes propagate asynchronously, and a read returns whatever the replica it lands on currently holds. Replicas are only guaranteed to converge once writes stop arriving for long enough.

Use cases

View counts, like buttons, notification badges — anywhere stale-by-seconds is invisible.
Background analytics, log aggregation.

The SDK still tries to read from the closest replica — eventual is fast but accepts you might briefly see old data.
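The sketch below simulates two replicas that receive the same five writes in different orders. It assumes a last-writer-wins rule keyed on a version number, which is a common convergence strategy chosen here for illustration and not a claim about Cosmos DB internals; `LwwReplica` and the region names are hypothetical.

```python
class LwwReplica:
    """Keeps the value with the highest version seen so far (last writer wins)."""

    def __init__(self):
        self.version = 0
        self.value = None

    def receive(self, version, value):
        if version > self.version:
            self.version, self.value = version, value


writes = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E"}
east, west = LwwReplica(), LwwReplica()
east_order = [3, 1, 5, 2, 4]   # same writes, different delivery order
west_order = [2, 4, 1, 3, 5]

midway = []
for e, w in zip(east_order, west_order):
    east.receive(e, writes[e])
    west.receive(w, writes[w])
    midway.append((east.value, west.value))

print(midway)
```

While deliveries are in flight the two replicas disagree, so a reader that alternates between them can see a newer value followed by an older one. Once every write has been delivered, both hold the same value.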

Linearizability vs serializability

These two terms get conflated constantly. Distinct concepts:

Linearizability — about ordering of single-object operations across distributed replicas. “After my write, every read sees the new value, immediately.”
Serializability — about ordering of multi-object transactions. “The result is equivalent to some serial ordering of all concurrent transactions.”

A database can be:

Linearizable + serializable (Spanner, CockroachDB at SERIALIZABLE)
Linearizable + not serializable (Cosmos at Strong, single-key only — multi-key transactions need explicit batches)
Serializable + not linearizable (a SERIALIZABLE SQL database that lets a transaction read from a stale snapshot or a lagging replica: the outcome still matches some serial order, but not necessarily the real-time one)

Cosmos’s transactional batch API gives you serializability within a partition (because all docs share a replica set). It does not give you cross-partition serializability — that’s a known limitation.
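The real-time requirement of linearizability can be tested on a recorded history of operations. The sketch below checks one necessary condition only: a read that starts after a write has finished must not return an older version. It assumes a single key and version numbers that increase in write order; `Op` and `stale_reads` are hypothetical names, and passing this check does not prove a history is linearizable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str      # "write" or "read"
    version: int   # version written, or version the read returned
    start: float   # time the client issued the operation
    end: float     # time the client received the response


def stale_reads(history):
    """Return reads that began after a newer write had already completed."""
    writes = [op for op in history if op.kind == "write"]
    violations = []
    for op in history:
        if op.kind != "read":
            continue
        # Newest version whose write finished before this read started.
        newest_done = max((w.version for w in writes if w.end < op.start), default=0)
        if op.version < newest_done:
            violations.append(op)
    return violations


history = [
    Op("write", 1, 0.0, 1.0),
    Op("write", 2, 2.0, 3.0),
    Op("read", 1, 4.0, 5.0),   # starts after write 2 finished, returns 1: stale
    Op("read", 2, 4.0, 5.0),   # fine
    Op("read", 1, 2.5, 3.5),   # overlaps write 2, so either version is allowed
]
print(stale_reads(history))
```

The third read in the example overlaps the second write in time, which is why returning the older version there is still acceptable.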

CAP, PACELC, and the real trade-off

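The CAP theorem is usually summarized as “Consistency, Availability, Partition tolerance: pick two”. In practice partitions cannot be ruled out, so the real choice is between consistency and availability, and it only arises during a partition, which is rare.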

PACELC (pronounced “pass-elk”) extends it: if there is a Partition, choose Availability or Consistency; Else (no partition), choose Latency or Consistency. The Else branch is the trade-off you actually live with day-to-day.

A multi-region Cosmos account at Strong consistency:

Under partition — favors Consistency (refuses some reads/writes; this is CP)
In normal operation: favors Consistency (at the cost of higher Latency, since writes must be committed across regions before they are acknowledged)

The same account at Session:

Under partition — favors Availability (replies with possibly-stale local data; this is AP)
In normal operation — favors Latency (local reads are fast)

Practical takeaway — when picking a consistency level, ask both questions:

Under partition, do I prefer to refuse the request or serve stale data?
In normal operation, am I willing to pay cross-region latency for cross-region freshness?

The five-level Cosmos menu lets you answer both per account, and then relax the answer per request.

Picking a level

A heuristic that works for most apps:

App type	Default	Override
User-facing CRUD app	Session	Eventual on staleness-tolerant reads (overrides can only relax the default)
Real-time dashboard	Bounded staleness (K=100, T=30s)	—
Financial / ledger	Strong	—
Analytics / reporting	Eventual	—
Social feed	Consistent prefix	—
IoT telemetry	Eventual	—

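Each operation should run at the laxest level it can tolerate. In Cosmos DB, per-request overrides can only relax the account default, never tighten it, so set the account default to the strictest level any operation truly needs and relax it on the requests that can tolerate more staleness. A default that is tighter than necessary means you’re paying tighter-consistency cost on every request that doesn’t override it, even ones that don’t need it.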

Where this connects

F1 Replication & Consensus — consistency models are what consensus protocols make possible. Strong consistency requires a quorum write before ack. Session relies on session tokens (vector clocks). Eventual relies only on async propagation.
CAP Theorem visualization — interactive triangle showing CP / AP / CA trade-offs.
V04 Consistency Levels — applied lesson — when to override per-request and how the SDK exposes the levels.
V12 Global Distribution — multi-region adds a new dimension; the same five levels apply, but propagation times become first-class.

If you remember one thing — the right consistency level is the laxest one you can tolerate, because every notch tighter pays measurable latency and availability cost. Stronger isn’t safer; it’s more expensive. Match the model to the use case.

🎯 Common questions

Q1. What's the difference between linearizability and serializability?

Linearizability is about *single-key* operations across distributed replicas — every read sees the latest write to that key, in real-time order. Serializability is about *multi-key transactions* — the result is equivalent to some serial ordering of transactions. They're orthogonal. A database can be linearizable but not serializable (most NoSQL with strong consistency), or serializable but not linearizable (many SQL databases at non-strict isolation levels). Cosmos's strong-consistency level is linearizable; transactional batches within a partition add serializability for that scope.

Q2. Why is session consistency the right default for most apps?

Because it captures the *user-visible* invariant — "when I create a thing, I see it in the next request". Cross-session staleness (you see your friend's post 200 ms late) is invisible to users; intra-session staleness (you create a post and don't see it for 200 ms) is jarring. Session consistency gives you the user-facing strong guarantee at eventual-consistency cost, by tracking a per-session token in the SDK.

Q3. How does Cosmos implement session consistency under the hood?

Every write returns a session token (essentially a vector clock entry). Subsequent reads from the same client send that token; replicas only serve the read if they've seen at least that version. If they haven't, they wait a brief window or forward to a more up-to-date replica. The session token travels in HTTP headers and is opaque to your code.

Q4. What's PACELC and why should I care?

An extension of CAP. CAP says: in a Partition, you choose Availability or Consistency. PACELC adds: Else (no partition), you choose Latency or Consistency. Even a healthy multi-region cluster can't give you both strong consistency AND low latency: with Strong, every write must be committed across regions before it is acknowledged, which costs cross-region round-trips. Most production decisions are PACELC trade-offs, not CAP trade-offs (because you rarely have actual partitions, but you always have latency).

Q5. When would I actually use Strong consistency?

Three patterns. (1) Financial — balances, ledgers, anything where reading a stale value is dangerous. (2) Leases / locks — only one process should think it holds the lock at a time. (3) Sequence generation — global counters where every reader must see the latest value. For everything else (UI reads, feeds, dashboards, analytics), strong is overkill and pays unnecessary latency.

Q6. Can I mix consistency levels?

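Yes, and you should. Set a default at the account level (typically Session); override per request when needed. In Cosmos DB an override can only relax the account default, not strengthen it. Pattern: Session for normal user reads, Eventual for analytics or background jobs. If a few reads must be linearizable (e.g., balance check before a payment), the account default itself has to be Strong, with everything else relaxed per request or per client. The SDK exposes this as `RequestOptions.ConsistencyLevel`.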
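One constraint matters when mixing levels: per the Cosmos DB documentation, a per-request override can only relax the account default, not strengthen it. The sketch below models that rule in plain Python. `STRENGTH` and `effective_level` are hypothetical names for illustration, not SDK calls, and raising `ValueError` is just this model's way of signalling a rejected request.

```python
# Weakest to strongest, matching the five named levels.
STRENGTH = ["Eventual", "ConsistentPrefix", "Session", "BoundedStaleness", "Strong"]


def effective_level(account_default, requested=None):
    """Return the level a request runs at, or raise if it asks for more
    than the account default allows."""
    if requested is None:
        return account_default
    if STRENGTH.index(requested) > STRENGTH.index(account_default):
        raise ValueError(
            f"{requested} is stronger than the account default {account_default}"
        )
    return requested


print(effective_level("Strong", "Session"))   # relaxing is allowed
print(effective_level("Session"))             # no override, default applies
```

Under this rule an account that needs even a few linearizable reads has to default to Strong and relax everything else, rather than default to Session and tighten selectively.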