Cosmos DB Architecture (4-Layer Path) — Cosmos DB

TL;DR

A request to Cosmos DB travels through four layers — Gateway (auth + routing), Federation (resource resolution), Partition Manager (per-partition replica set), and Storage Engine (Bw-tree on disk). Knowing what happens at each layer is the difference between guessing why something is slow and seeing exactly which layer to look at.

Key takeaways

▸ Layer 1: Gateway. Public endpoint. Handles auth, routing, retries, region selection. Adds ~5–10 ms over Direct mode.
▸ Layer 2: Federation. Cluster-level metadata. Maps account → database → container → physical partitions. Caches aggressively; rarely on the critical path.
▸ Layer 3: Partition Manager. Each partition is a 4-replica set running Paxos-derived consensus. The actual write/read happens here. Latency budget: 1–5 ms within a region.
▸ Layer 4: Storage Engine. Per-replica Bw-tree on local SSD. Indexing, page cache, durability all live here.
▸ Direct mode (TCP) talks directly to partition replicas; Gateway mode (HTTPS) routes through Layer 1. Direct is faster but requires open TCP ports. Most apps should use Direct in production.

When you call await container.ReadItemAsync(id, partitionKey), what actually happens? A surprising amount, distributed across four layers each running on different hardware, often in different physical buildings. Each layer adds latency and capability; understanding the path makes operational debugging tractable.

This lesson maps the request flow end-to-end.

The big picture

   ┌─────────────────────────────────────────────────────────────┐
   │  Your App  (CosmosClient SDK, holds routing cache)          │
   └────────────────────┬────────────────────────────────────────┘
                        │
              Gateway mode  or  Direct mode (TCP)
                        │            │
                        ▼            │
   ┌─────────────────────────┐       │
   │  Layer 1: Gateway       │       │
   │  Auth · Routing · Retry │       │
   └─────────┬───────────────┘       │
             │                       │
             ▼                       ▼
   ┌─────────────────────────────────────────────┐
   │  Layer 2: Federation                        │
   │  Account → Database → Container → Partition │
   └─────────────────────┬───────────────────────┘
                         │ "go to physical partition X"
                         ▼
   ┌─────────────────────────────────────────────┐
   │  Layer 3: Partition Manager                 │
   │  4-replica set · Paxos · primary + 3 sec.   │
   └─────────────────────┬───────────────────────┘
                         │ "do the read/write"
                         ▼
   ┌─────────────────────────────────────────────┐
   │  Layer 4: Storage Engine                    │
   │  Bw-tree · index · page cache · WAL         │
   └─────────────────────────────────────────────┘

Each layer has a clear, separable responsibility. Each can fail independently, and each failure degrades the system in a recognizable way.

Layer 1: The Gateway

The Gateway is the public face of Cosmos. Every account has a regional gateway service (<your-account>.documents.azure.com) that handles:

▸ Authentication: validates the master key signature or AAD token on every request.
▸ Authorization: applies RBAC role assignments.
▸ Routing: determines which physical partition a request belongs to (queries the federation layer for the routing table).
▸ Region selection: for multi-region accounts, picks the closest write or read region.
▸ Retry logic: handles transient faults inside the cluster.

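In Gateway mode, your SDK talks only to the Gateway: every request is an HTTPS request to <your-account>.documents.azure.com (the HTTP verb depends on the operation; a point read, for example, is a GET). The Gateway then forwards it to the right partition over internal TCP.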

In Direct mode, your SDK still talks to the Gateway initially (for connection setup and routing-cache refresh), but actual reads and writes go straight to partition replicas via TCP. The Gateway becomes a sideline service for metadata only.

Latency cost — Gateway mode adds 5–10 ms per request because every operation does an extra hop. Direct mode skips that hop.
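As a rough back-of-the-envelope check, the ranges quoted in this lesson can be added up per connection mode. The sketch below is only arithmetic on those quoted figures; the names `GATEWAY_HOP_MS`, `REPLICA_SET_MS`, and `latency_range_ms` are made up for the example, and it ignores the network time between your app and the region.

```python
# Figures quoted in this lesson (milliseconds, low and high end of each range).
GATEWAY_HOP_MS = (5.0, 10.0)   # extra hop through Layer 1 in Gateway mode
REPLICA_SET_MS = (1.0, 5.0)    # Layer 3 latency budget within a region


def latency_range_ms(mode: str) -> tuple[float, float]:
    """Return the (low, high) in-region latency estimate for a connection mode."""
    low, high = REPLICA_SET_MS
    if mode == "gateway":
        # Gateway mode pays the extra hop on every request.
        low += GATEWAY_HOP_MS[0]
        high += GATEWAY_HOP_MS[1]
    elif mode != "direct":
        raise ValueError(f"unknown connection mode: {mode!r}")
    return (low, high)


if __name__ == "__main__":
    for mode in ("direct", "gateway"):
        print(mode, latency_range_ms(mode))
```

The point of the exercise is proportion rather than precision: on these numbers the Gateway hop can cost more than the replica-set work itself.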

Layer 2: Federation

A “federation” is Cosmos’s term for a cluster of physical machines that host many partitions across many tenants. The Federation layer maintains the global routing table:

Account "myapp"
  Database "prod"
    Container "users"
      Partition Key Range [00..00 → 7F..FF]  →  Federation 12, Replica Set 5482
      Partition Key Range [80..00 → FF..FF]  →  Federation 12, Replica Set 5483
    Container "orders"
      ...

When a partition splits, the Federation layer updates the table; SDK clients refresh their cache the next time they hit a stale entry. The Federation layer also coordinates partition migration when one federation runs out of capacity.
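The lookup that the routing table supports can be sketched in a few lines of Python. This is an illustration only: the hash function, the one-byte key space, and the names `ROUTING_TABLE`, `hash_partition_key`, and `route` are assumptions made for the example, not Cosmos DB internals.

```python
import bisect
import hashlib

# Hypothetical routing table for one container: each entry is
# (inclusive upper bound of the hash range, replica set id).
# The two ranges mirror the example table above, shrunk to a single byte.
ROUTING_TABLE = [
    (0x7F, "replica-set-5482"),
    (0xFF, "replica-set-5483"),
]
UPPER_BOUNDS = [upper for upper, _ in ROUTING_TABLE]


def hash_partition_key(partition_key: str) -> int:
    """Map a partition key to a one-byte hash (illustrative, not Cosmos's hash)."""
    return hashlib.sha256(partition_key.encode("utf-8")).digest()[0]


def route(hashed_key: int) -> str:
    """Return the replica set whose hash range contains the hashed key."""
    # bisect_left finds the first range whose upper bound is >= the key.
    index = bisect.bisect_left(UPPER_BOUNDS, hashed_key)
    return ROUTING_TABLE[index][1]


if __name__ == "__main__":
    key = "user-42"
    print(key, "->", route(hash_partition_key(key)))
```

In this model a split replaces one entry with two narrower ranges. A client still holding the old table would route to a replica set that no longer owns the range, which is the stale-cache situation that triggers a refresh.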

You never see this layer directly. It’s an internal control plane. It matters only during infrequent events such as partition splits or tenant migration, when the SDK may briefly hold stale routing-cache entries until it refreshes them.

Layer 3: Partition Manager (per-partition replica set)

Each partition is a replica set of 4 — one primary, three secondaries — running on different fault domains within a region. This layer:

▸ Runs the consensus protocol (Paxos-derived) to agree on the order of writes.
▸ Handles leader election when the primary fails.
▸ Coordinates replication of the write log to secondaries.
▸ Manages read load balancing across replicas (within consistency constraints).

A write here looks like:

1. SDK or Gateway sends write to the primary.
2. Primary writes to its WAL on local SSD (durable).
3. Primary replicates the WAL entry to the three secondaries.
4. Once 3 of 4 replicas (quorum) confirm, the write is committed.
5. Primary acks the SDK.

This entire dance typically takes 2–5 ms within a region. Most of that is replica round-trips; the local SSD write is microseconds.

The replica set is what gives Cosmos its durability + availability guarantees. Lose one machine — the cluster keeps writing. Lose the primary — a new one is elected from the secondaries in seconds. Lose two of four — writes block until a replica recovers (rare but possible in major cloud incidents).
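The quorum rule and its failure behavior can be sketched as a toy model. Everything here is hypothetical scaffolding (the `Replica` class, `replicate_write`, `make_replica_set`); it only counts acknowledgements and does not implement Paxos, leader election, or a real log.

```python
from dataclasses import dataclass

WRITE_QUORUM = 3  # 3 of 4 replicas, as described above


@dataclass
class Replica:
    name: str
    healthy: bool = True

    def append_to_log(self, entry: str) -> bool:
        """Pretend to persist the entry; an unhealthy replica never acks."""
        return self.healthy


def make_replica_set() -> list[Replica]:
    """Build one primary plus three secondaries."""
    names = ["primary", "secondary-1", "secondary-2", "secondary-3"]
    return [Replica(name) for name in names]


def replicate_write(replicas: list[Replica], entry: str) -> bool:
    """Return True if enough replicas acked for the write to commit."""
    acks = sum(1 for replica in replicas if replica.append_to_log(entry))
    return acks >= WRITE_QUORUM


if __name__ == "__main__":
    replica_set = make_replica_set()
    print("all healthy:", replicate_write(replica_set, "doc-1"))
    replica_set[3].healthy = False
    print("one down:", replicate_write(replica_set, "doc-2"))
    replica_set[2].healthy = False
    print("two down:", replicate_write(replica_set, "doc-3"))
```

With one replica down the remaining three still form a quorum; with two down only two acks arrive, so the write cannot commit until a replica recovers.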

Layer 4: Storage Engine

Inside each replica is the actual on-disk data. This is the Bw-tree we discussed in F2:

▸ Per-container Bw-tree for the data.
▸ Per-indexed-path Bw-tree for the index.
▸ WAL on local SSD for durability before commit.
▸ Page cache in RAM for hot pages.

When a read arrives, the storage engine:

1. Looks up the document in the Bw-tree (1–2 page reads, often cache hits).
2. Applies any pending deltas (lock-free read of the chain).
3. Returns the materialized document.

When a write arrives (after consensus has committed it):

1. Appends the WAL entry to a local commit log.
2. Adds a delta to the Bw-tree’s mapping table via CAS.
3. Updates each affected index’s Bw-tree similarly.
4. Acks “durable” back up to the partition manager.
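The delta-append write and the chain-walking read described above can be sketched with a heavily simplified, single-threaded model. The class and function names (`BasePage`, `Delta`, `MappingTable`, `write`, `read`) are invented for the illustration, and `compare_and_swap` is a plain-Python stand-in for an atomic CAS instruction, so this shows the shape of the technique rather than the real engine.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class BasePage:
    records: dict  # key -> document, the consolidated state of the page


@dataclass(frozen=True)
class Delta:
    key: str
    value: Optional[dict]            # None marks a delete
    next: Union["Delta", BasePage]   # older state this delta sits on top of


class MappingTable:
    """Maps a logical page id to the current head of its delta chain."""

    def __init__(self) -> None:
        self._slots: dict = {}

    def install(self, page_id: int, page: BasePage) -> None:
        self._slots[page_id] = page

    def head(self, page_id: int) -> Union[Delta, BasePage]:
        return self._slots[page_id]

    def compare_and_swap(self, page_id: int, expected, new) -> bool:
        """Swap only if the slot still points at `expected` (stand-in for CAS)."""
        if self._slots[page_id] is not expected:
            return False
        self._slots[page_id] = new
        return True


def write(table: MappingTable, page_id: int, key: str, value: Optional[dict]) -> None:
    """Prepend a delta to the chain, retrying if another writer got there first."""
    while True:
        head = table.head(page_id)
        if table.compare_and_swap(page_id, head, Delta(key, value, head)):
            return


def read(table: MappingTable, page_id: int, key: str) -> Optional[dict]:
    """Walk the chain newest-first; the first delta for the key wins."""
    node = table.head(page_id)
    while isinstance(node, Delta):
        if node.key == key:
            return node.value
        node = node.next
    return node.records.get(key)
```

Because writes only prepend, existing chain nodes are never modified, which is what lets readers walk the chain without locks. The cost is that long chains make reads slower until the chain is merged into a fresh base page.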

The storage engine is also where indexing policies apply (V07) — every JSON path in the indexing policy is its own Bw-tree, and every write ripples through all of them.

Reading the layer in error messages

When something goes wrong, you can usually tell which layer is to blame from the symptoms:

Symptom	Likely layer
401 / 403	Layer 1 — auth/RBAC
Connection timeout to `*.documents.azure.com`	Layer 1 — gateway DNS or firewall
410 GoneException with partition-key-range mismatch	Layer 2 — stale routing cache, SDK will refresh
503 ServiceUnavailable that recovers in seconds	Layer 3 — replica failover
429 TooManyRequests	Layer 3 — partition exceeding RU/s allocation
Slow queries on specific paths only	Layer 4 — index missing or unselective
Occasional latency spikes for specific keys	Layer 4 — Bw-tree page consolidation

The SDK exposes diagnostic info that pinpoints the layer (CosmosDiagnostics in .NET, getDiagnostics() in Java, the diagnostics property on responses in Node), and logging it for slow requests is one of the highest-leverage habits an experienced Cosmos engineer develops.
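The status-code rows of the symptom table can be turned into a small triage helper for log processing. This is a sketch under the assumption that you already have the HTTP status code from the failed request; `LAYER_BY_STATUS` and `likely_layer` are hypothetical names, and the mapping simply restates the table rather than anything the SDK provides.

```python
# Status code -> most likely layer, restating the symptom table above.
LAYER_BY_STATUS = {
    401: "Layer 1: Gateway (auth)",
    403: "Layer 1: Gateway (auth/RBAC)",
    410: "Layer 2: Federation (stale routing cache, SDK will refresh)",
    429: "Layer 3: Partition Manager (partition exceeding RU/s allocation)",
    503: "Layer 3: Partition Manager (replica failover)",
}


def likely_layer(status_code: int) -> str:
    """Return a first guess at the layer to investigate for a status code."""
    return LAYER_BY_STATUS.get(status_code, "unknown: inspect the SDK diagnostics")


if __name__ == "__main__":
    for code in (401, 410, 429, 503, 500):
        print(code, "->", likely_layer(code))
```

The Layer 4 symptoms in the table (slow paths, latency spikes for specific keys) do not show up as status codes, so they still need the diagnostics output and query metrics.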

What changes for multi-region

In a multi-region account, each region has its own full stack: its own Gateway, Federation, Partition Manager, and Storage. Replication across regions is asynchronous for every consistency level except strong, which replicates synchronously.

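The SDK’s preferred-regions setting (PreferredLocations in older SDKs, ApplicationPreferredRegions in the .NET v3 SDK) determines which region’s Gateway you talk to first. Reads can be local; writes go to the single write region or, on accounts with multi-region writes (multi-master), to the closest region.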
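The region-selection rule in the paragraph above can be sketched as a pure function. The function `pick_region` and its parameters are hypothetical, and the order of the preferred list is used as a stand-in for "closest"; the real SDKs have their own endpoint-selection and failover logic.

```python
from typing import Sequence


def pick_region(
    preferred: Sequence[str],
    available: Sequence[str],
    write_region: str,
    is_write: bool,
    multi_region_writes: bool,
) -> str:
    """Pick the region an operation should be sent to (illustrative only)."""
    if is_write and not multi_region_writes:
        # Single write region: every write goes there, wherever the app runs.
        return write_region
    # Reads, and writes on multi-region-write accounts, use the first
    # preferred region that is currently available.
    for region in preferred:
        if region in available:
            return region
    # No preferred region is reachable: fall back to the write region.
    return write_region


if __name__ == "__main__":
    preferred = ["West Europe", "North Europe"]
    available = ["North Europe", "East US"]
    print(pick_region(preferred, available, "East US", is_write=False, multi_region_writes=False))
    print(pick_region(preferred, available, "East US", is_write=True, multi_region_writes=False))
```

Ordering the preferred list by proximity to where the app runs is what keeps reads local in this model.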

V12 (Global Distribution) drills into the cross-region propagation in detail.

Where this connects

▸ F1 Replication: Layer 3 (the replica set) is where consensus and replication happen. F1 explains how the 4 replicas stay in sync.
▸ F2 Storage Engines: Layer 4 is the Bw-tree. F2 explains why it’s an append-friendly tree.
▸ F3 Partitioning: Layer 2’s routing table is built on the hash-range scheme F3 describes.
▸ V08 SDK Best Practices: most production tuning happens in the SDK config that governs how you reach Layer 1 (singleton client, connection mode, retry policy).
▸ V12 Global Distribution: what cross-region propagation looks like, and the trade-offs.

Once the four-layer model is in your head, every Cosmos issue points at a specific layer. That’s the entire point of building this mental model — diagnosis becomes mechanical.

🎯 Common questions

Q1. What's the difference between Gateway mode and Direct mode in the SDK?

Gateway mode — every request goes through the regional gateway service. Pros — works through corporate firewalls (only HTTPS needed), simpler debugging. Cons — adds 5–10 ms per request, gateway can become a bottleneck under high load. Direct mode — the SDK opens TCP connections directly to the partition replicas it needs. Pros — lowest latency, best throughput. Cons — needs outbound TCP on ports 10000–20000. Use Direct in production unless your network forbids it.

Q2. Where does authentication happen?

At the Gateway (Layer 1). The SDK signs each request with the master key (or sends an AAD token); the Gateway validates it, checks RBAC permissions if applicable, and either routes the request or rejects it with 401/403. Direct mode requests are still authenticated; the SDK bootstraps through the Gateway before it opens connections to the replicas.

Q3. What does "federation" actually do?

A federation is a cluster of physical machines hosting many partitions for many accounts. The federation layer's job is to know which physical partition each container's data lives on, and to coordinate splits/merges/migrations as load shifts. From your point of view, it's the meta-routing — it answers "for partition-key X in container Y, which replica set should I talk to?". Cached in the SDK; rare to hit on the hot path.

Q4. How does a write actually get persisted?

(1) The SDK sends the write to the primary replica over Direct mode TCP. (2) The primary appends it to its local WAL (write-ahead log) on SSD. (3) The primary replicates the WAL entry to the secondaries via the Paxos-derived protocol. (4) Once a quorum (3 of 4) confirms, the primary commits the write to its Bw-tree. (5) The primary acks back to the SDK. Total: typically 2–5 ms within a region. Multi-region writes add cross-region propagation, which is async by default.

Q5. What's the role of the Cosmos client SDK?

More than you might think. The SDK caches the routing table (which partition is on which machine), implements connection pooling for Direct mode, applies retry-with-backoff on 429/503, and routes queries to the right partition based on the partition-key in the query. Treat the CosmosClient as a singleton — instantiating one is expensive (it has to bootstrap the routing cache), and you never want a connection storm at startup.
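To make the retry-with-backoff idea concrete, here is a minimal sketch of the general technique (exponential backoff with jitter on 429/503). It is not the SDK's actual retry policy: `RequestFailed`, `call_with_backoff`, and the default delays are all hypothetical, and a production policy would typically also honor any retry-after hint the server returns with a 429.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
RETRYABLE_STATUS_CODES = {429, 503}


class RequestFailed(Exception):
    """Hypothetical error type carrying the HTTP status of a failed request."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"request failed with status {status_code}")
        self.status_code = status_code


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying retryable failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except RequestFailed as error:
            is_last_attempt = attempt == max_attempts - 1
            if error.status_code not in RETRYABLE_STATUS_CODES or is_last_attempt:
                raise
            # Double the delay on each attempt and add jitter so that many
            # clients do not retry in lockstep.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable: the loop either returns or raises")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; non-retryable errors such as 403 are re-raised immediately.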

Q6. Why does my first query after an idle period take 100+ ms?

TCP connection re-establishment. Direct-mode connections can be closed after sitting idle for some minutes (the exact timeout depends on the SDK, its configuration, and any network devices in between); the next request then has to redo the TCP handshake, TLS, and routing-cache validation. Workarounds: raise the idle timeout (for example `IdleTcpConnectionTimeout` on `CosmosClientOptions` in the .NET v3 SDK) or send periodic warm-up reads. The fix is in the SDK config, not the database.