Partitioning Internals — Cosmos DB

TL;DR

Partitioning is how a distributed database splits one logical container across many physical machines. The three classic strategies (range, hash, and directory) make different trade-offs around uniformity, range queries, and rebalancing. Cosmos DB combines hash partitioning (for distribution) with range partitioning (for splittability) in a hybrid scheme called hash-range.

Key takeaways

▸Range partitioning — keys split into contiguous ranges. Great for range scans, terrible for distribution if writes cluster (e.g., timestamps).
▸Hash partitioning — keys hashed to a uniform space. Great for distribution, terrible for range scans (consecutive keys land on different partitions).
▸Directory partitioning — explicit map of key → partition. Maximally flexible, requires a metadata service that itself must be distributed and replicated.
▸Hash-range (Cosmos's choice) — hash to a 64-bit space, then range-partition that space into hash-buckets. Get hash distribution + the ability to split a range when one bucket grows.
▸Rebalancing is the operational pain. When you add or remove a node, some partitions must move. The good schemes (consistent hashing, hash-range) move only a small fraction of data.

Partitioning is the mechanism that lets one logical database span dozens or thousands of machines. It’s the difference between a database that hits a write ceiling at ~50,000 ops/sec on a single node and one that scales linearly to millions. Every distributed database has a partitioning scheme; the differences between them explain a lot about how they feel in production.

This lesson goes a level below the practical V02 Partitioning lesson — into the strategies themselves and how Cosmos’s specific scheme works.

The three classical strategies

1. Range partitioning

Sort all keys into a global order. Split the range into contiguous chunks; each chunk is one partition.

Keys: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

       Partition 1   Partition 2   Partition 3   Partition 4
       [A → F]       [G → L]       [M → R]       [S → Z]

Pros

Range scans are trivial — WHERE name BETWEEN 'C' AND 'F' hits exactly one partition.
Splits are easy — divide the range in half.

Cons

Hotspots appear whenever writes cluster in key order. If keys are timestamps, all of today’s writes hit one partition. If keys are sequential IDs, all new writes hit the highest partition.
Skew is hard to fix. A long-tail customer or a viral event reshapes the distribution.

Used by — HBase, BigTable, Spanner. They mitigate hotspots with techniques like salting and pre-splitting.
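
As a minimal sketch of the lookup described above, assuming the four letter ranges from the diagram and routing on the first letter of the key only, a sorted list of upper bounds plus a binary search is enough. The names `UPPER_BOUNDS`, `range_partition`, and `partitions_for_scan` are illustrative, not taken from any real system.

```python
from bisect import bisect_left

# Inclusive upper bound of each partition's key range, in sorted order.
# Mirrors the diagram: [A-F], [G-L], [M-R], [S-Z].
UPPER_BOUNDS = ["F", "L", "R", "Z"]


def range_partition(key: str) -> int:
    """Return the 1-based partition whose range contains the key's first letter."""
    letter = key[0].upper()
    if not "A" <= letter <= "Z":
        raise ValueError(f"key {key!r} is outside the partitioned range")
    return bisect_left(UPPER_BOUNDS, letter) + 1


def partitions_for_scan(low: str, high: str) -> list[int]:
    """Partitions a BETWEEN low AND high scan must touch (always contiguous)."""
    return list(range(range_partition(low), range_partition(high) + 1))


print(partitions_for_scan("C", "F"))  # [1]
print(partitions_for_scan("E", "N"))  # [1, 2, 3]
```

A scan only ever touches a contiguous run of partitions, which is why range scans are cheap here. The same property is what sends every key with the same prefix to the same place.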

2. Hash partitioning

Apply a hash function (FNV, MD5, MurmurHash, etc.) to each key. Map the resulting hash space to N partitions, typically by partition = hash(key) mod N.

hash("user-42") = 0x8a3f...
                       │
                       ▼
                  partition 3 of 8

Pros

Uniform distribution of keys across partitions, independent of the key’s natural ordering.
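Hot partitions are rare unless one key is itself hot. Hashing cannot spread a single key across partitions, so that case needs a change to the key itself (for example, adding a suffix so the load splits across several keys).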

Cons

Range scans are catastrophic — consecutive keys hash to different partitions, so a range scan must hit every partition.
Naive mod N rebalancing is brutal — adding a node shifts almost every key’s hash assignment.

Used naively by — early Memcached, application-layer sharding.
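
To make the rebalancing cost concrete, here is a small sketch that measures how many keys change partition when a `mod N` cluster grows from 8 to 9 nodes. It assumes MD5 truncated to 64 bits as the hash (any uniform hash would do) and synthetic keys of the form `user-<i>`; both are choices made for this example only.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def mod_partition(key: str, n: int) -> int:
    """Naive hash partitioning: partition = hash(key) mod N."""
    return stable_hash(key) % n


keys = [f"user-{i}" for i in range(10_000)]

# A key stays put only if hash mod 8 happens to equal hash mod 9.
moved = sum(mod_partition(k, 8) != mod_partition(k, 9) for k in keys)
print(f"{moved / len(keys):.0%} of keys change partition going from 8 to 9 nodes")
```

For a uniform hash, a key keeps its assignment only when `hash mod 72` falls in 0..7, so about 8 in every 9 keys move. That is the "brutal" part: one extra node relocates almost the whole data set.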

3. Directory partitioning

Maintain an explicit lookup table — partition_for(key) returns the partition. The table itself is stored in a coordinator service.

Pros

Maximum flexibility. Any key can go anywhere; load can be balanced surgically.
Perfect rebalancing — move one key, update one map entry.

Cons

The directory itself is a distributed system problem. It must be replicated, consistent, and survive failures. ZooKeeper and etcd both arose partly to solve this.
Lookup overhead on every read/write — usually cached aggressively.

Used by — early Spanner (Bigtable’s directory), Vitess (key-range mappings), some HDFS variants.
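
The directory idea can be sketched in a few lines, assuming a single in-memory dictionary stands in for the coordinator service. The class name `PartitionDirectory`, its methods, and the tenant names are hypothetical. A real directory would be replicated and would also have to copy the data itself, which this sketch leaves out.

```python
class PartitionDirectory:
    """In-memory stand-in for the coordinator service's lookup table."""

    def __init__(self) -> None:
        self._location: dict[str, str] = {}

    def assign(self, key: str, partition: str) -> None:
        """Place a key on a partition for the first time."""
        self._location[key] = partition

    def partition_for(self, key: str) -> str:
        """Look up a key's partition; KeyError means the key was never placed."""
        return self._location[key]

    def move(self, key: str, new_partition: str) -> str:
        """Rebalance one key by updating one map entry; returns the old partition."""
        old = self._location[key]
        self._location[key] = new_partition
        return old


directory = PartitionDirectory()
for tenant in ["acme", "globex", "initech"]:
    directory.assign(tenant, "partition-1")

# Surgical rebalancing: only the hot tenant moves.
directory.move("acme", "partition-2")
print(directory.partition_for("acme"))    # partition-2
print(directory.partition_for("globex"))  # partition-1
```

The flexibility comes from the fact that placement is data, not a formula. The cost is that every read and write now depends on this table being available and correct.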

Consistent hashing: the third path

Consistent hashing fixes hash partitioning’s biggest weakness — the rebalance pain. Instead of mod N, both keys and nodes are placed on a circular hash space. A key belongs to the next node clockwise.

               Node A (hash = 0x10)
              ╱                    ╲
   keys 0xC0...0x10           keys 0x10...0x40
            ╱                        ╲
   Node D (0xC0)                Node B (0x40)
            ╲                        ╱
   keys 0x80...0xC0           keys 0x40...0x80
              ╲                    ╱
               Node C (0x80)

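When you add a node, you place it somewhere on the ring; only the keys between the new node and its counter-clockwise predecessor move, because the new node is now the first one clockwise from them. When you remove a node, only its keys move, and they go to its clockwise successor. Expected data movement: roughly 1/N of the keys, for N nodes.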

The polish — virtual nodes. Each physical node owns many small ring positions (typically 100–256), so failures spread the orphaned load across many surviving nodes rather than dumping it all on one neighbor.


Used by — Amazon DynamoDB, Cassandra, Riak, memcached clients (with vnodes), Akka clusters.
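
The ring and its virtual nodes can be sketched as a sorted list of positions searched with `bisect`. This is a minimal illustration under stated assumptions: MD5 truncated to 64 bits as the hash, vnode positions derived from the hypothetical label `"<node>#<i>"`, and 100 vnodes per node. It is not how any particular product implements its ring.

```python
import hashlib
from bisect import bisect_left


def ring_hash(value: str) -> int:
    """Deterministic 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` positions for this node on the ring."""
        self._ring.extend((ring_hash(f"{node}#{i}"), node) for i in range(self.vnodes))
        self._ring.sort()

    def node_for(self, key: str) -> str:
        """Owner is the first ring position at or after the key's hash, wrapping around."""
        index = bisect_left(self._ring, (ring_hash(key), ""))
        return self._ring[index % len(self._ring)][1]


keys = [f"user-{i}" for i in range(10_000)]
ring = HashRing(["A", "B", "C", "D"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("E")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved")
print(all(after[k] == "E" for k in moved))  # True: every moved key went to the new node
```

With five nodes the moved fraction should land near one fifth, and every key that moves goes to the new node. Because node E owns many small arcs, those keys are taken from all four existing nodes rather than from a single neighbor.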

Hash-range: Cosmos’s hybrid

Cosmos uses a scheme that combines hashing’s distribution with range partitioning’s splittability:

Hash the partition-key value into a 64-bit hash space.
Divide that hash space into contiguous hash ranges — initially ~256 ranges.
Map each hash range to a physical partition.
When a hash range outgrows its physical partition (50 GB or 10K RU/s), split the range in half.

Hash space: [00..00 ──── 7F..FF ──── FF..FF]

           Range 1            Range 2          ← logical, fixed bucket count
           ↓ assigned to ↓
       Physical Part 1    Physical Part 2     ← physical, grows by splitting

After Range 1 grows past its limit:

           [Range 1a] [Range 1b]   Range 2
              ↓          ↓            ↓
           Phys 1    Phys 1.5      Phys 2

Why this works

Hashing gives uniform distribution across the 256 hash ranges (no skew unless one partition-key value is itself hot).
Range partitioning of the hash space means we can split when needed — half the keys move to a fresh physical partition.
Routing is just a binary search through the (small) hash-range table.
Adding capacity is incremental — you don’t re-hash anything; you split a single range.
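
The routing and split behavior listed above can be sketched with a sorted table of range start points. This is an illustration of the hash-range idea only, not Cosmos DB's actual implementation: the hash function (MD5 truncated to 64 bits), the class `HashRangeTable`, the partition names, and the starting count of 4 ranges are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 64


def key_hash(partition_key: str) -> int:
    """Hash a partition-key value into the 64-bit hash space."""
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest()[:8], "big")


class HashRangeTable:
    def __init__(self, initial_ranges: int) -> None:
        step = HASH_SPACE // initial_ranges
        # starts[i] is the inclusive lower bound of range i; it ends where range i+1 starts.
        self.starts = [i * step for i in range(initial_ranges)]
        self.partitions = [f"phys-{i}" for i in range(initial_ranges)]

    def route(self, partition_key: str) -> str:
        """Binary search for the range whose lower bound is at or below the hash."""
        index = bisect_right(self.starts, key_hash(partition_key)) - 1
        return self.partitions[index]

    def split(self, index: int, new_partition: str) -> None:
        """Split range `index` at its midpoint; the upper half goes to a new partition."""
        low = self.starts[index]
        high = self.starts[index + 1] if index + 1 < len(self.starts) else HASH_SPACE
        self.starts.insert(index + 1, (low + high) // 2)
        self.partitions.insert(index + 1, new_partition)


keys = [f"tenant-{i}" for i in range(10_000)]
table = HashRangeTable(initial_ranges=4)
before = {k: table.route(k) for k in keys}

table.split(1, "phys-1b")
after = {k: table.route(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print({before[k] for k in moved})  # {'phys-1'}: only the split range lost keys
print({after[k] for k in moved})   # {'phys-1b'}: and they all went to the new partition
```

No key is re-hashed during the split. The only change is one new entry in the routing table, and roughly half of one partition's keys change owner.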

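A logical partition (all items that share one partition-key value) has its own limit of 20 GB, and that limit is independent of physical-partition splits. A single partition-key value (say tenantId = "ACME") can never span multiple physical partitions, so a split cannot relieve it, and its throughput can never exceed the 10K RU/s of the physical partition that hosts it. Once ACME hits 20 GB, writes for ACME start failing. That’s why a power-law tenant in V02 is so dangerous.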

Rebalancing: the deciding factor

When you compare partitioning schemes, the question isn’t usually “which is fastest” — they’re all fast enough. The question is what happens when the cluster changes.

Scheme	Add a node	Remove a node	Rebalance hot keys
Naive `mod N` hash	rehash everything	rehash everything	impossible
Range	move boundary keys	reclaim range	impossible without explicit splits
Consistent hashing	move 1/N keys	move dead node’s keys	hard, needs vnode redistribution
Directory	move 1 key, 1 map entry	move dead node’s keys	trivial
Hash-range	split a single range, no other movement	merge ranges	split a single range

Cosmos’s hash-range wins on practical operability. Splits are cheap and localized. Adding capacity is just splitting more ranges. Removing capacity is merging — also localized.

What this teaches about other systems

Look at any distributed database and you can decode it via these primitives:

DynamoDB — consistent hashing with vnodes. Adding/removing capacity moves a small slice.
Cassandra — same as DynamoDB. The Dynamo paper inspired both.
HBase / BigTable — pure range partitioning. Pre-split or salt to avoid hotspots.
Spanner — range partitioning with TrueTime + automatic splits. Explicit directory above.
Cosmos DB — hash-range. Compare Spanner’s internal layout, which is range-with-explicit-splits.
MongoDB — supports both hash and range; you choose.

The takeaway — partitioning isn’t a black box. The scheme determines how the system feels under growth, and once you can name the scheme, you can predict the behavior.

Where this connects

V02 Partitioning — the practical lesson. Once you know the engine works on hash-range, V02’s rules (high cardinality, even distribution, query alignment) all make mechanical sense.
Consistent hashing visualization — a related-but-different scheme used by Cassandra and DynamoDB.
F4 Cosmos Architecture — where the partition manager and routing table live in the broader system.
F1 Replication — each physical partition is a 4-replica set. Partitioning + replication = the spine of any distributed database.

Pick a partition scheme like you pick a database. Most of your operational pain over the next five years is determined by what happens here.

🎯 Common questions

Q1. Why is hashing the partition key better than just hashing the document?

You want all documents that share a partition-key value to land on the same partition (so transactions and queries within that key stay local). If you hashed the whole document, two docs with the same `userId` would hash to different physical partitions, defeating the locality you want. Hashing only the partition key gives you both — uniform distribution across keys, locality within a key.
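
A quick way to see the difference is to hash 100 documents for the same user both ways and count how many partitions they land on. This sketch assumes 8 partitions, MD5 as the hash, and a made-up document shape with `userId` and `orderId` fields.

```python
import hashlib
import json

PARTITIONS = 8


def bucket(text: str) -> int:
    """Deterministic hash of a string, reduced to a partition number."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PARTITIONS


def partition_by_key(doc: dict) -> int:
    """Hash only the partition-key value."""
    return bucket(doc["userId"])


def partition_by_document(doc: dict) -> int:
    """Hash the whole serialized document."""
    return bucket(json.dumps(doc, sort_keys=True))


docs = [{"userId": "user-42", "orderId": i} for i in range(100)]

by_key = {partition_by_key(d) for d in docs}
by_doc = {partition_by_document(d) for d in docs}

print(len(by_key))  # 1: every document for user-42 is co-located
print(len(by_doc))  # more than 1: the same user's documents are scattered
```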

Q2. What's consistent hashing and is that what Cosmos uses?

Consistent hashing maps both keys and nodes onto a circular hash space. A key belongs to the next node clockwise. Adding/removing a node only moves keys between adjacent nodes — typically 1/N of the data. Cosmos's hash-range scheme is *similar in spirit* — it splits a hash space into ranges — but uses range-bounded buckets rather than a continuous ring. Both achieve the same goal — minimal data movement on rebalance. There's a [visualization of consistent hashing here](/concepts/consistent-hashing).

Q3. How does Cosmos handle a partition that grows past 50 GB?

It splits. The partition's hash range is divided in half, half the data moves to a new physical partition, and routing tables update transparently. The split takes seconds-to-minutes depending on data size; reads and writes continue throughout (briefly served by both old and new partitions during the cutover). You see no operational signal — the only hint is that you might suddenly see better RU/s headroom because the workload now spans more partitions.

Q4. Can I see which partition my data is on?

Indirectly — the SDK exposes a `PartitionKeyRange` for each query, and you can list ranges via the management API. But you generally shouldn't rely on this. Cosmos can split or merge ranges at any time, and your code should be agnostic. The only exception is parallel readers (e.g., the Change Feed Processor) that explicitly distribute work across ranges.

Q5. What's "fan-out" and why does it matter?

A query that doesn't include the partition key has to be sent to *every* partition. Each partition runs the query independently and returns matches; the gateway merges. With 4 partitions, that's 4× the I/O. With 100 partitions, it's 100× — and one slow partition can drag the whole query's tail latency. Fan-out is the architectural reason cross-partition queries cost 10–30× a single-partition query in RUs.
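
The routing decision and the tail-latency effect can be sketched as follows. This is a simplified model, not the Cosmos DB gateway: it assumes 4 partitions, a hypothetical partition-key field `tenantId`, plain `mod N` placement instead of hash ranges, and made-up per-partition latencies.

```python
import hashlib

PARTITION_COUNT = 4
PARTITION_KEY_FIELD = "tenantId"  # hypothetical partition-key path


def partition_of(value: str) -> int:
    """Simplified placement: hash of the partition-key value mod partition count."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PARTITION_COUNT


def target_partitions(query_filter: dict[str, str]) -> list[int]:
    """Partitions a query must visit, given its equality filters."""
    if PARTITION_KEY_FIELD in query_filter:
        return [partition_of(query_filter[PARTITION_KEY_FIELD])]
    return list(range(PARTITION_COUNT))  # fan-out: no key, so ask every partition


def query_latency_ms(targets: list[int], latency_by_partition: list[float]) -> float:
    """A parallel fan-out finishes only when the slowest target responds."""
    return max(latency_by_partition[p] for p in targets)


latencies = [5.0, 6.0, 4.0, 90.0]  # made-up numbers; partition 3 is slow

single = target_partitions({"tenantId": "ACME", "status": "open"})
fan_out = target_partitions({"status": "open"})

print(len(single))                           # 1
print(fan_out)                               # [0, 1, 2, 3]
print(query_latency_ms(fan_out, latencies))  # 90.0
```

The fan-out query pays for work on all four partitions and inherits the latency of the slowest one, while the keyed query touches one partition and is unaffected by the others.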