Partitioning: The #1 Decision — Cosmos DB

TL;DR

One bad partition-key choice multiplies your bill by 10× and forces a full container migration to fix. Pick a key with high cardinality, even distribution, and that aligns with your most frequent query — get all three right and Cosmos scales horizontally with no further thought.

Key takeaways

▸Partition keys are immutable. You cannot ALTER PARTITION KEY. Changing it means a new container, copy every doc, dual-write or downtime.
▸A good key has three properties — high cardinality, even distribution of writes/storage, alignment with the most frequent query. Drop any one and you'll feel it in production.
▸Logical partitions are limited to 20 GB and 10,000 RU/s. Hit either ceiling on a single partition-key value and writes start failing.
▸Cross-partition queries fan out and can cost 10–30× a single-partition query. Your hottest query path must include the partition key.
▸When a natural key doesn't fit, synthesize one — concatenate two fields, append a random 0–N suffix for write-heavy ingestion, or hash for maximally even distribution.

If Cosmos has one decision that compounds — the kind where getting it wrong forces a migration six months in — it’s the partition key. Everything else (consistency level, indexing, throughput) you can change later. The partition key is set at container creation and never changes. So this lesson matters more than the order it appears in.

Logical vs. physical partitions

Cosmos splits one container’s data across many physical machines, but it abstracts that with a two-layer model:

A logical partition is the bucket of all documents that share the same partition-key value. If your key is /tenantId, then every doc with tenantId = "ACME" lives in one logical partition. Limit — 20 GB and 10,000 RU/s.
A physical partition is the actual hardware (a replica set running on Microsoft’s compute fleet). Each physical partition holds many logical partitions, up to its own scaling limit. Cosmos automatically splits a physical partition when it grows — you never see this happen, you just see your data keep working.

A hash function maps the partition-key value to one of ~256 hash buckets, and those buckets in turn map onto physical partitions. You only choose the partition key; everything else is automatic.

The three rules of a good partition key

1. High cardinality

You need many distinct values, ideally orders of magnitude more than your physical-partition count. A /status field with three values ('pending', 'shipped', 'cancelled') puts every document into one of three logical partitions — guaranteed pile-up.

A natural high-cardinality key is something like:

/userId, /tenantId, /orderId, /sessionId
/eventId for event sourcing
/deviceId for IoT

2. Even distribution

High cardinality alone isn’t enough — the values also have to be uniformly distributed. /customerId looks great until you discover one enterprise customer accounts for 80 % of your data. That customer’s logical partition will throttle while everyone else’s sits idle.

Watch out for:

Power-law distributions — your largest customer is 1000× the median (B2B almost always)
Time-based skew — /createdDate puts today’s traffic on a single partition
Sequential IDs — auto-increment integers cluster by hash

The fix is usually a synthetic key (see below).

3. Query alignment

Your most-frequent query has to include the partition key in its WHERE clause. If it doesn’t, the query becomes cross-partition — a fan-out across every physical partition followed by a result merge. A cross-partition query reading 10 docs costs 5–10× more RUs than the same query against one partition.

A common trap — picking /userId because it’s high-cardinality and even, but your hottest query is WHERE orderId = '...'. Now every read is cross-partition. The cost on day one is invisible at low traffic. By the time it’s loud, you’re paying 30× what you should.

Hot partitions in practice

Here’s the canonical failure mode:

You provision 10,000 RU/s for a container with 4 physical partitions. Cosmos splits that as 2,500 RU/s per partition.

80 % of your traffic hits the partition holding /region = "NYC". That partition needs 8,000 RU/s but only has 2,500. Result — HTTP 429 throttling on the NYC region while three other partitions sit at 5 % utilization.

Provisioning more RU/s does nothing — the bottleneck is the per-partition cap.

The only fix is to repartition with a key that spreads writes evenly. If you discover this on day 90, you’re looking at a multi-week migration.

Synthetic keys: when natural keys fail

Three patterns worth knowing:

Concatenation

partitionKey = `${tenantId}_${date}`

Splits one tenant’s data across days. Reads by tenantId alone become cross-partition, so the trade-off is real — but it works when one tenant otherwise dominates.

Random suffix

partitionKey = `${userId}_${Math.floor(Math.random() * 10)}`

Spreads writes for one user across 10 buckets. Use for write-heavy workloads (telemetry ingestion, sensor data) where reads are rare or batch. Reads now have to fan out across the 10 buckets to assemble all of one user’s data.

Hash prefix

partitionKey = `${sha1(userId).slice(0, 4)}_${userId}`

A 4-character prefix gives you 16⁴ = 65,536 buckets — maximally even. The user is still recoverable from the suffix. Best when you genuinely have no natural distribution and need engineered uniformity.

Hierarchical partition keys

Newer Cosmos containers support hierarchical keys — up to three nested levels:

/tenantId
/tenantId, /userId
/tenantId, /userId, /sessionId

A query that includes tenantId is efficient. A query that includes tenantId and userId is more efficient (smaller scan). A query that only filters by userId is not efficient (it fans out).

Hierarchical keys are great when you have a clear “tenant → user → session” type hierarchy in your access pattern. The catch — they must be defined at container creation; you can’t add them to an existing container.

How to actually pick

A simple decision tree:

Pick the field that appears in your most frequent query’s WHERE clause. If multiple — pick the one with the highest cardinality.
Plot the distribution. If a single value would hold > 5 % of your data or traffic, you have skew. Synthesize.
Check the long-run forecast. Will any single value approach 20 GB in two years? If yes, synthesize a date or bucket suffix now.
Pressure-test cross-partition. What % of your queries don’t include the key? If > 10 %, reconsider.

What’s coming next

Lesson V03 (Data Modeling) — once you have a partition key, decide what to embed vs reference inside that partition.
Lesson V06 (Querying) — how to write queries that stay single-partition, and what to do when you can’t.
Lesson V07 (Indexing) — opt out of indexed paths to cut write costs.
Lesson V09 (Request Units) — the math behind the RU costs you’re paying or wasting.

Pick a partition key like you pick a database. You’re stuck with it.

🎯 Common questions

Q1. I picked the wrong partition key. What now? ▾

There's no in-place fix. You create a new container with the right key, run a one-time copy job (Change Feed Processor or Spark connector), update your app to write to both during cutover, then switch reads. Plan for downtime or dual-write complexity. The whole reason this lesson exists is to make sure you don't end up here.

Q2. Can I have multiple partition keys? ▾

Yes — Cosmos supports **hierarchical partition keys** (up to 3 levels deep, e.g. `/tenantId/userId/sessionId`). Queries by tenant, by tenant+user, or by all three are efficient; queries by user alone are not. Caveat — hierarchical keys must be set at container creation; you can't retrofit an existing container.

Q3. How does Cosmos avoid hot partitions when one customer is 100× bigger than the rest? ▾

It doesn't — that's on you. The classic fix is a **synthetic key** like `${customerId}_${month}` or `${customerId}_${randomBucket}`. The first spreads writes across months, the second across N buckets. Trade-off — queries by `customerId` alone become cross-partition.

Q4. What's the actual storage limit per partition again? ▾

20 GB per **logical** partition (per partition-key value). Cosmos auto-splits **physical** partitions transparently when they grow — that's seamless. But if a single partition-key value crosses 20 GB, writes for that value start failing with a hard error. There's no silent split for that case.

Q5. How many partition-key values is "high cardinality"? ▾

Rule of thumb — at least 100× the number of physical partitions you expect. If you'll have 10 physical partitions, you want 1,000+ distinct values. More is always better. Bad keys have 5–10 values (`/status`, `/region`, `/category`); good keys have thousands or millions (`/userId`, `/tenantId`, `/orderId`).