V3
Lesson · 5 min

Data Modeling

Embed vs. reference. Denormalization patterns for NoSQL.

TL;DR

SQL teaches you to normalize. Cosmos punishes you for it. The right model in NoSQL is the one that answers your hottest query in a single read — even if that means duplicating data across documents. Embed for 1:1 and 1:few; reference for 1:many that grows unbounded; denormalize anything you'd otherwise JOIN.

Key takeaways
  • Model around your read patterns, not your entities. List the top 3 queries first; the schema falls out of that.
  • Embed when the child is bounded, owned by the parent, and read together (e.g. an order with its line items).
  • Reference when the child grows unbounded, is shared across parents, or has its own access pattern (e.g. a user's posts).
  • Denormalize liberally. Storing the customer's name on every order is fine — disk is cheap, RUs are not.
  • Give every document type that participates in a transaction the same partition key value in the same container, or the transaction won't work.

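Coming from SQL, your instinct is to draw an ER diagram, normalize to 3NF, and call it a model. In Cosmos, that approach loses. Cosmos has no cross-document JOIN, so every relationship you would have joined becomes extra reads issued by your application: each foreign key is a separate point read (about 1 RU for a 1 KB document), and any lookup that doesn't filter on the partition key fans out across partitions. A normalized model that returns a customer plus their last 10 orders takes 11 round trips. Modeled the Cosmos way, it takes one.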
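The round-trip arithmetic can be sketched in memory. `FakeContainer` and the document shapes below are hypothetical stand-ins rather than the Cosmos SDK; the class only counts reads, and embedding order summaries in the customer document is just one possible single-read shape.

```python
class FakeContainer:
    """Hypothetical in-memory stand-in for a container; it only counts point reads."""

    def __init__(self, docs):
        self._docs = {d["id"]: d for d in docs}
        self.reads = 0

    def read_item(self, item_id):
        self.reads += 1
        return self._docs[item_id]


# Normalized: the customer document holds only foreign keys to its orders.
order_ids = [f"order-{i}" for i in range(10)]
normalized = FakeContainer(
    [{"id": "cust-1", "name": "Ada", "orderIds": order_ids}]
    + [{"id": oid, "total": 10 * i} for i, oid in enumerate(order_ids)]
)
customer = normalized.read_item("cust-1")
orders = [normalized.read_item(oid) for oid in customer["orderIds"]]

# Query-driven: the last 10 orders are embedded in the customer document.
embedded = FakeContainer([{"id": "cust-1", "name": "Ada", "recentOrders": orders}])
page = embedded.read_item("cust-1")

print(normalized.reads, embedded.reads)  # 11 1
```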

The mental shift

Stop thinking about entities. Start thinking about screens.

Pick the three highest-traffic screens in your app — the homepage, the product detail page, the checkout. For each, ask — what data do I need to render this in one shot? The answer is your document.

This is called query-driven modeling, and it’s the single biggest unlearning a SQL developer goes through.

Embed vs. reference: the cheat sheet

| Relationship | Pattern | Example |
| --- | --- | --- |
| 1:1 | Embed | A user’s preferences inside the user document |
| 1:few (bounded) | Embed | An order’s line items (rarely > 50) |
| 1:many (unbounded) | Reference | A user’s posts (could be millions) |
| Many:many | Reference + denormalize | Tags on posts, users in groups |
| Hot read path | Embed (even if it duplicates) | The author’s display name on every comment |

The unbounded case is the trap. Embedded arrays make the parent doc rewrite-heavy and eventually push past the 2 MB doc limit. Referenced collections need a second query but scale forever.
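As a rough illustration of the two patterns, the documents might look like this. All field names are assumptions chosen for the example, and `posts_for` is a hypothetical helper that stands in for the second query the reference pattern needs.

```python
# Embed: an order owns a small, bounded list of line items and is read with them.
order = {
    "id": "order-1001",
    "userId": "user-42",
    "lineItems": [
        {"sku": "A1", "name": "Mug", "qty": 2, "price": 8.50},
        {"sku": "B7", "name": "Tea", "qty": 1, "price": 4.00},
    ],
}

# Reference: each post is its own document that points back at its author,
# so the user document never grows as posts accumulate.
user = {"id": "user-42", "displayName": "Ada"}
posts = [
    {"id": f"post-{n}", "userId": user["id"], "title": f"Post {n}"}
    for n in range(3)
]


def posts_for(user_id, all_posts):
    """Stand-in for the second query that fetches a user's referenced posts."""
    return [p for p in all_posts if p["userId"] == user_id]


print(len(posts_for("user-42", posts)))  # 3
```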

Denormalize without guilt

In SQL you’d never store customerName on the Order table — it’s redundant, you can JOIN. In Cosmos, you absolutely store it. Here’s why:

  • Reads are 6× more frequent than writes in most apps. Optimize for the common case.
  • A point read of a 1 KB document costs about 1 RU. A two-doc query costs 4–8 RUs. Multiply by traffic.
  • Stale denormalized fields are usually a feature — the order should reflect the customer’s name at time of purchase, not their current name.

When something genuinely must stay in sync — say, a price change that should update all open carts — use Change Feed (lesson V11) to push the update everywhere it’s denormalized.
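The fan-out step could look roughly like the sketch below. It assumes the changed product documents have already been delivered by Change Feed and that the affected open carts have been loaded; `apply_product_changes` and the field names are hypothetical, and no SDK calls are shown.

```python
def apply_product_changes(changed_products, open_carts):
    """Copy each changed product's name and price into the cart lines that reference it.

    Returns the ids of carts that were modified and need to be written back.
    """
    by_id = {p["id"]: p for p in changed_products}
    touched = []
    for cart in open_carts:
        dirty = False
        for line in cart["lineItems"]:
            product = by_id.get(line["productId"])
            if product and (
                line["price"] != product["price"] or line["name"] != product["name"]
            ):
                line["price"] = product["price"]
                line["name"] = product["name"]
                dirty = True
        if dirty:
            touched.append(cart["id"])
    return touched


carts = [
    {"id": "cart-1", "userId": "u1",
     "lineItems": [{"productId": "p1", "name": "Mug", "price": 8.5}]},
    {"id": "cart-2", "userId": "u2",
     "lineItems": [{"productId": "p2", "name": "Tea", "price": 4.0}]},
]
touched = apply_product_changes([{"id": "p1", "name": "Mug", "price": 9.0}], carts)
print(touched)  # ['cart-1']
```

Returning only the touched cart ids keeps the write-back proportional to what actually changed, which matters when every write costs RUs.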

A worked example: e-commerce

Three screens — product page, cart, order history.

Product page — read product by id, show name, photos, price, top 5 reviews. Embed the top 5 reviews in the product doc. The full review history lives in a separate reviews container, partitioned by productId.

Cart — read cart by userId, show items with current product info. Cart doc embeds line items with denormalized name + price. Background job watches Change Feed on products and updates open carts when price changes.

Order history — list orders for a user. orders container partitioned by /userId. Each order embeds its line items (bounded, never edited after placement).

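That’s one container per screen, each with its own partition key, and every screen rendered with a single-partition read. The separate reviews container is only touched when someone pages past the top 5.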
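One way the three documents might be shaped is sketched here. The field names, the `/id` partition key on products, and ranking "top" reviews by rating are all assumptions made for the example; `add_review` is a hypothetical helper that keeps the embedded list capped.

```python
# Product page: one point read. Partition key /id is an assumption.
product = {
    "id": "p1",
    "name": "Mug",
    "price": 9.0,
    "photos": ["mug-front.jpg"],
    "topReviews": [],  # capped; full history lives in the reviews container
}

# Cart: read by userId; line items carry a denormalized name and price.
cart = {
    "id": "cart-user-42",
    "userId": "user-42",
    "lineItems": [{"productId": "p1", "name": "Mug", "price": 9.0, "qty": 2}],
}

# Order history: orders container partitioned by /userId; line items are frozen.
order = {
    "id": "order-1001",
    "userId": "user-42",
    "lineItems": [{"productId": "p1", "name": "Mug", "price": 9.0, "qty": 2}],
}


def add_review(product_doc, review, limit=5):
    """Keep only the `limit` highest-rated reviews embedded in the product document."""
    ranked = sorted(
        product_doc["topReviews"] + [review],
        key=lambda r: r["rating"],
        reverse=True,
    )
    product_doc["topReviews"] = ranked[:limit]


for rating in [3, 5, 4, 1, 2, 5]:
    add_review(product, {"rating": rating, "text": "..."})

print([r["rating"] for r in product["topReviews"]])  # [5, 5, 4, 3, 2]
```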

The transaction constraint

If two documents need to update atomically, they must live in the same container and share the same partition key value. Cosmos’s transactional batches only work within a single logical partition. So if you’ve separated users and userPreferences into different containers, you can’t update both in one transaction.

Fix — keep them in the same container with /userId as the partition key, distinguished by a type field. One container, multiple shapes, one transactional unit.
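A minimal sketch of that layout, with a pre-flight check that mirrors the single-partition rule. `check_batch` is a hypothetical helper written for this example, not an SDK call; the real constraint is enforced by Cosmos itself.

```python
def check_batch(docs, partition_key_field="userId"):
    """Raise ValueError unless every document carries the same partition key value."""
    values = {d.get(partition_key_field) for d in docs}
    if len(values) != 1 or None in values:
        raise ValueError(
            f"batch spans partition key values: {sorted(map(str, values))}"
        )
    return values.pop()


# Two shapes in one container, distinguished by `type`, sharing /userId.
user = {"id": "user-42", "type": "user", "userId": "user-42", "displayName": "Ada"}
prefs = {"id": "prefs-user-42", "type": "userPreferences", "userId": "user-42",
         "theme": "dark"}
other = {"id": "prefs-user-7", "type": "userPreferences", "userId": "user-7",
         "theme": "light"}

print(check_batch([user, prefs]))  # user-42

try:
    check_batch([user, other])
except ValueError as exc:
    print("rejected:", exc)
```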

What’s next

Lesson V04 covers consistency levels — once you know your model, you’ll want to know which knob to turn for which read. Lesson V06 deep-dives querying so you can write efficient SQL on the model you just designed.

🎯 Common questions
Q1. How big can an embedded array get before I should switch to a reference?

Two limits — the document hits 2 MB (Cosmos's hard cap), or the array starts triggering full-document rewrites on every change. Practical rule — if it grows past ~100 items or ~50 KB, switch to a referenced collection.
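The rule of thumb above can be turned into a quick check. This is only a sketch: `should_reference` is a hypothetical helper, and the serialized JSON length is an approximation of what the array contributes to the stored document.

```python
import json

MAX_ITEMS = 100              # rule-of-thumb item count from the answer above
MAX_ARRAY_BYTES = 50 * 1024  # roughly 50 KB


def should_reference(items):
    """Return True when an embedded array has outgrown the rule-of-thumb limits."""
    size = len(json.dumps(items).encode("utf-8"))
    return len(items) > MAX_ITEMS or size > MAX_ARRAY_BYTES


line_items = [{"sku": "A1", "qty": 1} for _ in range(3)]
posts = [{"n": i} for i in range(101)]

print(should_reference(line_items), should_reference(posts))  # False True
```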

Q2. Won't denormalization cause stale data?

Yes — and that's usually fine. The customer's display name on a 6-month-old order being slightly stale is a feature, not a bug (it's a snapshot). For fields that must stay in sync, use Change Feed (lesson V11) to fan updates out.

Q3. Should I keep different entity types in the same container?

Often, yes. If a user, their orders, and their addresses are always queried together by `userId`, putting them in one container with `/userId` as the partition key gives you single-partition reads and transactional batches. Use a `type` field to discriminate.

Key concepts
embed · reference · 1:1 · 1:few · 1:many · denormalization