Storage Engines: B-Tree / LSM / Bw-Tree — Cosmos DB

TL;DR

A storage engine is the layer that puts your bytes on disk and reads them back. Three families dominate — B-trees (PostgreSQL, MySQL, MongoDB) optimize random reads with in-place updates; LSM trees (Cassandra, RocksDB, LevelDB) optimize writes by appending and compacting; Bw-trees (Cosmos DB, Hekaton) get B-tree reads with lock-free, append-only writes by indirecting through a mapping table.

Key takeaways

▸B-tree — fixed-size pages, in-place updates, balanced via splits. Best for read-heavy workloads. Write amplification ~10–30× because every update rewrites a page.
▸LSM tree — writes go to a sorted in-memory buffer (memtable), flushed to immutable on-disk files (SSTables), merged in background (compaction). Best for write-heavy workloads. Read amplification because data lives in many files.
▸Bw-tree — B-tree logical structure but updates are appended as deltas. A mapping table redirects logical pages to physical chains. Lock-free, friendly to flash/NVMe, used inside Cosmos DB.
▸Write amplification is the killer metric — 1 logical write = N physical writes. B-trees pay it on every update; LSM trees amortize it via background compaction; Bw-trees minimize it by appending.
▸You don't choose your engine in Cosmos (it's Bw-tree internally), but knowing the family explains why operations cost what they cost — and why some other DBs feel different.

A storage engine is the lowest layer of a database — the part that takes a logical operation like “set key X to value Y” and turns it into actual writes to disk, then turns “read key X” back into the right bytes. Above it sits the query planner, the network protocol, the consistency layer. Below it sits the operating system and the hardware.

Three families dominate the landscape. Each makes different trade-offs, and those trade-offs explain why one database is great at writes, another at reads, another at both. Cosmos DB uses a fourth, less famous descendant — the Bw-tree.

B-Tree: balanced, page-based, classic

The B-tree (and its more common variant, B+ tree) has been the default choice since the 1970s. PostgreSQL, MySQL InnoDB, MongoDB WiredTiger, SQL Server — all B-tree under the hood.

The structure:

                    [50, 100]                       ← internal page
                   /     |     \
              [10,30] [60,80] [110,150]             ← internal pages
              / | \   / | \    / | \
            ...     ...     ...                    ← leaf pages (with data)

Fixed-size pages (typically 4–16 KB) hold sorted keys.
Internal pages route to child pages.
Leaf pages hold the actual data (or pointers to data).
Updates rewrite the page in place. If the page overflows, it splits into two. If a sibling underflows, they merge.

Strengths

Read latency is predictable — at most a handful of disk seeks (height ≈ log_B(N)).
Range scans are fast — leaves are linked, so you walk sequentially.
Reasoning is simple — it’s a tree.

Weaknesses

Write amplification. Updating a single 100-byte field requires rewriting the whole 8 KB page. With WAL + index updates + buffer-pool flushes, one logical write becomes 10–30 physical writes.
Random I/O. Modifying scattered keys means scattered page rewrites. SSDs handle this OK; spinning disks don’t.
Concurrent updates need locks. B-trees use latches (lightweight locks) on pages during writes. Under heavy contention, latches become bottlenecks.

LSM Tree: write-optimized, append-everywhere

Log-Structured Merge trees flip the assumption — instead of optimizing for reads with in-place updates, they optimize for writes by appending everywhere.

WRITE PATH                                READ PATH

write → memtable                          read → memtable + L0 + L1 + L2 + ...
         (in-mem sorted)                          (check each level)
         |
         v full →
   flush as SSTable (immutable)                Bloom filters
         |                                     prune unnecessary
         v compaction →                        SSTable lookups
   merge SSTables in background
   (level 0 → level 1 → ...)

Writes go to an in-memory memtable — a sorted in-memory tree (skip-list or similar). Single-digit-microsecond writes.
When the memtable fills, it’s flushed to disk as an immutable SSTable (Sorted String Table). Flush is a sequential write — much faster than B-tree’s random writes.
Reads check the memtable first, then each SSTable level until they find the key.
Compaction runs in the background — merging SSTables, removing obsolete versions, eliminating tombstones.

Strengths

Sequential writes — friendly to all storage media, especially SSDs and HDDs.
High write throughput — Cassandra easily hits 10–100K writes/sec/node.
Compaction is async — foreground writes don’t pay full amortization cost.

Weaknesses

Read amplification — a key might exist in 5+ SSTables. Bloom filters prune most lookups, but worst-case reads are slower than B-tree.
Compaction I/O — compaction competes with foreground reads. Tail latency spikes when compaction runs.
Space amplification — until compaction merges, the same key can have multiple obsolete copies on disk.

Used by — Cassandra, ScyllaDB, RocksDB (and everything built on it — TiKV, CockroachDB’s storage layer), LevelDB, BigTable, HBase.

Bw-Tree: B-tree logic, append-only physics

The Bw-tree, designed at Microsoft Research and used inside Cosmos DB and SQL Server’s Hekaton in-memory engine, gets the read profile of a B-tree with the write profile of an LSM tree.

The trick — immutable pages with a mapping table.

   Logical PageID  ──→  [Mapping Table]  ──→  Physical address of latest delta

   Updates append a delta to the page's chain (Compare-And-Swap on mapping):

      [Page V0] ← [Δ insert(7)] ← [Δ delete(4)] ← [Δ insert(9)]
                                                      ↑
                                                  CAS here

Every B-tree page has a logical PageID.
A mapping table translates PageID → physical address.
Updates don’t rewrite pages. They append a delta record (an insert, delete, etc.) and atomically swing the mapping-table pointer with a Compare-And-Swap.
Reads follow the chain of deltas from newest to base page, applying them in reverse-chronological order.
When chains get too long, a background consolidation rewrites the chain as a fresh page (also via CAS).

Strengths

Lock-free. No latches anywhere — concurrent updates are CAS-based. Scales linearly with CPU cores.
Flash-friendly. All writes are sequential appends. No in-place page rewrites.
B-tree reads. The logical structure is still a B-tree — point reads and range scans are efficient.

Weaknesses

Complexity. The implementation is significantly more involved than B-tree or LSM. Few systems have the engineering capacity to build one well.
Consolidation cost. Long delta chains must be consolidated, which is similar in spirit to LSM compaction.

Used by — Cosmos DB (per-partition), SQL Server Hekaton, a small handful of research systems.

Why this matters even though you can’t choose

You don’t get to pick Cosmos’s storage engine — it’s Bw-tree, end of story. But understanding the family explains so much:

Why writes cost more than reads. Bw-tree appends are cheap, but they still cost RUs because the index updates ripple to multiple delta chains. Lesson V07 (Indexing) shows how to limit that.
Why range queries on the partition key are cheap. Bw-tree leaves are linked — a range scan is a sequential walk.
Why deletes are still expensive. A delete is just another delta in the chain; the actual space isn’t reclaimed until consolidation runs.
Why the partition key matters at the storage layer. Each partition has its own Bw-tree. Cross-partition queries fan out across N independent trees.

Where this connects

F1 Replication — replication runs above the storage engine. Each replica has its own Bw-tree; consensus keeps them in sync.
F4 Cosmos Architecture — the Bw-tree sits inside each replica, below the indexing layer.
V07 Indexing — every indexed JSON path is a separate Bw-tree. Cutting unused indexes cuts write amplification.
V09 Request Units — the Bw-tree’s append-only design is why writes cost ~5–7 RUs and reads cost ~1 RU. Different engine, different ratio.

If you’re building a database from scratch, the engine choice is consequential. If you’re using a managed system like Cosmos, the engine choice has been made for you — and knowing why it was made tells you what the system will be good at and what it won’t.

🎯 Common questions

Q1. Why does Cosmos use a Bw-tree instead of a regular B-tree? ▾

Two reasons. (1) Lock-free updates — Bw-tree pages are immutable; updates are deltas appended to a chain. No latches, no contention, scales linearly with cores. (2) Flash-friendly — flash storage hates in-place updates (every overwrite forces a block erase). Append-only writes line up perfectly with how SSDs/NVMe drives want to be written. Same insight that drove LSM-trees, but with the read profile of a B-tree.

Q2. Which engine has the best write throughput? ▾

LSM trees, by a wide margin. A write is just an in-memory append + WAL write — single-digit microseconds. B-trees pay 10–30× write amplification on every update because they rewrite full pages in place. Bw-trees are between the two — append-only but maintains a tree structure, so writes aren't quite as cheap as LSM but reads stay fast.

Q3. What's compaction and why does it matter? ▾

Compaction is the LSM-tree garbage collector. Old SSTables are merged into newer ones, eliminating obsolete versions and freeing space. It runs in the background. The catch — compaction is I/O-heavy and competes with foreground reads. A poorly-tuned LSM database can have great write throughput at midnight and terrible tail latency at noon when compaction kicks in.

Q4. Why does index choice change write cost so dramatically? ▾

Every index is a separate B-tree (or equivalent). Adding 10 indexed paths to a doc means each write touches 11 trees (the doc + 10 indexes). On a B-tree engine, that's 11 page updates. On a Bw-tree, 11 delta appends. The cost scales linearly with index count — which is why Cosmos's "index everything by default" policy is convenient but expensive. Lesson V07 explores opting out.

Q5. What's an SSTable? ▾

Sorted String Table — an immutable file of sorted key-value pairs, used by LSM-tree engines. Data is sorted on disk, with a sparse index in-memory pointing to block boundaries. Reads do a binary search on the index, then read the matching block. Used by RocksDB, LevelDB, Cassandra, ScyllaDB, BigTable.

Q6. When would I pick PostgreSQL (B-tree) over Cassandra (LSM)? ▾

PostgreSQL when read patterns are complex (joins, aggregates, ad-hoc queries) and writes are modest. Cassandra when writes dominate (telemetry, audit logs, time-series) and reads are predictable point/range queries. The engine's strengths match the workload — pick the wrong one and you'll spend years working around it.