← AI Systems
📐
AI Systems

Vector DB Internals

HNSW, IVF, and product quantization — how databases search billions of embeddings in milliseconds.

TL;DR

A vector database stores high-dimensional vectors (the output of embedding models) and answers "find the K closest vectors to this query" in sub-millisecond at billion-scale. The trick is approximate nearest neighbor (ANN) algorithms — HNSW graphs for fast lookups, IVF for sharded search, product quantization for memory compression — combined with sharding, write-ahead logs, and metadata filtering for production.

When to use

Whenever you need to search by **meaning**, not keywords — semantic document search, RAG, recommendation systems, image / audio similarity, anomaly detection, fraud clustering, duplicate detection, code search. The deciding question — does your relevance signal live in an embedding model? If yes, vector DB. If your data is structured rows and you query with `WHERE`, stick with Postgres.

The problem nobody had until 2023

Before LLMs, your search was keywords. SQL LIKE, Elasticsearch BM25, maybe PostgreSQL full-text. All of it was about exact tokens or stems.

Embedding models changed that. Now you can ask:

A vector database is the storage and indexing system that makes this query fast at scale.

1. Why brute force doesn’t work

The naive answer: compute cosine(query, every_vector) and return the top K.

for v in vectors:
    score = cosine(query, v)
return top_k_by_score

You need an index that prunes the search space.

2. HNSW — the graph approach

Hierarchical Navigable Small World graphs. Conceptually a multi-resolution skip list for vectors.

The data structure:

  • A graph where each vector is a node, edges connect “close” neighbors
  • Multiple layers — top layer has few nodes with long-range edges, bottom layer has every node with short-range edges
  • Each node exists in some number of layers (drawn from a geometric distribution)
flowchart TD
    subgraph L2 [Layer 2 — sparse, long edges]
        A2((A))
        E2((E))
        A2 --- E2
    end
    subgraph L1 [Layer 1 — medium density]
        A1((A))
        C1((C))
        E1((E))
        H1((H))
        A1 --- C1
        C1 --- E1
        E1 --- H1
        A1 --- E1
    end
    subgraph L0 [Layer 0 — every node, short edges]
        A0((A))
        B0((B))
        C0((C))
        D0((D))
        E0((E))
        F0((F))
        G0((G))
        H0((H))
        A0 --- B0 --- C0 --- D0
        D0 --- E0 --- F0 --- G0 --- H0
        B0 --- E0
        C0 --- F0
    end
    Q[Query] -.enter at top.-> A2
    A2 -.greedy.-> E2
    E2 -.drop layer.-> E1
    E1 -.greedy.-> H1
    H1 -.drop layer.-> H0

    style Q fill:#7e1d1d,stroke:#ef4444,color:#fff
    style A2 fill:#0e7490,stroke:#06b6d4,color:#fff
    style E2 fill:#0e7490,stroke:#06b6d4,color:#fff
    style A1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style C1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style E1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style H1 fill:#1e3a8a,stroke:#3b82f6,color:#fff

The query:

  1. Enter the top layer at a random point, greedily walk to the closest neighbor
  2. When you can’t find a closer neighbor at this layer, drop down a layer
  3. Repeat — by the bottom layer you’re a few hops from the global nearest neighbor
Build cost:   O(N log N)
Query cost:   O(log N)
Memory:       ~2-4× the raw vectors (graph edges)
Recall@10:    95-99% with default params

3. IVF — the clustering approach

Inverted File Index. Conceptually k-means clustering, with a query-time shortcut.

Build phase:

  1. Train k-means on a sample to find C centroids (typically C = √N)
  2. Assign each vector to its nearest centroid (this is the “inverted file”)

Query phase:

  1. Find the nprobe nearest centroids to the query (nprobe is a knob)
  2. Search only those clusters’ members
  3. Rank and return top K

Trade-offs:

  • nprobe = 1 → fastest, lowest recall
  • nprobe = C → equivalent to brute force
  • nprobe = √C → typical setting, good balance
Build cost:   O(N · iters) — k-means iterations
Query cost:   O(N/C × nprobe)
Memory:       1× raw vectors + small centroid table
Recall@10:    85-95%, very tunable via nprobe

4. Product Quantization — the compression trick

Both HNSW and IVF still store vectors at full size. At billion-scale this is the memory-cost bottleneck. Product Quantization (PQ) compresses vectors 10–100× with controlled precision loss.

How:

  1. Split each vector into M sub-vectors (e.g., 1536 dims → 8 chunks of 192)
  2. Run k-means on each sub-vector independently with K centroids (typically K=256, so 1 byte per chunk)
  3. Replace each sub-vector with its centroid ID

PQ is usually combined with IVF (IVF-PQ) — IVF prunes which clusters to search, PQ makes searching each cluster cheap.

5. The recall-vs-latency knob

Every ANN algorithm has a knob — efSearch for HNSW, nprobe for IVF — that trades recall against latency.

Higher knob → searches more candidates → higher recall, slower
Lower knob  → searches fewer candidates → lower recall, faster

Most teams pick a target recall (e.g., 95%) and tune the knob until they hit it. The right target depends on the use case:

Use caseTarget recallWhy
Recommendation~90%You’ll show 20 results anyway
RAG95%+A missed chunk = wrong answer
Fraud / safety99%+A missed match has real cost

6. Sharding — going from 100M to 100B vectors

Single-node ceiling is roughly 100M–500M vectors before you blow out memory or query latency. Above that, shard.

Two natural axes:

  • Random sharding — hash by vector ID. Each shard handles a fraction of the corpus. Queries fan out to all shards, results merge. Simple but every query hits every shard.
  • Cluster sharding — assign each IVF cluster to a single shard. Queries hit only the shards holding the closest centroids. More complex routing, but lower fan-out and better latency.

Modern systems (Milvus, Qdrant Cloud) auto-shard transparently. Pinecone hides it entirely.

7. The metadata filtering problem

Real queries are rarely “find similar vectors.” They’re “find similar vectors owned by this user, created in the last 7 days, of type A”.

Three approaches:

Post-filter. Run vector search, drop non-matches. Bad when filter is selective — you can run out of results.

Pre-filter. Apply metadata filter first using regular indexes, then vector-search the survivors. Bad when filter is non-selective — you discarded most of the corpus before the index could help.

Integrated filter. Walk the HNSW graph but skip filter-failing nodes.

8. Writes, durability, and the WAL

Production vector DBs aren’t just indexes — they’re databases. They need to handle:

  • Writes — append to a write-ahead log, then apply to the index in batches
  • Updates / deletes — tombstones in the index, compaction in the background
  • Replication — async replicas for read scaling and failover
  • Snapshots — periodic dumps for fast restore

9. Why pgvector might be enough

Not every project needs Pinecone. pgvector (Postgres extension) gives you HNSW + IVF + cosine similarity inside a regular Postgres database, with:

  • Joins to your existing tables
  • Transactions (rare in dedicated vector DBs)
  • Single operational footprint
  • Free

10. The whole system in one diagram

flowchart LR
    subgraph Indexing [Indexing path]
        D[Documents]
        D --> EM1[Embedding model]
        EM1 --> VM[Vector + metadata]
        VM --> WAL[Write-ahead log]
        WAL --> IDX[(ANN index<br/>HNSW / IVF-PQ)]
        IDX -.compact.-> IDX
    end

    subgraph Query [Query path]
        Q[User query]
        Q --> EM2[Embedding model]
        EM2 --> QV[Query vector]
        QV --> S[Search<br/>+ metadata filter]
        IDX -.top-K.-> S
        S --> R[Top-K results]
        R --> RR[Optional rerank]
    end

    style D fill:#1c2333,stroke:#475569,color:#e7eaf1
    style EM1 fill:#581c87,stroke:#a855f7,color:#fff
    style EM2 fill:#581c87,stroke:#a855f7,color:#fff
    style VM fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style QV fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style WAL fill:#0f1320,stroke:#475569,color:#cdd3df
    style IDX fill:#0e7490,stroke:#06b6d4,color:#fff
    style S fill:#9a3412,stroke:#f97316,color:#fff
    style R fill:#365314,stroke:#84cc16,color:#fff
    style RR fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Q fill:#1c2333,stroke:#475569,color:#e7eaf1
📺 Video

A YouTube walkthrough — coming once the deep-dive video is published.

🧪 Simulator

An interactive simulator for this concept is on the way — tweak the knobs, watch behaviour change in real time.

🎨 Visualization

An interactive diagram you can hover, click, and explore.

💻 Code

A 30-line build challenge with starter code, hints, and a reference implementation.

🎯 Common interview questions
Q1. HNSW vs IVF — when do you pick which?

HNSW is faster at query time and gives higher recall, but uses more memory and is slow to build. IVF is cheaper to index, smaller in memory, supports easy sharding, but takes a hit on recall at very low search-list sizes. Pick HNSW for read-heavy workloads under ~100M vectors. Pick IVF (often combined with PQ) for billion-scale where memory matters more than the last 2% of recall.

Q2. What is product quantization actually doing?

It splits a vector (say, 1536 dims) into M sub-vectors (say, 8 groups of 192 dims), then runs k-means on each sub-vector independently to learn a codebook of K centroids (say, K=256). Each sub-vector is replaced by its centroid index — 8 bits instead of 192 floats. The whole vector compresses from 6 KB to 8 bytes — a 750× reduction. Distance is computed via lookup tables on the codebooks, not the raw vectors.

Q3. How do you keep an HNSW index fresh when documents change?

HNSW supports inserts efficiently, but deletes are tombstones (the node stays in the graph, marked deleted). Over time the graph degrades. The fix is periodic **compaction** — rebuild the index in the background from current data, then atomically swap. Most production systems run two indexes — the live one accepting writes, and a shadow one being rebuilt. After build completes, traffic switches.

Q4. Why does cosine similarity matter more than Euclidean distance for text embeddings?

Embeddings from text models have variable magnitude — longer or more "energetic" sentences produce longer vectors. Cosine similarity normalizes by magnitude, so it compares pure direction. Two sentences with the same meaning but different lengths land in the same direction even at different magnitudes. Euclidean distance would penalize the magnitude difference, leaking noise into the relevance signal. (Most embedding models output normalized vectors anyway, in which case cosine and Euclidean rank-order identically.)

Q5. How do you do filtered search efficiently — "vectors from this user, similar to this query"?

Two approaches. **Pre-filter**, where you first apply the metadata filter (using a regular B-tree index on user_id), then do vector search on the survivors. Best when filters are highly selective. **Post-filter**, where you do vector search first, then drop non-matches. Best when filters are non-selective. Modern vector DBs (Qdrant, Weaviate, Pinecone) support **integrated filtering** that interleaves the two — traverse the HNSW graph, but skip nodes that fail the filter. This avoids both pre-filter's "small candidate set" problem and post-filter's "discarded a lot of good results" problem.

↗ Related concepts

Comments 0

Discuss this page. Markdown supported. Be kind.

Loading…
Loading comments…