Vector DB Internals — System Design

TL;DR

A vector database stores high-dimensional vectors (the output of embedding models) and answers "find the K closest vectors to this query" in sub-millisecond at billion-scale. The trick is approximate nearest neighbor (ANN) algorithms — HNSW graphs for fast lookups, IVF for sharded search, product quantization for memory compression — combined with sharding, write-ahead logs, and metadata filtering for production.

When to use

Whenever you need to search by **meaning**, not keywords — semantic document search, RAG, recommendation systems, image / audio similarity, anomaly detection, fraud clustering, duplicate detection, code search. The deciding question — does your relevance signal live in an embedding model? If yes, vector DB. If your data is structured rows and you query with `WHERE`, stick with Postgres.

The problem nobody had until 2023

Before LLMs, your search was keywords. SQL LIKE, Elasticsearch BM25, maybe PostgreSQL full-text. All of it was about exact tokens or stems.

Embedding models changed that. Now you can ask:

A vector database is the storage and indexing system that makes this query fast at scale.

1. Why brute force doesn’t work

The naive answer: compute cosine(query, every_vector) and return the top K.

for v in vectors:
    score = cosine(query, v)
return top_k_by_score

You need an index that prunes the search space.

2. HNSW — the graph approach

Hierarchical Navigable Small World graphs. Conceptually a multi-resolution skip list for vectors.

The data structure:

A graph where each vector is a node, edges connect “close” neighbors
Multiple layers — top layer has few nodes with long-range edges, bottom layer has every node with short-range edges
Each node exists in some number of layers (drawn from a geometric distribution)

flowchart TD
    subgraph L2 [Layer 2 — sparse, long edges]
        A2((A))
        E2((E))
        A2 --- E2
    end
    subgraph L1 [Layer 1 — medium density]
        A1((A))
        C1((C))
        E1((E))
        H1((H))
        A1 --- C1
        C1 --- E1
        E1 --- H1
        A1 --- E1
    end
    subgraph L0 [Layer 0 — every node, short edges]
        A0((A))
        B0((B))
        C0((C))
        D0((D))
        E0((E))
        F0((F))
        G0((G))
        H0((H))
        A0 --- B0 --- C0 --- D0
        D0 --- E0 --- F0 --- G0 --- H0
        B0 --- E0
        C0 --- F0
    end
    Q[Query] -.enter at top.-> A2
    A2 -.greedy.-> E2
    E2 -.drop layer.-> E1
    E1 -.greedy.-> H1
    H1 -.drop layer.-> H0

    style Q fill:#7e1d1d,stroke:#ef4444,color:#fff
    style A2 fill:#0e7490,stroke:#06b6d4,color:#fff
    style E2 fill:#0e7490,stroke:#06b6d4,color:#fff
    style A1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style C1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style E1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style H1 fill:#1e3a8a,stroke:#3b82f6,color:#fff

The query:

Enter the top layer at a random point, greedily walk to the closest neighbor
When you can’t find a closer neighbor at this layer, drop down a layer
Repeat — by the bottom layer you’re a few hops from the global nearest neighbor

Build cost:   O(N log N)
Query cost:   O(log N)
Memory:       ~2-4× the raw vectors (graph edges)
Recall@10:    95-99% with default params

3. IVF — the clustering approach

Inverted File Index. Conceptually k-means clustering, with a query-time shortcut.

Build phase:

Train k-means on a sample to find C centroids (typically C = √N)
Assign each vector to its nearest centroid (this is the “inverted file”)

Query phase:

Find the nprobe nearest centroids to the query (nprobe is a knob)
Search only those clusters’ members
Rank and return top K

Trade-offs:

nprobe = 1 → fastest, lowest recall
nprobe = C → equivalent to brute force
nprobe = √C → typical setting, good balance

Build cost:   O(N · iters) — k-means iterations
Query cost:   O(N/C × nprobe)
Memory:       1× raw vectors + small centroid table
Recall@10:    85-95%, very tunable via nprobe

4. Product Quantization — the compression trick

Both HNSW and IVF still store vectors at full size. At billion-scale this is the memory-cost bottleneck. Product Quantization (PQ) compresses vectors 10–100× with controlled precision loss.

How:

Split each vector into M sub-vectors (e.g., 1536 dims → 8 chunks of 192)
Run k-means on each sub-vector independently with K centroids (typically K=256, so 1 byte per chunk)
Replace each sub-vector with its centroid ID

PQ is usually combined with IVF (IVF-PQ) — IVF prunes which clusters to search, PQ makes searching each cluster cheap.

5. The recall-vs-latency knob

Every ANN algorithm has a knob — efSearch for HNSW, nprobe for IVF — that trades recall against latency.

Higher knob → searches more candidates → higher recall, slower
Lower knob  → searches fewer candidates → lower recall, faster

Most teams pick a target recall (e.g., 95%) and tune the knob until they hit it. The right target depends on the use case:

Use case	Target recall	Why
Recommendation	~90%	You’ll show 20 results anyway
RAG	95%+	A missed chunk = wrong answer
Fraud / safety	99%+	A missed match has real cost

6. Sharding — going from 100M to 100B vectors

Single-node ceiling is roughly 100M–500M vectors before you blow out memory or query latency. Above that, shard.

Two natural axes:

Random sharding — hash by vector ID. Each shard handles a fraction of the corpus. Queries fan out to all shards, results merge. Simple but every query hits every shard.
Cluster sharding — assign each IVF cluster to a single shard. Queries hit only the shards holding the closest centroids. More complex routing, but lower fan-out and better latency.

Modern systems (Milvus, Qdrant Cloud) auto-shard transparently. Pinecone hides it entirely.

7. The metadata filtering problem

Real queries are rarely “find similar vectors.” They’re “find similar vectors owned by this user, created in the last 7 days, of type A”.

Three approaches:

Post-filter. Run vector search, drop non-matches. Bad when filter is selective — you can run out of results.

Pre-filter. Apply metadata filter first using regular indexes, then vector-search the survivors. Bad when filter is non-selective — you discarded most of the corpus before the index could help.

Integrated filter. Walk the HNSW graph but skip filter-failing nodes.

8. Writes, durability, and the WAL

Production vector DBs aren’t just indexes — they’re databases. They need to handle:

Writes — append to a write-ahead log, then apply to the index in batches
Updates / deletes — tombstones in the index, compaction in the background
Replication — async replicas for read scaling and failover
Snapshots — periodic dumps for fast restore

9. Why pgvector might be enough

Not every project needs Pinecone. pgvector (Postgres extension) gives you HNSW + IVF + cosine similarity inside a regular Postgres database, with:

Joins to your existing tables
Transactions (rare in dedicated vector DBs)
Single operational footprint
Free

10. The whole system in one diagram

flowchart LR
    subgraph Indexing [Indexing path]
        D[Documents]
        D --> EM1[Embedding model]
        EM1 --> VM[Vector + metadata]
        VM --> WAL[Write-ahead log]
        WAL --> IDX[(ANN index<br/>HNSW / IVF-PQ)]
        IDX -.compact.-> IDX
    end

    subgraph Query [Query path]
        Q[User query]
        Q --> EM2[Embedding model]
        EM2 --> QV[Query vector]
        QV --> S[Search<br/>+ metadata filter]
        IDX -.top-K.-> S
        S --> R[Top-K results]
        R --> RR[Optional rerank]
    end

    style D fill:#1c2333,stroke:#475569,color:#e7eaf1
    style EM1 fill:#581c87,stroke:#a855f7,color:#fff
    style EM2 fill:#581c87,stroke:#a855f7,color:#fff
    style VM fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style QV fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style WAL fill:#0f1320,stroke:#475569,color:#cdd3df
    style IDX fill:#0e7490,stroke:#06b6d4,color:#fff
    style S fill:#9a3412,stroke:#f97316,color:#fff
    style R fill:#365314,stroke:#84cc16,color:#fff
    style RR fill:#1c2333,stroke:#475569,color:#e7eaf1
    style Q fill:#1c2333,stroke:#475569,color:#e7eaf1