A vector database stores high-dimensional vectors (the output of embedding models) and answers "find the K closest vectors to this query" in sub-millisecond at billion-scale. The trick is approximate nearest neighbor (ANN) algorithms — HNSW graphs for fast lookups, IVF for sharded search, product quantization for memory compression — combined with sharding, write-ahead logs, and metadata filtering for production.
Whenever you need to search by **meaning**, not keywords — semantic document search, RAG, recommendation systems, image / audio similarity, anomaly detection, fraud clustering, duplicate detection, code search. The deciding question — does your relevance signal live in an embedding model? If yes, vector DB. If your data is structured rows and you query with `WHERE`, stick with Postgres.
The problem nobody had until 2023
Before LLMs, your search was keywords. SQL LIKE, Elasticsearch BM25, maybe PostgreSQL full-text. All of it was about exact tokens or stems.
Embedding models changed that. Now you can ask:
A vector database is the storage and indexing system that makes this query fast at scale.
1. Why brute force doesn’t work
The naive answer: compute cosine(query, every_vector) and return the top K.
for v in vectors:
score = cosine(query, v)
return top_k_by_score
You need an index that prunes the search space.
2. HNSW — the graph approach
Hierarchical Navigable Small World graphs. Conceptually a multi-resolution skip list for vectors.
The data structure:
- A graph where each vector is a node, edges connect “close” neighbors
- Multiple layers — top layer has few nodes with long-range edges, bottom layer has every node with short-range edges
- Each node exists in some number of layers (drawn from a geometric distribution)
flowchart TD
subgraph L2 [Layer 2 — sparse, long edges]
A2((A))
E2((E))
A2 --- E2
end
subgraph L1 [Layer 1 — medium density]
A1((A))
C1((C))
E1((E))
H1((H))
A1 --- C1
C1 --- E1
E1 --- H1
A1 --- E1
end
subgraph L0 [Layer 0 — every node, short edges]
A0((A))
B0((B))
C0((C))
D0((D))
E0((E))
F0((F))
G0((G))
H0((H))
A0 --- B0 --- C0 --- D0
D0 --- E0 --- F0 --- G0 --- H0
B0 --- E0
C0 --- F0
end
Q[Query] -.enter at top.-> A2
A2 -.greedy.-> E2
E2 -.drop layer.-> E1
E1 -.greedy.-> H1
H1 -.drop layer.-> H0
style Q fill:#7e1d1d,stroke:#ef4444,color:#fff
style A2 fill:#0e7490,stroke:#06b6d4,color:#fff
style E2 fill:#0e7490,stroke:#06b6d4,color:#fff
style A1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
style C1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
style E1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
style H1 fill:#1e3a8a,stroke:#3b82f6,color:#fff
The query:
- Enter the top layer at a random point, greedily walk to the closest neighbor
- When you can’t find a closer neighbor at this layer, drop down a layer
- Repeat — by the bottom layer you’re a few hops from the global nearest neighbor
Build cost: O(N log N)
Query cost: O(log N)
Memory: ~2-4× the raw vectors (graph edges)
Recall@10: 95-99% with default params
3. IVF — the clustering approach
Inverted File Index. Conceptually k-means clustering, with a query-time shortcut.
Build phase:
- Train k-means on a sample to find C centroids (typically C = √N)
- Assign each vector to its nearest centroid (this is the “inverted file”)
Query phase:
- Find the
nprobenearest centroids to the query (nprobeis a knob) - Search only those clusters’ members
- Rank and return top K
Trade-offs:
nprobe = 1→ fastest, lowest recallnprobe = C→ equivalent to brute forcenprobe = √C→ typical setting, good balance
Build cost: O(N · iters) — k-means iterations
Query cost: O(N/C × nprobe)
Memory: 1× raw vectors + small centroid table
Recall@10: 85-95%, very tunable via nprobe
4. Product Quantization — the compression trick
Both HNSW and IVF still store vectors at full size. At billion-scale this is the memory-cost bottleneck. Product Quantization (PQ) compresses vectors 10–100× with controlled precision loss.
How:
- Split each vector into M sub-vectors (e.g., 1536 dims → 8 chunks of 192)
- Run k-means on each sub-vector independently with K centroids (typically K=256, so 1 byte per chunk)
- Replace each sub-vector with its centroid ID
PQ is usually combined with IVF (IVF-PQ) — IVF prunes which clusters to search, PQ makes searching each cluster cheap.
5. The recall-vs-latency knob
Every ANN algorithm has a knob — efSearch for HNSW, nprobe for IVF — that trades recall against latency.
Higher knob → searches more candidates → higher recall, slower
Lower knob → searches fewer candidates → lower recall, faster
Most teams pick a target recall (e.g., 95%) and tune the knob until they hit it. The right target depends on the use case:
| Use case | Target recall | Why |
|---|---|---|
| Recommendation | ~90% | You’ll show 20 results anyway |
| RAG | 95%+ | A missed chunk = wrong answer |
| Fraud / safety | 99%+ | A missed match has real cost |
6. Sharding — going from 100M to 100B vectors
Single-node ceiling is roughly 100M–500M vectors before you blow out memory or query latency. Above that, shard.
Two natural axes:
- Random sharding — hash by vector ID. Each shard handles a fraction of the corpus. Queries fan out to all shards, results merge. Simple but every query hits every shard.
- Cluster sharding — assign each IVF cluster to a single shard. Queries hit only the shards holding the closest centroids. More complex routing, but lower fan-out and better latency.
Modern systems (Milvus, Qdrant Cloud) auto-shard transparently. Pinecone hides it entirely.
7. The metadata filtering problem
Real queries are rarely “find similar vectors.” They’re “find similar vectors owned by this user, created in the last 7 days, of type A”.
Three approaches:
Post-filter. Run vector search, drop non-matches. Bad when filter is selective — you can run out of results.
Pre-filter. Apply metadata filter first using regular indexes, then vector-search the survivors. Bad when filter is non-selective — you discarded most of the corpus before the index could help.
Integrated filter. Walk the HNSW graph but skip filter-failing nodes.
8. Writes, durability, and the WAL
Production vector DBs aren’t just indexes — they’re databases. They need to handle:
- Writes — append to a write-ahead log, then apply to the index in batches
- Updates / deletes — tombstones in the index, compaction in the background
- Replication — async replicas for read scaling and failover
- Snapshots — periodic dumps for fast restore
9. Why pgvector might be enough
Not every project needs Pinecone. pgvector (Postgres extension) gives you HNSW + IVF + cosine similarity inside a regular Postgres database, with:
- Joins to your existing tables
- Transactions (rare in dedicated vector DBs)
- Single operational footprint
- Free
10. The whole system in one diagram
flowchart LR
subgraph Indexing [Indexing path]
D[Documents]
D --> EM1[Embedding model]
EM1 --> VM[Vector + metadata]
VM --> WAL[Write-ahead log]
WAL --> IDX[(ANN index<br/>HNSW / IVF-PQ)]
IDX -.compact.-> IDX
end
subgraph Query [Query path]
Q[User query]
Q --> EM2[Embedding model]
EM2 --> QV[Query vector]
QV --> S[Search<br/>+ metadata filter]
IDX -.top-K.-> S
S --> R[Top-K results]
R --> RR[Optional rerank]
end
style D fill:#1c2333,stroke:#475569,color:#e7eaf1
style EM1 fill:#581c87,stroke:#a855f7,color:#fff
style EM2 fill:#581c87,stroke:#a855f7,color:#fff
style VM fill:#1e3a8a,stroke:#3b82f6,color:#fff
style QV fill:#1e3a8a,stroke:#3b82f6,color:#fff
style WAL fill:#0f1320,stroke:#475569,color:#cdd3df
style IDX fill:#0e7490,stroke:#06b6d4,color:#fff
style S fill:#9a3412,stroke:#f97316,color:#fff
style R fill:#365314,stroke:#84cc16,color:#fff
style RR fill:#1c2333,stroke:#475569,color:#e7eaf1
style Q fill:#1c2333,stroke:#475569,color:#e7eaf1
Comments 0
Discuss this page. Markdown supported. Be kind.