Retrieval-Augmented Generation grounds an LLM in your data without retraining it. You chunk your documents, embed each chunk into a vector, store them in a vector DB, embed the user's question with the same model, retrieve the top-K most similar chunks, optionally rerank them with a cross-encoder, stuff the survivors into the prompt, and let the LLM answer with citations. The model isn't smarter — the pipeline is.
Use RAG any time you want an LLM to answer questions about data it wasn't trained on: internal docs, support tickets, product catalogs, code, legal contracts, medical records, fresh news. RAG is also the right tool when answers must be **citable** (regulated industries, customer support, legal, medical). It's the wrong tool when you need persistent reasoning across a long thread (use long context or fine-tuning) or when the data is so small it fits in the prompt directly.
The problem RAG solves
You give an LLM a question about your company’s HR policy. It answers confidently — and wrongly. Not maliciously, just statistically. The model has no idea what your policy actually says. It saw thousands of HR docs in training and reaches for the most plausible-sounding answer.
This is the grounding problem. Without something to ground its output in, an LLM is a fluent guesser.
There are three ways to fix it:
- Fine-tune — bake your data into the model’s weights (slow, expensive, has to be redone for every update)
- Long context — paste all your data into the prompt (costs scale linearly, fails at hundreds of pages)
- Retrieval-Augmented Generation — fetch only the relevant data on demand (fast, cheap, fresh)
RAG is the dominant production pattern because it’s the only one that scales to millions of docs, updates instantly, and works with any LLM.
The pipeline at a glance
Every RAG system, from a 10-line LangChain demo to OpenAI’s enterprise stack, is the same shape:
flowchart LR
subgraph Indexing [Indexing — done once / on update]
D[Documents<br/>PDFs, wikis, code]
D --> C[Chunk]
C --> E1[Embed]
E1 --> V[(Vector DB)]
end
subgraph Query [Query — every request]
Q[User question]
Q --> E2[Embed]
E2 --> R[Retrieve top-K]
R --> RR[Rerank]
RR --> P[Prompt assembly]
P --> L[LLM]
L --> A[Answer + citations]
end
E1 -.shared embedding<br/>model.-> E2
V -.top-K chunks.-> R
style D fill:#1c2333,stroke:#475569,color:#e7eaf1
style C fill:#1e3a8a,stroke:#3b82f6,color:#fff
style E1 fill:#581c87,stroke:#a855f7,color:#fff
style E2 fill:#581c87,stroke:#a855f7,color:#fff
style V fill:#0f1320,stroke:#475569,color:#cdd3df
style Q fill:#1c2333,stroke:#475569,color:#e7eaf1
style R fill:#0e7490,stroke:#06b6d4,color:#fff
style RR fill:#9a3412,stroke:#f97316,color:#fff
style P fill:#365314,stroke:#84cc16,color:#fff
style L fill:#7e1d1d,stroke:#ef4444,color:#fff
style A fill:#1c2333,stroke:#475569,color:#e7eaf1
1. Chunking — splitting your knowledge into searchable units
You start with documents. PDFs, wikis, ticket histories, codebases, contracts. Step one is to split them into chunks.
Why chunk? Two reasons:
- Embedding models have a context limit (typically 512–8192 tokens)
- Retrieval should return the paragraph that answers the question, not the whole 50-page document
Chunking strategies, ranked by sophistication:
1. Fixed-size — every 512 tokens, hard cut. Simple, lossy at boundaries.
2. Sliding overlap — 512 tokens with 50-token overlap. Standard.
3. Semantic — split at sentence boundaries, group sentences with similar embeddings.
4. Structural — for code: split by function. For markdown: by heading. For tables: by row.
5. Hierarchical — chunk at multiple resolutions (paragraph + section + chapter), retrieve hierarchically.
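As a minimal sketch of strategies 1 and 2 above: the function below cuts a token list into fixed-size windows, and a non-zero `overlap` turns it into the sliding variant. The names `chunk_tokens` and `chunk_text` are hypothetical, and whitespace-separated words stand in for real model tokens, which is an assumption made only to keep the example dependency-free.

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens.

    Each window shares `overlap` tokens with the previous one;
    overlap=0 gives plain fixed-size chunking.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_text(text, size=512, overlap=50):
    # Assumption: whitespace-separated words approximate model tokens.
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, size, overlap)]


print(chunk_text("a b c d e", size=3, overlap=1))  # ['a b c', 'c d e']
```

In a real pipeline you would likely count tokens with the embedding model's own tokenizer, since word counts and token counts can differ noticeably.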
2. Embedding — turning text into vectors
Each chunk passes through an embedding model: a smaller transformer used as a bi-encoder (text-embedding-3-small is one example) that outputs a fixed-dimensional vector, typically 768 or 1536 dimensions.
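The magic property of embedding spaces: texts with similar meanings map to nearby vectors, so semantic similarity becomes geometric closeness (usually measured with cosine similarity).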
Cost calculus: embedding 1 million chunks at 512 tokens each is about 512M tokens. At OpenAI text-embedding-3-small prices ($0.02/M tokens), that’s roughly $10. Cheap.
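The arithmetic above is easy to rerun for your own corpus. The helper below is a hypothetical sketch; the price is the per-million-token figure quoted in the text and should be checked against current pricing before you rely on it.

```python
def embedding_cost_usd(num_chunks, tokens_per_chunk, usd_per_million_tokens):
    """Return (total tokens, total cost in USD) for a one-off indexing run."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens, total_tokens / 1_000_000 * usd_per_million_tokens


tokens, cost = embedding_cost_usd(1_000_000, 512, 0.02)
print(f"{tokens:,} tokens -> ${cost:.2f}")  # 512,000,000 tokens -> $10.24
```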
3. Storing — the vector database
A million chunks × 1536 dimensions × 4 bytes = 6 GB. Not huge. But the hard part is searching it fast.
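Brute force (compute cosine similarity to every vector) is O(N) per query. At a million chunks that means scanning all 6 GB for every question, which is already too slow for interactive traffic. Vector DBs use approximate nearest neighbor (ANN) indexes such as HNSW graphs and IVF clusters, often combined with product quantization to compress the vectors (see the Vector DB article for details), to drop this to roughly O(log N) or O(√N) at the price of occasionally missing a true neighbor.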
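For reference, this is roughly what the brute-force baseline looks like: score every stored vector against the query and keep the best k. It is a sketch of the exact search that ANN indexes approximate, not of any particular vector DB; the function names and the toy 3-dimensional vectors are made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k):
    """Exact nearest neighbours over `vectors` (chunk id -> embedding).

    Every stored vector is scored, so the cost is O(N * d) per query.
    """
    scored = ((cosine_similarity(query, vec), chunk_id)
              for chunk_id, vec in vectors.items())
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]


# Toy index with invented vectors.
index = {
    "remote-work": [0.9, 0.1, 0.0],
    "vacation": [0.1, 0.9, 0.0],
    "expenses": [0.0, 0.2, 0.9],
}
print(brute_force_top_k([1.0, 0.0, 0.0], index, k=2))
```

A loop like this is still useful in practice as a ground truth for measuring how much recall an ANN index gives up.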
Production options: Pinecone, Weaviate, Milvus, Qdrant, pgvector for Postgres, Elasticsearch with kNN. Pick based on whether you want managed (Pinecone), open-source (Weaviate, Qdrant), or already-have-Postgres (pgvector).
4. Querying — the user’s question
A user asks “What’s our remote-work policy?” That question goes through the same embedding model as the chunks, producing a query vector.
The vector DB returns the top-K most similar chunks (typically K=20 to K=50).
5. Reranking — precision over recall
Vector search optimizes for recall (find the right chunks somewhere in the top 50) but is mediocre at precision (which of those 50 is most relevant?). The fix is a cross-encoder reranker.
Vector DB: fast, top-50, recall ~95%, precision ~30%
Reranker: slow, top-50 → top-5, precision ~80%
A cross-encoder takes the query and a candidate together, processes them as one sequence, and outputs a single relevance score. Cohere Rerank, BGE Reranker, or ms-marco-MiniLM are common. You only run it on the 50 vector-DB hits, so the latency is bounded.
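The control flow of the rerank stage can be sketched independently of any model: score each (query, candidate) pair jointly, sort, and keep the top few. In the sketch below, `rerank` accepts any pair-scoring function; `token_overlap_score` is a deliberately crude stand-in so the example runs without a model, and it is not a cross-encoder. In a real system you would pass in a function that calls one of the rerankers named above.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-order candidates by a joint (query, text) relevance score.

    candidates: list of (chunk_id, text) from the vector DB.
    score_pair: callable (query, text) -> float; higher means more relevant.
    """
    scored = [(score_pair(query, text), chunk_id, text)
              for chunk_id, text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk_id, text, score) for score, chunk_id, text in scored[:top_n]]


def token_overlap_score(query, text):
    # Stand-in scorer, NOT a cross-encoder: share of query words found in text.
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


hits = [
    ("a", "the cafeteria menu"),
    ("b", "remote work policy for employees"),
    ("c", "work schedule"),
]
for chunk_id, text, score in rerank("remote work policy", hits, token_overlap_score, top_n=2):
    print(chunk_id, round(score, 2), text)
```

Keeping the scorer as a parameter makes it easy to swap rerankers and compare them on the same candidate set.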
6. Prompt assembly — the augmented prompt
Now you have ~5 high-precision chunks. They get stuffed into a prompt template:
You are a helpful assistant. Answer the user's question using ONLY the
context below. Cite each claim with the number of the chunk that supports
it, e.g. [1]. If the answer isn't in the context, say "I don't know."
CONTEXT:
[1] {chunk_1_text}
[2] {chunk_2_text}
[3] {chunk_3_text}
[4] {chunk_4_text}
[5] {chunk_5_text}
QUESTION: {user_question}
ANSWER:
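Filling a template like this is plain string assembly. The hypothetical `build_prompt` below numbers the reranked chunks from 1 so the citation markers in the answer can be mapped back to chunk ids afterwards; the exact instruction wording is an assumption modelled on the template above.

```python
def build_prompt(question, chunks):
    """Assemble the augmented prompt.

    chunks: list of chunk texts, already reranked, best first.
    """
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "You are a helpful assistant. Answer the user's question using ONLY the\n"
        "context below. Cite each claim with the number of the chunk that supports\n"
        "it, e.g. [1]. If the answer isn't in the context, say \"I don't know.\"\n"
        "\n"
        f"CONTEXT:\n{context}\n"
        "\n"
        f"QUESTION: {question}\n"
        "ANSWER:"
    )


print(build_prompt(
    "What's our remote-work policy?",
    ["Employees may work remotely up to three days a week.",  # invented sample text
     "Remote days must be agreed with the team lead."],
))
```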
7. Generation — the LLM does its thing
The augmented prompt goes to GPT-4, Claude, Gemini, Llama — whichever LLM you’ve picked. It generates an answer constrained by the context.
Same point as before, restated because it’s that important: the model isn’t smarter, the pipeline is.
8. Evaluation — is it working?
Two layers:
Retrieval evals. A labeled set of (question, ideal_chunk_ids). Measure recall@K, MRR, NDCG. If retrieval misses the right chunks, generation can’t recover.
Faithfulness evals. Does the generated answer actually claim only what the retrieved chunks support? Use a judge model (smaller LLM that scores faithfulness) on a sample of outputs.
A reasonable target to aim for: 92%+ retrieval recall@10 and an 85%+ faithfulness score. Below that, users start to lose trust.
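Two of the retrieval metrics mentioned above, recall@K and MRR, are simple enough to compute by hand. The sketch below assumes each labeled example is a pair of (retrieved chunk ids in ranked order, ideal chunk ids); the function names are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """runs: list of (retrieved_ids, ideal_chunk_ids), one pair per question."""
    if not runs:
        raise ValueError("need at least one labeled question")
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, ideal, k) for r, ideal in runs) / n,
        "mrr": sum(reciprocal_rank(r, ideal) for r, ideal in runs) / n,
    }


labeled_runs = [
    (["a", "b", "c"], ["b"]),       # relevant chunk found at rank 2
    (["x", "y", "z"], ["z", "q"]),  # one of two relevant chunks, at rank 3
]
print(evaluate(labeled_runs, k=2))
```

Tracking these numbers per release makes retrieval regressions visible before they show up as wrong answers.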
9. Real-world wrinkles
Hybrid search. Combine vector search with BM25 keyword search. Vector handles semantics; BM25 handles exact-match terms (product codes, names, version numbers). Merge the two result lists by summing normalized scores (BM25 and cosine scores live on different scales) or by reranking the union.
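One common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). It is named here as one option rather than the method this article prescribes. RRF uses only rank positions, so it sidesteps the question of how to compare cosine scores with BM25 scores. The constant `k=60` is the conventional default, used here as an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists (best first) into one ranking.

    Each list contributes 1 / (k + rank) to an id's fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # made-up ids from the vector index
bm25_hits = ["c", "a", "d"]    # made-up ids from keyword search
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))  # ['a', 'c', 'b', 'd']
```

Ids that appear high in both lists rise to the top, while an id found by only one retriever still survives into the merged list.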
Metadata filtering. Most queries should pre-filter by metadata (user_id, doc_type, date range) before vector search, not after. Filtering after means you might get zero results when the unfiltered top-K had nothing in your scope.
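The difference between filtering before and after the search shows up even on a three-item toy corpus. The sketch below uses an in-memory list and a dot-product score purely for illustration; real vector DBs implement filtered search inside the index, and all names and data here are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k):
    """items: list of dicts with 'id', 'vec' and 'meta'. Best score first."""
    return sorted(items, key=lambda it: dot(query, it["vec"]), reverse=True)[:k]


def post_filter_search(query, items, k, allowed):
    # Search everything, then drop hits outside the caller's scope.
    return [it["id"] for it in top_k(query, items, k) if allowed(it["meta"])]


def pre_filter_search(query, items, k, allowed):
    # Restrict the candidate set first, then search inside it.
    in_scope = [it for it in items if allowed(it["meta"])]
    return [it["id"] for it in top_k(query, in_scope, k)]


def is_hr(meta):
    return meta["team"] == "hr"


# Made-up corpus: the two closest chunks belong to another team.
corpus = [
    {"id": "sales-1", "vec": [0.9, 0.1], "meta": {"team": "sales"}},
    {"id": "sales-2", "vec": [0.8, 0.2], "meta": {"team": "sales"}},
    {"id": "hr-1", "vec": [0.6, 0.4], "meta": {"team": "hr"}},
]
query = [1.0, 0.0]
print(post_filter_search(query, corpus, 2, is_hr))  # []
print(pre_filter_search(query, corpus, 2, is_hr))   # ['hr-1']
```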
Multi-hop questions. “Compare our 2023 and 2024 vacation policies.” Single-shot retrieval finds one or the other but rarely both. Multi-hop RAG breaks this into sub-questions, retrieves separately, then synthesizes.
Updating the corpus. Use content hashes per chunk. Re-embed only changed chunks. Soft-delete removed docs (filter at query time), then compact periodically.
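The hash-based update can be sketched as a diff between what was indexed last time and what the corpus contains now. The helper names and the dict-based bookkeeping below are assumptions; a real system would keep the hashes next to the vectors in the database.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(stored_hashes, current_chunks):
    """Decide what to re-embed and what to soft-delete.

    stored_hashes: chunk_id -> hash recorded at the last indexing run.
    current_chunks: chunk_id -> current chunk text.
    """
    to_embed = [chunk_id for chunk_id, text in current_chunks.items()
                if stored_hashes.get(chunk_id) != content_hash(text)]
    to_soft_delete = [chunk_id for chunk_id in stored_hashes
                      if chunk_id not in current_chunks]
    return to_embed, to_soft_delete


stored = {"a": content_hash("x"), "b": content_hash("y"), "c": content_hash("z")}
current = {"a": "x", "b": "y (edited)", "d": "brand new"}
print(plan_update(stored, current))  # (['b', 'd'], ['c'])
```

Unchanged chunks (here `a`) cost nothing, which is what keeps frequent corpus updates cheap.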
10. What RAG is becoming
Pure vector retrieval is getting replaced by agentic RAG — the LLM decides what to retrieve, when, and from which source. It’s also being merged with structured retrieval (SQL, knowledge graphs) for hybrid factual + semantic search.
But the core pipeline — chunk, embed, retrieve, rerank, ground, generate, cite — is now table stakes.