RAG Pipeline

Retrieval-augmented generation: how chatbots cite sources without hallucinating.

TL;DR

Retrieval-Augmented Generation grounds an LLM in your data without retraining it. You chunk your documents, embed each chunk into a vector, store them in a vector DB, embed the user's question with the same model, retrieve the top-K most similar chunks, optionally rerank them with a cross-encoder, stuff the survivors into the prompt, and let the LLM answer with citations. The model isn't smarter — the pipeline is.

When to use

Any time you want an LLM to answer questions about data it wasn't trained on — internal docs, support tickets, product catalogs, code, legal contracts, medical records, fresh news. RAG is also the right tool when answers must be **citable** (regulated industries, customer support, legal, medical). It's the wrong tool when you need persistent reasoning across a long thread (use long context or fine-tuning) or when the data is so small it fits in the prompt directly.

The problem RAG solves

You give an LLM a question about your company’s HR policy. It answers confidently — and wrongly. Not maliciously, just statistically. The model has no idea what your policy actually says. It saw thousands of HR docs in training and reaches for the most plausible-sounding answer.

This is the grounding problem. Without something to ground its output in, an LLM is a fluent guesser.

There are three ways to fix it:

Fine-tune — bake your data into the model’s weights (slow, expensive, has to be redone for every update)
Long context — paste all your data into the prompt (costs scale linearly, fails at hundreds of pages)
Retrieval-Augmented Generation — fetch only the relevant data on demand (fast, cheap, fresh)

RAG is the dominant production pattern because it’s the only one that scales to millions of docs, updates instantly, and works with any LLM.

The pipeline at a glance

Every RAG system, from a 10-line LangChain demo to OpenAI’s enterprise stack, is the same shape:

flowchart LR
    subgraph Indexing [Indexing — done once / on update]
        D[Documents<br/>PDFs, wikis, code]
        D --> C[Chunk]
        C --> E1[Embed]
        E1 --> V[(Vector DB)]
    end

    subgraph Query [Query — every request]
        Q[User question]
        Q --> E2[Embed]
        E2 --> R[Retrieve top-K]
        R --> RR[Rerank]
        RR --> P[Prompt assembly]
        P --> L[LLM]
        L --> A[Answer + citations]
    end

    E1 -.shared embedding<br/>model.-> E2
    V -.top-K chunks.-> R

    style D fill:#1c2333,stroke:#475569,color:#e7eaf1
    style C fill:#1e3a8a,stroke:#3b82f6,color:#fff
    style E1 fill:#581c87,stroke:#a855f7,color:#fff
    style E2 fill:#581c87,stroke:#a855f7,color:#fff
    style V fill:#0f1320,stroke:#475569,color:#cdd3df
    style Q fill:#1c2333,stroke:#475569,color:#e7eaf1
    style R fill:#0e7490,stroke:#06b6d4,color:#fff
    style RR fill:#9a3412,stroke:#f97316,color:#fff
    style P fill:#365314,stroke:#84cc16,color:#fff
    style L fill:#7e1d1d,stroke:#ef4444,color:#fff
    style A fill:#1c2333,stroke:#475569,color:#e7eaf1

1. Chunking — splitting your knowledge into searchable units

You start with documents. PDFs, wikis, ticket histories, codebases, contracts. Step one is to split them into chunks.

Why chunk? Two reasons:

Embedding models have a context limit (typically 512–8192 tokens)
Retrieval should return the paragraph that answers the question, not the whole 50-page document

Chunking strategies, ranked by sophistication:

1. Fixed-size — every 512 tokens, hard cut. Simple, lossy at boundaries.
2. Sliding overlap — 512 tokens with 50-token overlap. Standard.
3. Semantic — split at sentence boundaries, group sentences with similar embeddings.
4. Structural — for code: split by function. For markdown: by heading. For tables: by row.
5. Hierarchical — chunk at multiple resolutions (paragraph + section + chapter), retrieve hierarchically.
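Strategy 2 is simple enough to sketch in a few lines. This is a minimal illustration, not a production chunker: the function name `chunk_tokens` is made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for the embedding model's tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.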

2. Embedding — turning text into vectors

Each chunk passes through an embedding model: a smaller transformer used as a bi-encoder (each text is encoded on its own), such as text-embedding-3-small. It outputs a fixed-dimensional vector, typically 768 or 1536 dimensions.

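The magic property of embedding spaces: texts with similar meaning land close together, so “remote-work policy” and “working from home rules” get nearby vectors even though they share almost no words.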

Cost calculus: embedding 1 million chunks at 512 tokens each is ~500M tokens. At OpenAI text-embedding-3-small prices (~$0.02 per million tokens), that’s about $10. Cheap.
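The same arithmetic as a tiny helper, so you can plug in your own corpus size. The function name is hypothetical and the price is the figure quoted above; check current pricing before budgeting.

```python
def embedding_cost_usd(num_chunks, tokens_per_chunk, usd_per_million_tokens):
    """Return (total tokens, estimated cost in USD) for embedding a corpus once."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens, total_tokens / 1_000_000 * usd_per_million_tokens


total_tokens, cost = embedding_cost_usd(1_000_000, 512, 0.02)
print(f"{total_tokens:,} tokens, about ${cost:.2f}")
```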

3. Storing — the vector database

A million chunks × 1536 dimensions × 4 bytes (float32) ≈ 6 GB. Not huge. The hard part is searching it fast.

Brute force (compute cosine similarity against every vector) is O(N) per query. At a million chunks that is already too slow for high query volumes, and it only gets worse as the corpus grows. Vector DBs use approximate nearest neighbor (ANN) indexes, such as HNSW graphs, IVF clusters, or product quantization (see the Vector DB article for details), to bring this down to roughly O(log N) or O(√N).
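For reference, here is what the brute-force baseline looks like. It is a plain-Python sketch with made-up three-dimensional vectors. Real systems use optimized libraries and ANN indexes, but the exact version is useful as ground truth when you measure how much recall an ANN index gives up.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_top_k(query, vectors, k):
    """Exact search: score every stored vector, keep the k best. O(N) per query."""
    scored = ((cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


# Toy index: the ids and vectors are invented for illustration.
vectors = {
    "remote-work": [0.9, 0.1, 0.0],
    "vacation": [0.1, 0.9, 0.0],
    "expenses": [0.0, 0.2, 0.9],
}
hits = brute_force_top_k([1.0, 0.2, 0.0], vectors, k=2)
print(hits)
```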

Production options: Pinecone, Weaviate, Milvus, Qdrant, pgvector for Postgres, Elasticsearch with kNN. Pick based on whether you want managed (Pinecone), open-source (Weaviate, Milvus, Qdrant), or already-have-Postgres (pgvector).

4. Querying — the user’s question

A user asks “What’s our remote-work policy?” That question goes through the same embedding model as the chunks, producing a query vector.

The vector DB returns the top-K most similar chunks (typically K=20 to K=50).

5. Reranking — precision over recall

Vector search optimizes for recall (find the right chunks somewhere in the top 50) but is mediocre at precision (which of those 50 is most relevant?). The fix is a cross-encoder reranker.

Vector DB:    fast, top-50, recall ~95%, precision ~30%
Reranker:     slow, top-50 → top-5, precision ~80%

A cross-encoder takes the query and a candidate together, processes them as one sequence, and outputs a single relevance score. Cohere Rerank, BGE Reranker, and the ms-marco-MiniLM cross-encoders are common choices. You only run it on the 50 vector-DB hits, so the latency is bounded.
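The retrieve-then-rerank shape can be sketched without any model at all. In the example below, `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the code runs on its own. In a real pipeline you would pass a function that calls your cross-encoder instead.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Score each (query, candidate) pair jointly and keep the top_n chunk ids."""
    scored = [(score_pair(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query, text):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Pretend these came back from the vector DB, in vector-similarity order.
candidates = [
    ("c1", "Office dress code and badge policy"),
    ("c2", "Remote work policy allows employees to work remotely three days a week"),
    ("c3", "Work travel reimbursement policy"),
]
top = rerank("remote work policy", candidates, overlap_score, top_n=2)
print(top)
```

The design point is that the scorer sees the query and the candidate together, which is what lets a real cross-encoder judge relevance more precisely than two independently computed vectors.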

6. Prompt assembly — the augmented prompt

Now you have ~5 high-precision chunks. They get stuffed into a prompt template:

You are a helpful assistant. Answer the user's question using ONLY the
context below. Cite each claim with [[chunk_id]]. If the answer isn't
in the context, say "I don't know."

CONTEXT:
[1] {chunk_1_text}
[2] {chunk_2_text}
[3] {chunk_3_text}
[4] {chunk_4_text}
[5] {chunk_5_text}

QUESTION: {user_question}

ANSWER:
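Filling that template is plain string formatting. A minimal sketch, assuming the reranked chunks arrive as a list of strings; `build_prompt` and `PROMPT_TEMPLATE` are names chosen for this example.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the user's question using ONLY the
context below. Cite each claim with [[chunk_id]]. If the answer isn't
in the context, say "I don't know."

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""


def build_prompt(question, chunks):
    """Number the chunks from 1 and fill them into the template."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "What's our remote-work policy?",
    ["Employees may work remotely three days a week.", "Managers approve exceptions."],
)
print(prompt)
```

Building the context block from a list means the number of chunks can vary per query, for example when the reranker returns fewer than five good candidates.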

7. Generation — the LLM does its thing

The augmented prompt goes to GPT-4, Claude, Gemini, Llama — whichever LLM you’ve picked. It generates an answer constrained by the context.

The point from the TL;DR bears repeating: the model isn’t smarter, the pipeline is.

8. Evaluation — is it working?

Two layers:

Retrieval evals. A labeled set of (question, ideal_chunk_ids). Measure recall@K, MRR, NDCG. If retrieval misses the right chunks, generation can’t recover.
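Two of those retrieval metrics fit in a dozen lines. The sketch below computes recall@K and MRR over a tiny invented eval set, where each entry pairs the ids your retriever returned with the ids a human labeled as ideal.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant hit, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# (retrieved ids in ranked order, labeled ideal ids); made-up data.
eval_set = [
    (["c7", "c2", "c9"], ["c2"]),
    (["c1", "c4", "c5"], ["c5", "c8"]),
]
mean_recall = sum(recall_at_k(r, rel, 3) for r, rel in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
print(f"recall@3={mean_recall:.2f}  MRR={mrr:.3f}")
```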

Faithfulness evals. Does the generated answer actually claim only what the retrieved chunks support? Use a judge model (smaller LLM that scores faithfulness) on a sample of outputs.

A typical baseline: 92%+ retrieval recall@10, 85%+ faithfulness score. Below that and users start to lose trust.

9. Real-world wrinkles

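Hybrid search. Combine vector search with BM25 keyword search. Vector handles semantics; BM25 handles exact-match terms (product codes, names, version numbers). Fuse the two result lists (normalize scores before summing, since BM25 and cosine scores are on different scales) or rerank the merged results.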
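One common way to merge the two result lists is reciprocal rank fusion (RRF), which uses only ranks and so sidesteps the problem of combining scores from different scales. This is one fusion option among several, not something the pipeline requires. The constant k=60 is the value commonly used with RRF, and the hit lists are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["c3", "c1", "c8"]  # ranked by embedding similarity
bm25_hits = ["c1", "c5", "c3"]    # ranked by keyword match
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Chunks that appear in both lists float to the top, which is usually what you want from hybrid search.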

Metadata filtering. Most queries should pre-filter by metadata (user_id, doc_type, date range) before vector search, not after. Filtering after means you might get zero results when the unfiltered top-K had nothing in your scope.
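The failure mode is easy to reproduce with a toy example. The chunks, scores, and the `doc_type` field below are invented; the point is only the order of the two operations.

```python
def top_k(scored_chunks, k):
    """Return the k highest-scoring chunks."""
    return sorted(scored_chunks, key=lambda c: c["score"], reverse=True)[:k]


def in_scope(chunk):
    """The user's filter: only policy documents."""
    return chunk["doc_type"] == "policy"


chunks = [
    {"id": "c1", "doc_type": "blog", "score": 0.95},
    {"id": "c2", "doc_type": "blog", "score": 0.91},
    {"id": "c3", "doc_type": "policy", "score": 0.80},
    {"id": "c4", "doc_type": "policy", "score": 0.72},
]

# Filter AFTER search: the top-2 are both blog posts, so nothing survives.
post_filtered = [c["id"] for c in top_k(chunks, 2) if in_scope(c)]

# Filter BEFORE search: top-2 is taken among in-scope chunks only.
pre_filtered = [c["id"] for c in top_k([c for c in chunks if in_scope(c)], 2)]

print(post_filtered, pre_filtered)
```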

Multi-hop questions. “Compare our 2023 and 2024 vacation policies.” Single-shot retrieval finds one or the other but rarely both. Multi-hop RAG breaks this into sub-questions, retrieves separately, then synthesizes.
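A rough sketch of the retrieve-separately step. In practice an LLM usually produces the sub-questions; here they are written by hand. `toy_retrieve` is a hypothetical word-overlap retriever standing in for your vector search.

```python
def multi_hop_retrieve(sub_questions, retrieve, k=3):
    """Retrieve per sub-question, then merge hits, dropping duplicates in order."""
    merged, seen = [], set()
    for sub_q in sub_questions:
        for chunk_id in retrieve(sub_q, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Invented corpus for illustration.
corpus = {
    "v23": "2023 vacation policy 20 days",
    "v24": "2024 vacation policy 25 days",
    "r24": "2024 remote work policy",
}


def toy_retrieve(question, k):
    """Hypothetical retriever: rank chunks by number of words shared with the question."""
    q = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda cid: len(q & set(corpus[cid].lower().split())),
        reverse=True,
    )
    return ranked[:k]


context_ids = multi_hop_retrieve(
    ["2023 vacation policy", "2024 vacation policy"], toy_retrieve, k=1
)
print(context_ids)
```

Both years end up in the context, so the generation step has what it needs to make the comparison.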

Updating the corpus. Use content hashes per chunk. Re-embed only changed chunks. Soft-delete removed docs (filter at query time), then compact periodically.
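The hash comparison might look like the sketch below. It assumes you keep a mapping of chunk id to content hash alongside the index. `plan_update` is a made-up helper that only decides what to do; the actual embed, upsert, and soft-delete calls depend on your vector DB.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(indexed_hashes, current_chunks):
    """Compare stored hashes with the current chunk texts.

    indexed_hashes: {chunk_id: hash} for chunks already in the index.
    current_chunks: {chunk_id: text} from the latest version of the corpus.
    Returns (ids to re-embed and upsert, ids to soft-delete).
    """
    to_upsert = [
        cid for cid, text in current_chunks.items()
        if indexed_hashes.get(cid) != content_hash(text)
    ]
    to_delete = [cid for cid in indexed_hashes if cid not in current_chunks]
    return to_upsert, to_delete


indexed = {
    "a": content_hash("old text"),
    "b": content_hash("same"),
    "c": content_hash("gone"),
}
current = {"a": "new text", "b": "same", "d": "brand new"}
to_upsert, to_delete = plan_update(indexed, current)
print(to_upsert, to_delete)
```

Only the changed chunk and the new chunk get re-embedded; the unchanged one costs nothing.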

10. What RAG is becoming

Pure vector retrieval is getting replaced by agentic RAG — the LLM decides what to retrieve, when, and from which source. It’s also being merged with structured retrieval (SQL, knowledge graphs) for hybrid factual + semantic search.

But the core pipeline — chunk, embed, retrieve, rerank, ground, generate, cite — is now table stakes.

🎯 Common interview questions

Q1. Why chunk documents instead of embedding the whole document?

Two reasons. (1) Embedding models have a context window — usually 512 to 8192 tokens. A long PDF won't fit. (2) Retrieval relevance is finer-grained than the document. A 50-page contract has one paragraph that answers a question; you want to retrieve that paragraph, not the whole contract. Chunking with overlap (say, 512 tokens with 50-token overlap) gives the embedder enough context per chunk while keeping retrieval precise.

Q2. Why use a cross-encoder reranker after vector search?

Vector search is fast but lossy — embeddings compress meaning into ~1500 floats and you're searching by approximate cosine similarity. The top-K from vector search has high recall but mediocre precision. A cross-encoder reranker takes the (query, candidate) pair as a single sequence and runs it through a transformer that outputs a single relevance score. Slower per-pair, but only run on top 50–100 candidates. Massive precision win.

Q3. How do you stop the LLM from making things up even with retrieval?

Three layers. (1) Prompt the model strictly — "Answer ONLY using the provided context. If the answer isn't in the context, say I don't know." (2) Require the model to emit citations in a structured format (`[[chunk_id]]`) — answers without citations get rejected. (3) Run a faithfulness check — does each claim in the answer trace back to a sentence in the retrieved chunks? Modern systems use a small judge model for this.
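Layer (2) is cheap to enforce in code. A minimal sketch, assuming citations use the `[[chunk_id]]` format and ids are alphanumeric; `check_citations` is a name made up for this example. Note that it checks only that citations exist and point at chunks you actually supplied, not that the cited chunk supports the claim. That is what the faithfulness check in layer (3) is for.

```python
import re

# Matches [[chunk_id]] where the id is letters, digits, or underscores (an assumption).
CITATION = re.compile(r"\[\[(\w+)\]\]")


def check_citations(answer, provided_ids):
    """Return (ok, cited_ids). ok is False when nothing is cited or an id is unknown."""
    cited = CITATION.findall(answer)
    ok = bool(cited) and all(cid in provided_ids for cid in cited)
    return ok, cited


provided = {"c1", "c2"}
good = check_citations("Employees may work remotely three days a week [[c2]].", provided)
uncited = check_citations("Employees may work remotely.", provided)
unknown = check_citations("Unlimited vacation [[c9]].", provided)
print(good, uncited, unknown)
```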

Q4. What's the right chunk size?

There's no universal right answer, but the sweet spot for prose is 256–512 tokens with 10–20% overlap. For code, chunk by function or class boundaries (structural chunking). For tables, keep each row + headers as a unit. The rule of thumb: small enough to be specific, large enough to carry context. Always measure retrieval quality on a labeled eval set before committing to a chunking strategy.

Q5. How do you handle updates: re-embed the whole corpus?

No. Use incremental indexing. Tag each chunk with a stable doc_id and a content hash. When a doc changes, recompute hashes, find changed chunks, re-embed and upsert only those. Vector DBs (Pinecone, Weaviate, pgvector with HNSW) support upserts natively. For deletions, soft-delete first (filter at query time), then run a periodic compaction job.
