RAG — Retrieval-Augmented Generation

Give LLMs access to your own data — how RAG works, vector embeddings, chunking strategies, re-ranking, and building production RAG pipelines.

Starting from Zero — A Physical Intuition

Before looking at pipeline architectures, let's understand RAG through a physical analogy:

The Closed-Book Exam (Standard LLM): Imagine sitting in a classroom taking a closed-book history exam. You must rely solely on the facts you memorized months ago during study sessions. If the exam asks about an event that occurred after you studied, you will either fail to answer or confidently guess the wrong information. This is an LLM responding from its frozen pre-trained weights.
The Open-Book Exam (RAG): Now, imagine the teacher allows you to bring a massive library of books. Since you can't read all 10,000 pages in the 60-minute exam window, you hire a fast assistant (the Retrieval system) who reads the exam question, runs to the bookshelves, pulls out the exact 3 pages containing the answers, and highlights them for you. You then read those 3 highlighted pages (the Context) and write a perfect answer. This is RAG.

The LLM Knowledge Problem

LLMs have a fixed knowledge cutoff — they don't know about your internal documents, your codebase, your company's Notion wiki, or anything that happened after training.

Problem:
  User: "What did our Q3 revenue report say about APAC growth?"
  LLM: "I don't have access to your Q3 report."
  
  OR worse — it hallucinates a confident-sounding answer.

Solution: RAG
  Retrieve the relevant document chunks at query time →
  Inject them into the LLM's context →
  LLM answers based on actual, up-to-date information

RAG (Retrieval-Augmented Generation) combines a retrieval system (search) with a generative model (LLM). It is often the most practical way to ground LLMs in private, current, or large-scale knowledge, because the index can be updated without retraining the model.

How RAG Works

INDEXING PHASE (done once, or on updates):

  Documents → Chunk → Embed → Store in Vector DB

  1. Load documents (PDFs, Notion, Confluence, GitHub, etc.)
  2. Split into chunks (e.g., 500 tokens with 50-token overlap)
  3. Generate embeddings (dense vectors) for each chunk
  4. Store chunks + their vectors in a vector database

RETRIEVAL PHASE (every query):

  User query → Embed query → Search vector DB → Retrieve top-K chunks
       ↓
  Inject retrieved chunks into LLM prompt (as context)
       ↓
  LLM generates answer grounded in the retrieved context

The LLM never sees the full knowledge base — only the most relevant pieces.
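
To make the two phases concrete, here is a minimal, dependency-free sketch. It is not a production design: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample chunks are made up for illustration.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# INDEXING PHASE: embed each chunk and keep (chunk, vector) pairs in memory.
chunks = [
    "APAC revenue grew 12 percent in Q3",          # made-up sample data
    "The office kitchen is restocked on Mondays",
    "EMEA revenue was flat in Q3",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# RETRIEVAL PHASE: embed the query, score every chunk, keep the top-K.
def retrieve_top_k(query: str, k: int = 2) -> list[str]:
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Inject the retrieved chunks into the prompt that would be sent to the LLM.
def build_prompt(query: str, k: int = 2) -> str:
    context = "\n".join(f"- {c}" for c in retrieve_top_k(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How did APAC revenue change in Q3"))
```

Only the two revenue chunks reach the prompt; the unrelated kitchen chunk is never shown to the model.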

Embeddings — The Core Primitive

An embedding is a dense vector (list of floating-point numbers) that represents the semantic meaning of text. Similar meanings → similar vectors → close in vector space.

Python

from openai import OpenAI
client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # 1536 dimensions, cheap
        # or: text-embedding-3-large    # 3072 dimensions, better
        input=text,
    )
    return response.data[0].embedding

# Example
vec1 = embed("Python is a programming language")
vec2 = embed("NumPy is a Python library for numerical computation")
vec3 = embed("The Eiffel Tower is in Paris")

# vec1 and vec2 (both about Python) should score a noticeably higher cosine
# similarity than vec1 and vec3; the exact values depend on the embedding model.
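
"Close in vector space" is usually measured with cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python, using tiny hand-made vectors in place of real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors for illustration, not real embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # exactly 0.0
```

The same function works on the lists returned by `embed`, so `cosine_similarity(vec1, vec2)` can be compared directly against `cosine_similarity(vec1, vec3)`.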

Common embedding models:

text-embedding-3-small (OpenAI) — fast, cheap, good quality
text-embedding-3-large (OpenAI) — best quality, more expensive
all-MiniLM-L6-v2 (Sentence Transformers, open-source) — great for self-hosted
voyage-large-2 (Voyage AI) — often best for code/technical content

Chunking Strategies

How you split documents significantly affects RAG quality:

Python

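# Older LangChain releases expose this class as `langchain.text_splitter`
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Fixed-size chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # measured in characters by default (roughly 80-100 English words)
    chunk_overlap=50,      # overlap keeps text that straddles a boundary intact in one chunk
    separators=["\n\n", "\n", " ", ""],  # try paragraph, then line, then word boundaries
)
# For sizes measured in tokens, build the splitter with
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=500, chunk_overlap=50)

chunks = splitter.split_text(document_text)  # document_text: your loaded document as a str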

# Each chunk becomes a separate record in the vector DB

Chunking strategies and tradeoffs:

Strategy	Good For	Watch Out
Fixed-size (e.g., 500 tokens)	Simple, consistent	May cut mid-sentence or mid-concept
Semantic (paragraph-based)	Natural boundaries	Variable chunk sizes
Hierarchical (doc → section → paragraph)	Complex documents	More complex retrieval
Sliding window	Dense technical content	Redundant overlap

Practical rules:

Shorter chunks (200-400 tokens): precise retrieval, less context per chunk
Longer chunks (800-1200 tokens): more context, may dilute relevance score
Always include metadata: document source, page number, created date, section title
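
As an illustration of the paragraph-based strategy combined with the metadata rule, the sketch below packs whole paragraphs into chunks and attaches a source and chunk index to each. It measures size in characters for simplicity, and `notes.md` is a hypothetical file name.

```python
def chunk_by_paragraph(text: str, source: str, max_chars: int = 200) -> list[dict]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []

    def flush() -> None:
        # Emit the paragraphs collected so far as one chunk with metadata.
        if current:
            chunks.append({
                "text": "\n\n".join(current),
                "metadata": {"source": source, "chunk_index": len(chunks)},
            })
            current.clear()

    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            flush()  # adding this paragraph would exceed the limit
        current.append(para)
    flush()
    return chunks

doc = "Intro paragraph.\n\nSecond paragraph with details.\n\nThird paragraph."
result = chunk_by_paragraph(doc, source="notes.md", max_chars=50)
```

Here the first two paragraphs fit together under the limit and the third starts a new chunk, so no paragraph is cut in half.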

Think it through like the interview

Don't just split text randomly — derive the correct chunk size based on context density and model limits.

Think it through: Chunking Strategies (Data Ingestion Engineering)

Problem: Design an ingestion pipeline for a 1,000-page engineering PDF. Decide on a chunking strategy to balance retrieval accuracy and context coherence.

1. Evaluate fixed-size chunking: "What happens if we split the PDF into fixed blocks of exactly 200 tokens without any overlap?"
2. Optimize with sliding-window overlap: "How does adding a 50-token overlap resolve this boundary problem? What are the tradeoffs?"
3. Select chunk size based on content density: "Should we use tiny chunks (100 tokens) or large chunks (1,000 tokens) for technical documentation containing API schemas and code snippets?"
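
The boundary problem behind the first two questions can be seen on a toy scale. The sketch below uses words as a rough stand-in for tokens and a made-up sentence; it checks whether the phrase "five attempts" survives intact in at least one window.

```python
def sliding_window(words: list[str], size: int, overlap: int = 0) -> list[list[str]]:
    """Fixed-size windows over a word list; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows

words = "the retry limit is five attempts per request by default".split()
phrase = ["five", "attempts"]

def contains_phrase(window: list[str]) -> bool:
    return any(window[i:i + 2] == phrase for i in range(len(window) - 1))

no_overlap = sliding_window(words, size=5, overlap=0)    # phrase is split across two windows
with_overlap = sliding_window(words, size=5, overlap=2)  # one window holds the whole phrase
```

The tradeoff is visible in the window counts: overlap recovers the phrase but produces three windows instead of two, which means more vectors to store and more duplicated text in retrieved context.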

Building a RAG Pipeline

Python

import os
from openai import OpenAI
import chromadb

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

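# 1. Initialize vector store (ChromaDB: local, easy to start with)
# chromadb.Client() keeps data in memory only; use chromadb.PersistentClient(path=...)
# if the index should survive a restart.
chroma = chromadb.Client()
collection = chroma.get_or_create_collection("knowledge_base")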

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding

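# 2. INDEX: add documents to the vector store
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Minimal word-based splitter; words are only a rough stand-in for tokens.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the text
    return chunks

def index_document(doc_id: str, text: str, metadata: dict):
    chunks = split_into_chunks(text, chunk_size=500, overlap=50)
    for i, chunk in enumerate(chunks):
        embedding = embed(chunk)
        collection.add(
            ids=[f"{doc_id}-chunk-{i}"],
            embeddings=[embedding],
            documents=[chunk],
            metadatas=[{**metadata, "chunk_index": i}]
        )

# Index your knowledge base
# (q3_report_text and api_docs_text are placeholders for text you have already loaded)
index_document("q3-report", q3_report_text, {"source": "Q3 Report", "year": 2024})
index_document("docs/api", api_docs_text, {"source": "API Docs", "version": "v2"})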

# 3. RETRIEVE: find relevant chunks for a query
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_embedding = embed(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    return [
        {"text": doc, "metadata": meta}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]

# 4. GENERATE: answer the query using retrieved context
def rag_answer(question: str) -> str:
    # Retrieve relevant chunks
    chunks = retrieve(question, top_k=5)

    # Build context string
    context = "\n\n---\n\n".join(
        f"[Source: {c['metadata']['source']}]\n{c['text']}"
        for c in chunks
    )

    # Generate answer with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant. Answer questions based ONLY on the provided context.
If the context doesn't contain enough information to answer, say so clearly.
Never make up information."""
            },
            {
                "role": "user",
                "content": f"""Context:
{context}

Question: {question}"""
            }
        ],
        temperature=0,  # factual retrieval → low temperature
    )
    return response.choices[0].message.content

# Usage
answer = rag_answer("What did the Q3 report say about APAC growth?")
print(answer)

Advanced RAG Techniques

Hybrid Search (BM25 + Vector)

Pure vector search can miss exact keyword matches such as acronyms, product codes, or error strings. Hybrid search combines keyword and semantic results:

Python

# BM25: keyword-based (finds "APAC" even if embedding misses it)
# Vector: semantic (finds "Asia Pacific" when query says "APAC")

# Reciprocal Rank Fusion (RRF) to combine results.
# bm25_search and vector_search are assumed helpers that return ranked
# results (best first), each with an .id attribute.
RRF_K = 60  # smoothing constant; 60 is the commonly used default

def hybrid_search(query: str, top_k: int = 5):
    keyword_results = bm25_search(query, top_k=top_k * 2)
    vector_results  = vector_search(query, top_k=top_k * 2)

    # RRF: each list contributes 1 / (k + rank), with ranks starting at 1
    scores = {}
    for results in (keyword_results, vector_results):
        for rank, result in enumerate(results, start=1):
            scores[result.id] = scores.get(result.id, 0) + 1 / (RRF_K + rank)

    # Fused top_k document ids, best first
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
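
To see how the fusion behaves, here is a self-contained sketch of RRF on two hand-written rankings. The document ids are hypothetical; in practice the lists would come from a BM25 index and a vector store.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
vector_ranking = ["doc_b", "doc_d", "doc_a"]   # e.g. from vector search

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
# doc_b: 1/62 + 1/61, doc_a: 1/61 + 1/63, doc_d: 1/62, doc_c: 1/63
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists rise to the top even when neither list ranks them first everywhere, and because only ranks are used, the BM25 and cosine scores never need to be put on a common scale.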

Re-ranking

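Retrieve a larger candidate set (for example, top-20) with fast vector search, then re-rank it with a cross-encoder. A cross-encoder reads the query and the document together, which is more accurate than comparing precomputed embeddings but too slow to run over the whole corpus: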

Python

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, final_k: int = 5):
    # Retrieve large candidate set
    candidates = retrieve(query, top_k=20)
    
    # Re-rank with cross-encoder (reads query+chunk together)
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    
    # Sort by cross-encoder score
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:final_k]]
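
The two-stage shape can be tried without downloading a model. In the sketch below, `overlap_score` is a deliberately crude, hypothetical stand-in for `reranker.predict` (it is not a cross-encoder); the point is only that any function scoring a (query, text) pair can reorder the first-stage candidates.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    final_k: int = 3,
) -> list[str]:
    """Score each (query, candidate) pair and keep the final_k best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep first-stage order
    return [text for _, text in scored[:final_k]]

def overlap_score(query: str, text: str) -> float:
    """Toy scorer: fraction of query words that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

# First-stage candidates in their original (imperfect) order; made-up sample text
candidates = [
    "quarterly revenue summary",
    "APAC growth slowed in Q3",
    "APAC office locations",
]
top = rerank("APAC growth in Q3", candidates, overlap_score, final_k=2)
```

The candidate that the first stage ranked second moves to the top, and the least relevant one is dropped.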

Metadata Filtering

Python

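# Only retrieve from a specific document and version.
# Chroma expects multiple conditions to be combined with an explicit operator such as $and.
results = collection.query(
    query_embeddings=[embed(query)],
    n_results=10,
    where={"$and": [{"source": "API Docs"}, {"version": "v2"}]},  # filter by metadata
)
# Range filters use comparison operators, e.g. where={"year": {"$gte": 2024}}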

Query Translation & Rewriting

Standard RAG struggles when the user's query is poorly formulated or complex. Query translation solves this:

Multi-Query Retrieval: Generate multiple variations of the user's query using an LLM. Search the vector DB for all variants and take the union of the results. This improves recall.
Sub-Query Decomposition: Split a complex query (e.g., "Compare the Q3 financial performance of Apple and Microsoft") into sub-queries ("Apple Q3 financial performance", "Microsoft Q3 financial performance"), run them independently, and merge.
HyDE (Hypothetical Document Embeddings): Ask the LLM to write a hypothetical answer to the query first, then use that answer's embedding to search the database. A hypothetical answer usually resembles the stored documents more closely than a short question does, so its embedding tends to land nearer the relevant chunks, even when the hypothetical answer itself contains factual errors.

Python

# HyDE conceptual code
def get_hyde_embedding(query: str) -> list[float]:
    # 1. Generate hypothetical response
    hypothetical_doc = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a paragraph answering: {query}"}],
        temperature=0.5
    ).choices[0].message.content
    
    # 2. Embed the hypothetical response
    return embed(hypothetical_doc)
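
Multi-query retrieval from the list above can be sketched in the same spirit. The function below takes the variant generator and the search function as parameters; `fake_variants`, `fake_search`, and the chunk ids are hypothetical stand-ins for an LLM call and a vector search.

```python
from typing import Callable

def multi_query_retrieve(
    query: str,
    generate_variants: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Search with the original query plus its variants; return the de-duplicated union."""
    seen: set[str] = set()
    merged: list[str] = []
    for q in [query, *generate_variants(query)]:
        for doc_id in search(q):
            if doc_id not in seen:  # keep the first occurrence only
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Stand-ins for illustration: a real system would call an LLM and a vector DB here.
def fake_variants(query: str) -> list[str]:
    return ["APAC revenue growth Q3", "Asia Pacific third quarter sales"]

fake_index = {
    "How did APAC do in Q3?": ["chunk-1", "chunk-2"],
    "APAC revenue growth Q3": ["chunk-2", "chunk-3"],
    "Asia Pacific third quarter sales": ["chunk-4", "chunk-1"],
}

def fake_search(query: str) -> list[str]:
    return fake_index.get(query, [])

merged = multi_query_retrieve("How did APAC do in Q3?", fake_variants, fake_search)
```

The union contains four distinct chunks where the original query alone found two, which is the recall gain the technique is after. Sub-query decomposition can reuse the same function by having `generate_variants` return sub-questions instead of paraphrases.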

Multi-Vector Retrieval (Parent-Child)

In standard RAG, the text you embed is identical to the text sent to the LLM. Multi-Vector Retrieval decouples them:

Child Chunks: Small, precise chunks (e.g., 100-200 tokens) are embedded and stored in the vector database. Because each small chunk covers a single idea, its embedding tends to match a specific query more precisely.
Parent Document: When a child chunk matches, the retriever fetches the larger parent chunk (e.g., 1000 tokens) or the full page containing the child.
Summary Retrieval: Generate a summary of each document, embed the summaries, and map matches to the full original documents.
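
The decoupling amounts to a lookup from matched child ids to parent text. A minimal sketch, assuming the vector search has already returned child ids in relevance order; all ids and section texts are hypothetical.

```python
# Parent store: the larger text that is actually sent to the LLM
parents = {
    "p1": "Full section on authentication: tokens, refresh flow, error codes.",
    "p2": "Full section on rate limiting: quotas, headers, retry guidance.",
}

# Each embedded child chunk records which parent it was cut from
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}

def fetch_parents(matched_child_ids: list[str]) -> list[str]:
    """Map matched child chunks to their parents, de-duplicated, in match order."""
    seen: set[str] = set()
    result: list[str] = []
    for child_id in matched_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # two children of one parent yield it once
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result

context = fetch_parents(["c2", "c3", "c1"])
```

De-duplication matters here: several children of the same parent often match one query, and sending the parent twice would waste context window.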

GraphRAG vs. Agentic RAG

As RAG matures, systems evolve from static lookup pipelines to dynamic architectures:

GraphRAG: Combines vector database search with a Knowledge Graph (entities like people, concepts, products as nodes; and their interactions as edges). It excels at global queries (e.g., "What are the recurring issues in all customer transcripts?"), where standard vector search fails because it only fetches localized snippets.
Agentic RAG: Replaces static retrieval with an autonomous LLM Agent that has access to search tools. The agent can:
1. Inspect search results.
2. Decide if the retrieved information is sufficient.
3. If not, rewrite the query and search again (multi-hop retrieval).
4. Resolve conflicting information from multiple documents before compiling the answer.
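
The control flow of steps 1 to 3 can be sketched as a bounded loop. In a real agent, `search`, `is_sufficient`, and `rewrite` would be tool and LLM calls; here they are hypothetical toy functions, and conflict resolution (step 4) is left out.

```python
from typing import Callable

def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    rewrite: Callable[[str, list[str]], str],
    max_steps: int = 3,
) -> list[str]:
    """Search, check sufficiency, and rewrite the query until done or out of steps."""
    query = question
    evidence: list[str] = []
    for _ in range(max_steps):
        for passage in search(query):          # 1. inspect search results
            if passage not in evidence:
                evidence.append(passage)
        if is_sufficient(question, evidence):  # 2. enough to answer?
            break
        query = rewrite(question, evidence)    # 3. otherwise rewrite and search again
    return evidence

# Toy stand-ins (made-up corpus) so the loop can be run end to end
toy_corpus = {
    "who approves APAC budgets": ["APAC budgets are approved by the regional director."],
    "who is the regional director": ["The regional director is listed in the org chart."],
}

def toy_search(query: str) -> list[str]:
    return toy_corpus.get(query, [])

def toy_is_sufficient(question: str, evidence: list[str]) -> bool:
    return len(evidence) >= 2

def toy_rewrite(question: str, evidence: list[str]) -> str:
    return "who is the regional director"

evidence = agentic_retrieve(
    "who approves APAC budgets", toy_search, toy_is_sufficient, toy_rewrite
)
```

The `max_steps` cap is the important design choice: without it, an agent that never judges its evidence sufficient would keep searching indefinitely.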

Common Failure Modes & Fixes

Problem	Cause	Fix
LLM ignores context, uses training data	Weak system prompt	Explicitly instruct "use ONLY the provided context"
Wrong chunks retrieved	Poor chunking / embedding mismatch	Use domain-specific embeddings; improve chunking
Context too long, LLM loses focus	Too many chunks retrieved	Reduce top-K, use re-ranking
Hallucination in generated answer	Retrieved context doesn't contain the answer	Add "I don't know" path; use grounding checks
Slow retrieval	Unindexed vector DB	Add HNSW index; use ANN (approximate nearest neighbor)
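
Two of the fixes above, limiting how many chunks reach the prompt and adding an "I don't know" path, can be combined in a small gate in front of the LLM call. This sketch assumes scores where higher means more similar (some vector stores return distances, where lower is better); the `0.5` threshold is an arbitrary placeholder to tune on your own data, and `generate` stands in for the LLM call.

```python
from typing import Callable

NO_ANSWER = "No relevant information found."

def select_context(
    scored_chunks: list[tuple[str, float]],
    min_score: float = 0.5,
    max_chunks: int = 3,
) -> list[str]:
    """Keep at most max_chunks chunks whose similarity score clears min_score."""
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_chunks]]

def answer_or_fallback(
    scored_chunks: list[tuple[str, float]],
    generate: Callable[[list[str]], str],
) -> str:
    """Call the generator only when some chunk is relevant enough."""
    context = select_context(scored_chunks)
    if not context:
        return NO_ANSWER  # skip the LLM entirely instead of inviting a guess
    return generate(context)

def fake_generate(context: list[str]) -> str:
    return f"Answer based on {len(context)} chunk(s)"

weak = answer_or_fallback([("a", 0.2), ("b", 0.1)], fake_generate)
strong = answer_or_fallback([("a", 0.9), ("b", 0.6), ("c", 0.3)], fake_generate)
```

Returning the fallback before generation is cheaper and more predictable than relying on the system prompt alone to make the model admit it lacks information.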

Practice

Basic: Build a RAG system over a set of your own markdown notes. Use ChromaDB + OpenAI embeddings. Ask questions about the notes.
Pipeline: Add metadata filtering, source citation in answers, and a "no relevant information found" fallback.
Advanced: Compare pure vector search vs hybrid search (BM25 + vector) on a technical document set. Measure retrieval quality.
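
For the Advanced exercise, "retrieval quality" needs a number. Two common choices are recall@k (what fraction of the relevant chunks appear in the top k) and reciprocal rank (one over the position of the first relevant chunk). A minimal sketch, assuming you have hand-labelled which chunk ids are relevant for each test query; the ids below are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]  # system output for one query, best first
relevant = {"d1", "d2"}               # hand-labelled ground truth

r_at_3 = recall_at_k(retrieved, relevant, k=3)  # only d1 is in the top 3 → 0.5
rr = reciprocal_rank(retrieved, relevant)       # first hit at rank 2 → 0.5
```

Averaging both metrics over the same set of labelled queries for pure vector search and for hybrid search gives a like-for-like comparison.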

Next: Agents — giving LLMs tools and autonomy.
