intermediate ~90 min updated 2026-06-01

LLM RAG Application Basics

Build a retrieval-augmented generation pipeline from scratch: chunk documents, embed them into a local Chroma vector store, retrieve relevant context, and answer questions with an LLM.

Objective

Implement document chunking, embedding, vector search, and grounded answer generation in a small Python RAG application backed by Chroma. Along the way you will see why retrieval quality, more than the choice of LLM, usually decides whether a RAG system answers correctly.

Prerequisites

Python 3.10 or newer
pip and the venv module
An Anthropic API key exported as ANTHROPIC_API_KEY (any chat-style LLM API works with minor changes to ask.py)
Basic Python skills; no ML background required
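
Before starting, it can help to confirm the two prerequisites that fail most often. The sketch below is not part of the lab; missing_prerequisites is a hypothetical helper name, and it only checks the Python version and whether the ANTHROPIC_API_KEY variable is set (not whether the key is valid).

```python
import os
import sys


def missing_prerequisites(env=os.environ, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The lab asks for Python 3.10 or newer.
    if tuple(version[:2]) < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    # ask.py relies on the Anthropic client reading this variable.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems


for problem in missing_prerequisites():
    print("MISSING:", problem)
```

The key is only needed from step 5 onward, so a "not set" message is harmless while you work through ingestion and search.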

Architecture

Ingestion: markdown documents are split into overlapping chunks, embedded with a local sentence-transformers model (no API cost), and stored in a persistent Chroma collection. Query: the user question is embedded with the same model, the top-k most similar chunks are retrieved, and the LLM answers using only that context.

 docs/*.md                                question
    |                                        |
    v                                        v
 chunker (500 chars,        same model   embed query
   100 overlap)                 |            |
    |                           v            v
 embeddings (all-MiniLM-L6-v2)        Chroma top-k search
    |                                        |
    v                                        v
 Chroma collection  ------------------>  context chunks
 (persistent, ./chroma)                      |
                                             v
                              LLM prompt: context + question
                                             |
                                             v
                                      grounded answer
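
To make the "top-k search" box concrete, here is a toy version of the query path in pure Python. It is a sketch, not what Chroma does internally: the 3-dimensional vectors are hand-made stand-ins for real embeddings, top_k and cosine are illustrative names, and it ranks by cosine similarity rather than by Chroma's distance metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, indexed, k=2):
    """indexed is a list of (chunk_text, vector); return the k most similar texts."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hand-made vectors standing in for embeddings (illustrative only).
index = [
    ("deploy window policy", [0.9, 0.1, 0.0]),
    ("payments rollback steps", [0.1, 0.9, 0.0]),
    ("redis idempotency keys", [0.0, 0.2, 0.9]),
]

print(top_k([0.8, 0.2, 0.1], index, k=2))
# -> ['deploy window policy', 'payments rollback steps']
```

The real pipeline differs only in scale: the vectors come from all-MiniLM-L6-v2, and Chroma uses an approximate index instead of sorting every chunk.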

Steps

1. Set up the project

mkdir rag-lab && cd rag-lab
python3 -m venv .venv && source .venv/bin/activate
pip install chromadb sentence-transformers anthropic
mkdir docs

2. Create a small knowledge base

cat > docs/runbook.md <<'EOF'
# Payments Service Runbook
The payments service runs in the prod-payments namespace with 6 replicas.
On-call escalation: page the payments-oncall rotation in PagerDuty.
If error rate exceeds 2 percent for 10 minutes, roll back with
"helm rollback payments" and open a SEV2 incident.
The service depends on Postgres (payments-db) and Redis for idempotency keys.
EOF

cat > docs/policy.md <<'EOF'
# Deployment Policy
Production deploys are allowed Monday to Thursday, 09:00-16:00 UTC.
Every deploy requires a green canary for 30 minutes at 5 percent traffic.
Database migrations must be backward compatible and ship one release
before the code that depends on them.
EOF

3. Build the ingestion script

# ingest.py
import glob
import chromadb
from chromadb.utils import embedding_functions

CHUNK_SIZE, OVERLAP = 500, 100

def chunk(text):
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i + CHUNK_SIZE])
        i += CHUNK_SIZE - OVERLAP
    return chunks

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma")
col = client.get_or_create_collection("kb", embedding_function=ef)

ids, texts, metas = [], [], []
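for path in sorted(glob.glob("docs/*.md")):  # sorted for a reproducible order
    with open(path, encoding="utf-8") as f:  # close the file and pin the encoding
        text = f.read()
    for n, c in enumerate(chunk(text)):
        ids.append(f"{path}-{n}")  # stable ID: re-running overwrites, not duplicates
        texts.append(c)
        metas.append({"source": path})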

col.upsert(ids=ids, documents=texts, metadatas=metas)
print(f"Indexed {col.count()} chunks from {len(set(m['source'] for m in metas))} files")

python ingest.py
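
To see what the chunker does to a file of a given size, the sketch below applies the same stepping rule as chunk() but returns character offsets instead of text. chunk_spans is a hypothetical helper, not part of ingest.py; the 361-character length is roughly the size of docs/runbook.md.

```python
CHUNK_SIZE, OVERLAP = 500, 100  # same values as ingest.py


def chunk_spans(length, size=CHUNK_SIZE, overlap=OVERLAP):
    """Return the (start, end) character offsets each chunk would cover."""
    spans, i = [], 0
    while i < length:
        spans.append((i, min(i + size, length)))
        i += size - overlap  # advance 400 characters, so neighbors share 100
    return spans


print(chunk_spans(361))   # [(0, 361)]: a short file becomes a single chunk
print(chunk_spans(1200))  # [(0, 500), (400, 900), (800, 1200)]
print(chunk_spans(450))   # [(0, 450), (400, 450)]: tail chunk repeats the overlap
```

Two things follow from this. Each sample document is shorter than 400 characters, so each produces exactly one chunk. And for files just over 400 characters, the final chunk lies entirely inside the previous one, which is worth knowing when you count chunks or tune OVERLAP.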

4. Test retrieval on its own

# search.py
import sys
import chromadb
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
col = chromadb.PersistentClient(path="./chroma").get_collection(
    "kb", embedding_function=ef
)
res = col.query(query_texts=[sys.argv[1]], n_results=2)
for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
    print(f"--- {meta['source']} (distance {dist:.3f})")
    print(doc[:200], "\n")

python search.py "when can we deploy to production?"
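
The distance printed by search.py is easier to judge once converted to a cosine similarity. The sketch below rests on two assumptions you should verify for your installed versions: that the collection uses Chroma's default squared-L2 distance, and that all-MiniLM-L6-v2 returns unit-length embeddings. Under those assumptions, distance = 2 - 2 * cosine. The helper names are illustrative.

```python
import math


def l2sq_to_cosine(distance):
    """Convert a squared-L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Check the identity on two arbitrary unit vectors.
a, b = normalize([1.0, 2.0, 3.0]), normalize([3.0, 1.0, 0.0])
dot = sum(x * y for x, y in zip(a, b))
print(abs(l2sq_to_cosine(squared_l2(a, b)) - dot) < 1e-12)  # True

# A distance like the 0.512 in the sample output maps to a similarity of 0.744.
print(round(l2sq_to_cosine(0.512), 3))
```

On this scale, 0 means identical direction and 2 means opposite, so lower distances are better matches.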

5. Wire retrieval into the LLM

# ask.py
import sys
import chromadb
from chromadb.utils import embedding_functions
import anthropic

question = sys.argv[1]

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
col = chromadb.PersistentClient(path="./chroma").get_collection(
    "kb", embedding_function=ef
)
res = col.query(query_texts=[question], n_results=3)
context = "\n\n".join(res["documents"][0])

prompt = (
    "Answer the question using ONLY the context below. "
    "If the context does not contain the answer, say you do not know.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
msg = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=300,
    messages=[{"role": "user", "content": prompt}],
)
print(msg.content[0].text)
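
One possible refinement of the prompt assembly in ask.py is sketched below. It is a variation, not the lab's code: build_prompt is a hypothetical function that labels each chunk with its source file and raises an error when retrieval returns nothing, so an empty context cannot reach the LLM unnoticed.

```python
def build_prompt(question, documents, metadatas):
    """Assemble a grounded prompt; fail loudly if retrieval returned nothing."""
    if not documents:
        raise ValueError("no context retrieved; run ingest.py and check ./chroma")
    # Tag every chunk with the file it came from so answers can be traced back.
    blocks = [
        f"[source: {meta['source']}]\n{doc.strip()}"
        for doc, meta in zip(documents, metadatas)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Stand-ins for res["documents"][0] and res["metadatas"][0] from col.query(...).
docs = ["Production deploys are allowed Monday to Thursday, 09:00-16:00 UTC."]
metas = [{"source": "docs/policy.md"}]
print(build_prompt("When can we deploy?", docs, metas))
```

In ask.py this would replace the context and prompt assignments, called as build_prompt(question, res["documents"][0], res["metadatas"][0]).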

6. Ask grounded and ungrounded questions

export ANTHROPIC_API_KEY=sk-ant-...
python ask.py "What should I do if the payments error rate goes above 2 percent?"
python ask.py "Who is the CEO of the company?"

The second answer should be a refusal: the knowledge base says nothing about a CEO, so the "ONLY the context" instruction in the prompt tells the model to say it does not know.
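
The refusal above depends entirely on the model following the prompt. A complementary check, not part of the lab, is to drop chunks that are too far from the question and skip the LLM call when none remain. The sketch below assumes a result dict shaped like the one col.query returns; MAX_DISTANCE is a made-up cutoff that you would tune by inspecting search.py output, and the sample distances are invented.

```python
MAX_DISTANCE = 1.2  # hypothetical cutoff; tune it against your own search.py output


def relevant_chunks(res, max_distance=MAX_DISTANCE):
    """Keep only retrieved chunks whose distance is at or below the cutoff."""
    return [
        doc
        for doc, dist in zip(res["documents"][0], res["distances"][0])
        if dist <= max_distance
    ]


# Hand-written dict shaped like a Chroma query result; the distances are made up.
fake_res = {
    "documents": [[
        "Production deploys are allowed Monday to Thursday.",
        "The payments service runs with 6 replicas.",
    ]],
    "distances": [[0.51, 1.63]],
}

print(relevant_chunks(fake_res))                    # keeps only the first chunk
print(relevant_chunks(fake_res, max_distance=0.3))  # [] -> answer "I do not know" without calling the LLM
```

A fixed cutoff is crude, because good and bad distances depend on the embedding model and the corpus, but it gives a second line of defense that does not rely on the model's obedience.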

Expected output

$ python ingest.py
Indexed 2 chunks from 2 files

$ python search.py "when can we deploy to production?"
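--- docs/policy.md (distance 0.512)
# Deployment Policy
Production deploys are allowed Monday to Thursday, 09:00-16:00 UTC...
(the docs/runbook.md chunk follows with a larger distance; exact distances vary by model and library version)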

$ python ask.py "What should I do if the payments error rate goes above 2 percent?"
If the error rate exceeds 2 percent for 10 minutes, roll back the release
with "helm rollback payments" and open a SEV2 incident. Escalate by paging
the payments-oncall rotation in PagerDuty.

$ python ask.py "Who is the CEO of the company?"
I do not know — the provided context does not contain that information.

Troubleshooting

First run is slow while the embedding model downloads: all-MiniLM-L6-v2 (~90 MB) is fetched once into a cache under ~/.cache. Wait for it; later runs load the model from disk and start much faster.
anthropic.AuthenticationError: ANTHROPIC_API_KEY is unset or wrong. echo $ANTHROPIC_API_KEY should print a key starting with sk-ant-.
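Retrieval returns the wrong chunk: the chunk size probably does not suit your docs. Tune CHUNK_SIZE/OVERLAP, delete ./chroma so stale chunks from the previous run do not linger (upsert only overwrites IDs that still exist), re-run ingest.py, and inspect distances with search.py before blaming the LLM.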
Collection kb does not exist: you ran ask.py before ingest.py, or from a different directory so the ./chroma path differs. Run everything from rag-lab/.
Answers ignore the context: the retrieved context is probably empty or off-topic. Print context before the API call and confirm the expected chunks were retrieved.
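
Several of the entries above come down to checking retrieval before blaming the LLM. One way to do that systematically is a small hit-rate check over questions whose source file you already know. This is a sketch under stated assumptions: hit_rate and fake_retrieve are hypothetical names, and the stub stands in for a real Chroma query so the example runs on its own.

```python
def hit_rate(cases, retrieve, k=2):
    """Fraction of questions whose expected source appears in the top-k results.

    cases: list of (question, expected_source) pairs.
    retrieve: callable (question, k) -> list of source paths, best match first.
    """
    hits = sum(1 for q, expected in cases if expected in retrieve(q, k)[:k])
    return hits / len(cases)


def fake_retrieve(question, k):
    """Stub retriever: ranks policy.md first only when the question says 'deploy'."""
    if "deploy" in question:
        ranking = ["docs/policy.md", "docs/runbook.md"]
    else:
        ranking = ["docs/runbook.md", "docs/policy.md"]
    return ranking[:k]


cases = [
    ("when can we deploy to production?", "docs/policy.md"),
    ("how do I roll back payments?", "docs/runbook.md"),
    ("how long must the canary stay green?", "docs/policy.md"),
]

print(hit_rate(cases, fake_retrieve, k=1))  # 2 of 3 questions hit at k=1
print(hit_rate(cases, fake_retrieve, k=2))  # 1.0
```

To run it against the real collection, pass a function that calls col.query(query_texts=[q], n_results=k) and returns the "source" value from each entry in res["metadatas"][0]. Re-run the check after every change to CHUNK_SIZE or OVERLAP.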

Cleanup

deactivate
cd .. && rm -rf rag-lab
unset ANTHROPIC_API_KEY