intermediate ~90 min updated 2026-06-01
LLM RAG Application Basics
Build a retrieval-augmented generation pipeline from scratch: chunk documents, embed them into a local Chroma vector store, retrieve relevant context, and answer questions with an LLM.
Objective
Implement document chunking, embedding, vector search, and grounded answer generation in a small Python RAG application backed by Chroma. You will see exactly why retrieval quality — not the LLM — usually decides whether a RAG system gives correct answers.
Prerequisites
- Python 3.10 or newer
- pip and the venv module
- An Anthropic API key exported as ANTHROPIC_API_KEY (any chat-completions LLM works with minor changes)
- Basic Python skills; no ML background required
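If you want to confirm the prerequisites before starting, a small preflight check along these lines may help. It is only a sketch: the file name `preflight.py` and the helper `check_prereqs` are made up for this lab and are not part of any library.

```python
# preflight.py (hypothetical helper, not part of the lab's required files)
import os
import sys

def check_prereqs(version_info, env):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems

if __name__ == "__main__":
    for problem in check_prereqs(sys.version_info, os.environ):
        print("missing:", problem)
```

The function takes the version and environment as arguments so it can be exercised without touching the real interpreter or shell.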
Architecture
Ingestion: markdown documents are split into overlapping chunks, embedded with a local sentence-transformers model (no API cost), and stored in a persistent Chroma collection. Query: the user question is embedded with the same model, the top-k most similar chunks are retrieved, and the LLM answers using only that context.
docs/*.md                              question
    |                                      |
    v                                      v
chunker (500 chars,              same model embed query
  100 overlap)                             |
    |                                      v
    v                              Chroma top-k search
embeddings (all-MiniLM-L6-v2)              |
    |                                      |
    v                                      v
Chroma collection ------------------> context chunks
(persistent, ./chroma)                     |
                                           v
                             LLM prompt: context + question
                                           |
                                           v
                                     grounded answer
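To make the "top-k most similar chunks" step concrete, here is a minimal pure-Python sketch of similarity search over toy vectors. It is an illustration of the idea only: Chroma maintains its own index and distance metric, and the names `cosine_similarity`, `top_k`, and `toy_index` are invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384-dimensional ones.
toy_index = {
    "policy-0": [0.9, 0.1, 0.0],
    "runbook-0": [0.1, 0.9, 0.1],
    "notes-0": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_index, k=2))  # ['policy-0', 'runbook-0']
```

The query vector points mostly along the first axis, so the chunk whose vector does the same ranks first. Real embeddings behave the same way, just in many more dimensions.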
Steps
1. Set up the project
mkdir rag-lab && cd rag-lab
python3 -m venv .venv && source .venv/bin/activate
pip install chromadb sentence-transformers anthropic
mkdir docs
2. Create a small knowledge base
cat > docs/runbook.md <<'EOF'
# Payments Service Runbook
The payments service runs in the prod-payments namespace with 6 replicas.
On-call escalation: page the payments-oncall rotation in PagerDuty.
If error rate exceeds 2 percent for 10 minutes, roll back with
"helm rollback payments" and open a SEV2 incident.
The service depends on Postgres (payments-db) and Redis for idempotency keys.
EOF
cat > docs/policy.md <<'EOF'
# Deployment Policy
Production deploys are allowed Monday to Thursday, 09:00-16:00 UTC.
Every deploy requires a green canary for 30 minutes at 5 percent traffic.
Database migrations must be backward compatible and ship one release
before the code that depends on them.
EOF
3. Build the ingestion script
# ingest.py
import glob
import chromadb
from chromadb.utils import embedding_functions

CHUNK_SIZE, OVERLAP = 500, 100

def chunk(text):
    """Split text into CHUNK_SIZE-character windows that overlap by OVERLAP."""
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i + CHUNK_SIZE])
        i += CHUNK_SIZE - OVERLAP
    return chunks

# Local embedding model; downloaded on first use, no API cost.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./chroma")
col = client.get_or_create_collection("kb", embedding_function=ef)

ids, texts, metas = [], [], []
for path in sorted(glob.glob("docs/*.md")):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for n, c in enumerate(chunk(text)):
        ids.append(f"{path}-{n}")  # stable ID, so re-running upserts instead of duplicating
        texts.append(c)
        metas.append({"source": path})

col.upsert(ids=ids, documents=texts, metadatas=metas)
print(f"Indexed {col.count()} chunks from {len(set(m['source'] for m in metas))} files")
python ingest.py
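To see where the chunk boundaries fall without running the embedding model, the following sketch reproduces the chunker's arithmetic on offsets alone. The helper name `chunk_spans` is hypothetical; the loop mirrors the `chunk` function in ingest.py.

```python
CHUNK_SIZE, OVERLAP = 500, 100  # same values as ingest.py

def chunk_spans(length, size=CHUNK_SIZE, overlap=OVERLAP):
    """Return the (start, end) character offsets the chunker would produce."""
    spans, i = [], 0
    while i < length:
        spans.append((i, min(i + size, length)))
        i += size - overlap  # each window starts 400 characters after the last
    return spans

print(chunk_spans(1200))  # [(0, 500), (400, 900), (800, 1200)]
print(chunk_spans(450))   # [(0, 450), (400, 450)]
```

Two things are worth noticing. A document shorter than 400 characters produces exactly one chunk, and a document between 400 and 500 characters produces a short tail chunk that is entirely contained in the first one.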
4. Test retrieval on its own
# search.py
import sys
import chromadb
from chromadb.utils import embedding_functions

# Must be the same embedding model that ingest.py used.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
col = chromadb.PersistentClient(path="./chroma").get_collection(
    "kb", embedding_function=ef
)
res = col.query(query_texts=[sys.argv[1]], n_results=2)
# Results are nested one list per query; [0] selects the only query.
for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
    print(f"--- {meta['source']} (distance {dist:.3f})")
    print(doc[:200], "\n")
python search.py "when can we deploy to production?"
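Because a lower distance means a closer match, one possible refinement is to drop hits that are too far from the query before they reach the LLM. The sketch below assumes the nested result shape that search.py reads; `filter_by_distance` and the cutoff of 1.0 are illustrative choices, not recommended values, so calibrate the threshold against your own data.

```python
def filter_by_distance(res, max_distance):
    """Keep (document, metadata, distance) hits at or below max_distance."""
    hits = zip(res["documents"][0], res["metadatas"][0], res["distances"][0])
    return [(doc, meta, dist) for doc, meta, dist in hits if dist <= max_distance]

# Hand-written stand-in for a col.query(...) result with one query and two hits.
fake_res = {
    "documents": [[
        "Production deploys are allowed Monday to Thursday.",
        "The payments service runs with 6 replicas.",
    ]],
    "metadatas": [[{"source": "docs/policy.md"}, {"source": "docs/runbook.md"}]],
    "distances": [[0.5, 1.4]],
}

for doc, meta, dist in filter_by_distance(fake_res, max_distance=1.0):
    print(meta["source"], dist)  # docs/policy.md 0.5
```

If the filter leaves no hits, that is a useful signal in itself: the knowledge base probably does not cover the question.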
5. Wire retrieval into the LLM
# ask.py
import sys
import chromadb
from chromadb.utils import embedding_functions
import anthropic

question = sys.argv[1]

# Retrieve: embed the question with the same model used at ingestion time.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
col = chromadb.PersistentClient(path="./chroma").get_collection(
    "kb", embedding_function=ef
)
res = col.query(query_texts=[question], n_results=3)
context = "\n\n".join(res["documents"][0])

# Augment: restrict the model to the retrieved context.
prompt = (
    "Answer the question using ONLY the context below. "
    "If the context does not contain the answer, say you do not know.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

# Generate: send the grounded prompt to the LLM.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
msg = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=300,
    messages=[{"role": "user", "content": prompt}],
)
print(msg.content[0].text)
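A possible variation is to move prompt assembly into a function that labels each chunk with its source file and fails loudly when nothing was retrieved. This is a sketch, not part of the lab's required code; `build_prompt` is a hypothetical helper that reuses the instruction text from ask.py.

```python
def build_prompt(question, docs, metas):
    """Assemble a grounded prompt, labelling each chunk with its source file."""
    if not docs:
        raise ValueError("no chunks retrieved; run ingest.py and check the query")
    blocks = [f"[{meta['source']}]\n{doc}" for doc, meta in zip(docs, metas)]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "When can we deploy?",
    ["Production deploys are allowed Monday to Thursday, 09:00-16:00 UTC."],
    [{"source": "docs/policy.md"}],
)
print(prompt)
```

In ask.py this would be called as `build_prompt(question, res["documents"][0], res["metadatas"][0])`. Source labels make it easier to check which file an answer came from.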
6. Ask grounded and ungrounded questions
export ANTHROPIC_API_KEY=sk-ant-...
python ask.py "What should I do if the payments error rate goes above 2 percent?"
python ask.py "Who is the CEO of the company?"
The second answer should be a refusal: neither document mentions a CEO, so the instruction to answer only from the context is doing its job as a grounding guardrail.
Expected output
$ python ingest.py
Indexed 2 chunks from 2 files
$ python search.py "when can we deploy to production?"
--- docs/policy.md (distance 0.512)
# Deployment Policy
Production deploys are allowed Monday to Thursday, 09:00-16:00 UTC...
$ python ask.py "What should I do if the payments error rate goes above 2 percent?"
If the error rate exceeds 2 percent for 10 minutes, roll back the release
with "helm rollback payments" and open a SEV2 incident. Escalate by paging
the payments-oncall rotation in PagerDuty.
$ python ask.py "Who is the CEO of the company?"
I do not know — the provided context does not contain that information.
Troubleshooting
- First run downloads a large model slowly: all-MiniLM-L6-v2 (~90 MB) is fetched once to ~/.cache. Wait for it; later runs load the model from the cache.
- anthropic.AuthenticationError: ANTHROPIC_API_KEY is unset or wrong. echo $ANTHROPIC_API_KEY should print a key starting with sk-ant-.
- Retrieval returns the wrong chunk: chunks are too large or too small for your docs. Tune CHUNK_SIZE/OVERLAP, re-run ingest.py, and inspect distances with search.py before blaming the LLM.
- Collection kb does not exist: you ran ask.py before ingest.py, or from a different directory so the ./chroma path differs. Run everything from rag-lab/.
- Answers ignore the context: the prompt assembly failed (empty context). Print context before the API call and confirm chunks were retrieved.
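When tuning CHUNK_SIZE and OVERLAP, it can help to measure retrieval directly instead of judging by the final answers. The sketch below computes a simple hit rate over labelled questions. The helper `hit_rate`, the `CASES` list, and the stand-in retriever are all hypothetical; the retriever is injected so the logic can be tried without a Chroma collection.

```python
def hit_rate(cases, retrieve):
    """Fraction of questions whose expected source appears in the retrieved sources."""
    hits = 0
    for question, expected_source in cases:
        if expected_source in retrieve(question):
            hits += 1
    return hits / len(cases)

# Hypothetical labelled questions for the two sample documents.
CASES = [
    ("When can we deploy to production?", "docs/policy.md"),
    ("How do I roll back payments?", "docs/runbook.md"),
]

def fake_retrieve(question):
    """Stand-in retriever that always returns the policy file."""
    return ["docs/policy.md"]

print(hit_rate(CASES, fake_retrieve))  # 0.5
```

Against the real collection, the retriever could return `[m["source"] for m in col.query(query_texts=[question], n_results=2)["metadatas"][0]]`. Re-run the measurement after each change to the chunking parameters and keep the setting with the higher hit rate.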
Cleanup
deactivate
cd .. && rm -rf rag-lab
unset ANTHROPIC_API_KEY