Our First RAG System Was Terrible (Here's What We Fixed)
Let me tell you about the first time I demoed our RAG system to the team.
Someone asked: “What’s the deployment process for Service X?” The system confidently returned an answer that described the deployment process for a completely different service. It cited sources — the citations were real documents; they just had nothing to do with the question. Everyone nodded politely, but I could see them mentally filing this under “AI hype that doesn’t work.”
That was version 1. It took six months of iteration to get to something people actually trust and use daily. Here’s what went wrong and what we changed.
Version 1: The Tutorial Approach
We followed a standard RAG tutorial. Load documents, chunk them into 200-token pieces, embed with OpenAI, store in ChromaDB, retrieve top-5 similar chunks, feed to Claude, get answer. Simple.
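If you’ve done that tutorial, you can picture the code. Since the snippets later in this post use LlamaIndex, here’s the V1 pipeline sketched the same way; the paths, collection name, and model ID are illustrative rather than our actual config:

```python
import chromadb
from llama_index.core import (
    Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.anthropic import Anthropic
from llama_index.vector_stores.chroma import ChromaVectorStore

Settings.llm = Anthropic(model="claude-3-5-sonnet-20241022")  # answers come from Claude

# Load everything, chunk into 200-token pieces, embed (OpenAI by default), store in Chroma.
documents = SimpleDirectoryReader("./docs").load_data()
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("docs")
storage_context = StorageContext.from_defaults(
    vector_store=ChromaVectorStore(chroma_collection=collection)
)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=200, chunk_overlap=0)],
)

# Retrieve the top-5 similar chunks and hand them to the LLM.
query_engine = index.as_query_engine(similarity_top_k=5)
```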
It worked okay for simple factual questions. “What’s the office WiFi password?” — nailed it. But anything requiring context or nuance was a coin flip. The main problems:
Chunks were too small. 200 tokens meant critical context got split. “The API key should be stored in…” ended up in one chunk. “…the environment variable GATEWAY_API_KEY” was in the next chunk. When someone asked about the API key, the system would retrieve the first chunk and hallucinate a wrong variable name.
Vector search misses exact terms. Someone searched “CORS error in the API gateway.” Vector search returned chunks about “cross-origin resource sharing configuration” but missed the chunk that literally said “CORS,” because embedding models favor semantic similarity over exact keyword matches. This drove me crazy.
No query understanding. Users ask vague questions. “How do I deploy?” — deploy what? To which environment? Using which tool? The system retrieved a grab bag of deployment-related chunks and Claude generated a generic, unhelpful answer.
Version 2: The Fixes That Mattered
Three changes took us from 62% user satisfaction to 85%.
Fix 1: Chunking Strategy
Switched from fixed 200-token chunks to:
- 500-token chunks with 50-token overlap. Enough context to be useful without too much noise
- Parent-document retrieval. We retrieve on small chunks (for precision) but pass the whole section to Claude (for context). This was the single biggest quality improvement (see the sketch after this list)
- Document-type-aware chunking. Technical docs get chunked by section headers. Confluence pages get chunked differently from markdown files
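To make the parent-document idea concrete, here’s a minimal sketch of the pattern using LlamaIndex’s hierarchical node parser and AutoMergingRetriever, which indexes only the small leaf chunks but swaps in their parent section once enough sibling chunks match. It’s the shape of what we do, not our exact pipeline; the chunk sizes shown are the parser’s defaults:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Parse documents into a hierarchy: whole sections down to small leaf chunks.
documents = SimpleDirectoryReader("./docs").load_data()
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Every node goes into the docstore (so parents can be looked up later),
# but only the small leaf chunks get embedded and searched.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
leaf_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# Matches happen on small chunks for precision; the retriever then merges
# them up into the parent section, which is what the LLM actually sees.
retriever = AutoMergingRetriever(
    leaf_index.as_retriever(similarity_top_k=12),
    storage_context,
)
```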
Fix 2: Hybrid Retrieval
Added BM25 keyword matching alongside vector search. The system now does both and merges results using reciprocal rank fusion.
```python
from llama_index.core.retrievers import QueryFusionRetriever

# Run both retrievers and merge their result lists with reciprocal rank fusion.
retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=10,
    mode="reciprocal_rerank",
)
```
This fixed the “CORS” problem immediately. Vector search finds semantically similar content. BM25 finds exact keyword matches. Together they cover each other’s blind spots. Retrieval recall went from 68% to 89%.
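If reciprocal rank fusion sounds fancier than it is: each document earns a score of 1/(k + rank) from every ranked list it appears in, and the scores are summed, so agreement between retrievers beats a single high rank. A toy version (k = 60 is the conventional default, not a value we tuned):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc that appears mid-list in both rankings outranks one that a single
# retriever put first: that's the blind-spot coverage we wanted.
fused = reciprocal_rank_fusion([
    ["cors-runbook", "gateway-config", "auth-overview"],  # BM25 (keyword) order
    ["sso-design", "gateway-config", "cors-runbook"],     # vector (semantic) order
])
```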
Fix 3: Query Expansion
Before searching, we use Claude Haiku (fast and cheap) to generate 2-3 alternative phrasings of the question. “How do I deploy?” becomes:
- “Docker deployment process for production”
- “CI/CD pipeline deployment steps”
- “Manual deployment guide and commands”
This retrieves a broader, more relevant set of documents. It adds about 200ms of latency per query, but the improvement in answer quality is worth it.
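For reference, the expansion step is only a few lines with the Anthropic SDK. The prompt wording and model ID below are illustrative rather than our exact production values:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def expand_query(question: str) -> list[str]:
    """Ask a small, fast model for 2-3 alternative phrasings of a search query."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative; any fast, cheap model works
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this internal-docs search query as 2-3 alternative "
                f"phrasings, one per line, no numbering or commentary:\n\n{question}"
            ),
        }],
    )
    rewrites = [line.strip() for line in response.content[0].text.splitlines() if line.strip()]
    # Always keep the original question so a well-phrased query can't get worse.
    return [question, *rewrites]
```

In a setup like ours, each phrasing can then be fed through the hybrid retriever and the fused results deduplicated before generation.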
What We’re Still Working On
The remaining 15% of bad answers fall into two buckets:
Stale content. When someone updates a Confluence page, the old embeddings still exist until the next re-index. We run re-indexing nightly, but sometimes answers are based on yesterday’s version of a doc.
Multi-document reasoning. “What are the differences between how Service A and Service B handle authentication?” requires reasoning across two separate docs. The retriever finds chunks from both, but Claude sometimes struggles to synthesize a coherent comparison. We’re experimenting with a two-stage approach: retrieve, then do a follow-up retrieval based on the initial results.
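Roughly, the two-stage experiment looks like the sketch below. It’s a direction we’re testing rather than settled code, and it assumes LlamaIndex-style retriever and LLM objects:

```python
def two_stage_retrieve(question: str, retriever, llm) -> list:
    """Retrieve once, ask the model what's still missing, then retrieve again."""
    first_pass = retriever.retrieve(question)

    # Let the model name the biggest gap, e.g. "Service B authentication"
    # when the first pass only surfaced Service A's docs.
    context = "\n\n".join(n.node.get_content()[:500] for n in first_pass)
    follow_up = llm.complete(
        f"Question: {question}\n\nRetrieved so far:\n{context}\n\n"
        "Reply with one short follow-up search query that would fill the "
        "biggest gap in the retrieved material."
    ).text.strip()

    second_pass = retriever.retrieve(follow_up)
    return first_pass + second_pass
```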
Numbers
| Metric | V1 | V2 |
|---|---|---|
| Retrieval recall@10 | 68% | 89% |
| Answer accuracy | 74% | 92% |
| Answer relevance | 71% | 88% |
| User satisfaction | 62% | 85% |
| Queries per day | ~30 | ~200 |
The query volume tells the real story. When the system gives good answers, people use it. When it doesn’t, they go back to asking colleagues in Slack.
What I’d Tell Someone Starting a RAG Project
- Fix retrieval before you upgrade models. Better retrieval has 10x the impact of a better LLM. Our improvements came entirely from retrieval changes. We didn’t change the model once.
- Build an evaluation set immediately. 200 question-answer pairs with known source documents; see the example after this list. Without this, you’re optimizing blind.
- Don’t embed everything. Outdated docs, irrelevant docs, duplicate docs — they all pollute your results. Curate your corpus.
- Hybrid retrieval is not optional. Pure vector search will fail you on exact terms, project names, error codes. Add keyword matching.
- Talk to your users. The thumbs-up/thumbs-down feedback button taught us more about our system’s weaknesses than any metric.
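On the evaluation-set point: the format matters far less than having one. Conceptually it’s just a list of records like the hypothetical entries below, plus a recall check you can rerun after every retrieval change:

```python
# Hypothetical entries; the real set is ~200 questions sampled from actual Slack asks.
EVAL_SET = [
    {
        "question": "What's the deployment process for Service X?",
        "expected_sources": ["confluence/service-x-runbook"],  # doc IDs that must be retrieved
        "reference_answer": "Deploys go through the CI pipeline per the runbook.",
    },
]

def recall_at_k(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of the expected source docs that appear in the retrieved set."""
    return sum(doc_id in retrieved_ids for doc_id in expected_ids) / len(expected_ids)
```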
RAG isn’t hard to build. RAG that people trust enough to actually use? That took us six months.