How do you improve RAG retrieval accuracy when it’s not retrieving the right documents?
To improve RAG retrieval accuracy, stop guessing and start measuring: build a small eval set, track recall@k, and inspect failed queries. The highest-impact fixes are usually hybrid retrieval (BM25 + vector), better chunking + metadata, domain-appropriate embedding models, and a reranker. Most “RAG hallucinations” are retrieval failures, not model failures.
Why this matters for 10–200 person teams
A RAG system that sounds confident and is wrong is worse than no RAG system.
- Internal knowledge bots teach employees the wrong process.
- Support bots cite the wrong policy and trigger unwarranted refunds.
- Sales enablement bots quote outdated pricing.
Most teams respond by swapping the LLM.
But in production, the LLM is often not the bottleneck. The retrieval layer is.
If you’re not retrieving the right evidence, you’re asking the model to improvise. That’s when you get “hallucinations.”
Actionable steps: a retrieval-first fix plan
1) Build a tiny eval set before you change anything
You can do this in a day.
- Pick 25–50 real questions users ask.
- For each question, label the “must-have” source docs or doc sections.
- Store these pairs as your gold set.
Now you can answer one crucial question: did retrieval improve?
Without this, you’ll ship changes based on vibes.
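The gold set can be as simple as a JSONL file of (question, must-have sources) pairs. A minimal sketch, with hypothetical doc IDs:

```python
import json

# A gold set is just (question, must-have sources) pairs.
# The doc IDs here are hypothetical examples.
gold_set = [
    {"question": "What is the refund window for annual plans?",
     "must_have_docs": ["billing-policy#refunds"]},
    {"question": "How do I reset SSO access?",
     "must_have_docs": ["it-handbook#sso-reset", "it-handbook#access"]},
]

with open("gold_set.jsonl", "w") as f:
    for item in gold_set:
        f.write(json.dumps(item) + "\n")
```

Plain JSONL keeps it diffable and easy to grow as new failure cases come in.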
2) Measure retrieval separately from generation
Start with simple metrics:
- Recall@k: did the correct doc appear in the top k retrieved chunks?
- MRR: how high was the first correct hit?
- Coverage: how many questions have any correct chunk retrieved?
If recall@k is low, you don’t have a generation problem. You have a search problem.
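Both metrics are a few lines of code once you have ranked doc IDs per query and the gold relevant sets. A minimal sketch:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries where at least one relevant doc appears in the top k."""
    hits = sum(1 for r, rel in zip(retrieved, relevant)
               if any(doc in rel for doc in r[:k]))
    return hits / len(retrieved)

def mrr(retrieved, relevant):
    """Mean reciprocal rank of the first relevant hit per query (0 if no hit)."""
    total = 0.0
    for r, rel in zip(retrieved, relevant):
        for rank, doc in enumerate(r, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Two queries: ranked doc IDs from the retriever vs. gold relevant sets.
retrieved = [["d3", "d1", "d7"], ["d9", "d2", "d5"]]
relevant = [{"d1"}, {"d4"}]
print(recall_at_k(retrieved, relevant, 3))  # 0.5: only query 1 has a hit
print(mrr(retrieved, relevant))             # 0.25: hit at rank 2, then no hit
```

Run these over the gold set after every retrieval change; if the numbers don't move, the change didn't help.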
3) Fix chunking and metadata, because most teams chunk blindly
Chunking mistakes create “lost context.”
Rules of thumb that actually help:
- Chunk by structure: headings, sections, table rows, FAQ entries
- Keep references with their definitions
- Include metadata fields that users implicitly query: product, region, effective date, doc type
- Store canonical URLs and timestamps so you can cite and filter
If you chunk a policy into random 500-token blobs, you’re making the retriever’s job harder than it needs to be.
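Structure-aware chunking can start as simply as splitting on headings and attaching the doc-level metadata to every chunk. A minimal sketch for markdown docs (the URL and field names are illustrative):

```python
import re

def chunk_by_headings(markdown_text, metadata):
    """Split a markdown doc at headings so each chunk is one coherent section,
    carrying shared metadata (product, region, effective date, URL)."""
    sections = re.split(r"\n(?=#{1,3} )", markdown_text)
    chunks = []
    for section in sections:
        heading = section.splitlines()[0].lstrip("# ").strip()
        chunks.append({"heading": heading, "text": section.strip(), **metadata})
    return chunks

doc = "# Refund Policy\nAnnual plans...\n## Exceptions\nEnterprise..."
chunks = chunk_by_headings(doc, {"product": "billing", "region": "US",
                                 "effective_date": "2024-01-01",
                                 "url": "https://example.com/refunds"})
```

Each chunk now carries enough context to be cited and filtered on its own, instead of being an anonymous 500-token blob.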
4) Use hybrid retrieval: BM25 + vector
Vector search is good at semantic similarity. BM25 is good at exact terms, IDs, and rare keywords.
In real org docs, you have both:
- exact: plan names, ticket IDs, error codes, SKU numbers
- semantic: “how do I reset access?”
Hybrid retrieval is often the simplest “big win” when users say: “it keeps missing the obvious doc.”
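One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs no score normalization between BM25 and vector distances. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked doc-ID lists (e.g. one from BM25, one from a vector index)
    by summing 1 / (k + rank) per doc; k=60 is a commonly used default."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["d7", "d2", "d9"]  # exact-term matches (IDs, SKUs, error codes)
vector_hits = ["d2", "d4", "d7"]  # semantic matches
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Docs that appear high in both lists ("d2" here) float to the top, which is exactly the behavior you want for the "obvious doc" case.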
5) Rerank the top results
Most retrieval stacks retrieve fast and rank shallow.
A reranker does the opposite: it spends more compute to order the top candidates correctly.
Practical approach:
- retrieve top 50 with hybrid
- rerank to top 5 with a cross-encoder or LLM-based reranker
- pass only top 3–5 chunks to the generator
This improves accuracy and cuts token cost.
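The retrieve-then-rerank shape looks like this. The scorer below is a stand-in lexical-overlap function for illustration only; in production you would plug in a cross-encoder or LLM judge:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Order candidate chunks by a (query, chunk) relevance score, keep top_n.
    score_fn is where a cross-encoder or LLM-based reranker would go."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

def overlap_score(query, chunk):
    """Toy stand-in scorer: fraction of query words found in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)

candidates = ["reset your SSO access via the admin panel",
              "quarterly pricing update for enterprise plans",
              "how to reset access when SSO fails"]
top = rerank("how do I reset access", candidates, overlap_score, top_n=2)
```

The key design choice is the split: a cheap retriever casts a wide net (top 50), and the expensive scorer only runs on that shortlist.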
6) Revisit embedding model choice with your domain in mind
This is the part people avoid because it feels “researchy.”
But if your data is domain-specific, generic embeddings can underperform.
Examples:
- legal language
- medical docs
- manufacturing maintenance logs
- CRM field naming conventions
If your embedding model doesn’t represent your domain well, retrieval accuracy plateaus no matter how much you tune chunk size.
7) Add filters to prevent “technically relevant, practically wrong” hits
A classic failure:
The retriever finds the right concept, but from the wrong time or product.
Add metadata filters:
- effective_date range
- product line
- region
- doc status: draft vs published
The goal is to constrain retrieval to the correct universe.
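Assuming chunks carry the metadata fields from step 3, the filter is straightforward. The field names and schema here are illustrative:

```python
from datetime import date

def filter_chunks(chunks, product=None, region=None, as_of=None,
                  status="published"):
    """Constrain the candidate pool to the correct universe before ranking."""
    out = []
    for c in chunks:
        if product and c["product"] != product:
            continue
        if region and c["region"] != region:
            continue
        if status and c["status"] != status:
            continue
        if as_of and date.fromisoformat(c["effective_date"]) > as_of:
            continue
        out.append(c)
    return out

chunks = [
    {"id": "p1", "product": "billing", "region": "US",
     "status": "published", "effective_date": "2023-06-01"},
    {"id": "p2", "product": "billing", "region": "US",
     "status": "draft", "effective_date": "2024-03-01"},
]
live = filter_chunks(chunks, product="billing", region="US",
                     as_of=date(2024, 1, 1))
```

Most vector databases support equivalent pre-filtering natively, which is usually faster than filtering after retrieval.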
8) Log every retrieval failure and make it a backlog item
Create a dashboard of:
- queries with low confidence
- queries where the user immediately re-asks
- queries that triggered a fallback
Then review them weekly. This becomes your RAG maintenance loop.
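The logging itself can start as an append-only JSONL file; the dashboard comes later. A minimal sketch, where the confidence threshold and schema are illustrative choices:

```python
import json
import time

def log_retrieval(query, retrieved_ids, top_score, threshold=0.3,
                  path="retrieval_log.jsonl"):
    """Append one retrieval event; low-confidence hits get flagged for the
    weekly review. The 0.3 threshold is an illustrative starting point."""
    event = {"ts": time.time(), "query": query, "retrieved": retrieved_ids,
             "top_score": top_score, "needs_review": top_score < threshold}
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event

event = log_retrieval("reset sso", ["d7"], top_score=0.12)
```

Flagging at write time means the weekly review is a one-line filter on `needs_review` instead of a data-archaeology project.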
A simple debugging table
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| “It misses obvious docs” | Poor chunking/metadata; no BM25 | Structure-aware chunking; hybrid retrieval |
| “It gets the idea but cites the wrong doc” | No reranker; weak embedding model | Reranker; domain embeddings |
| “It finds outdated policy” | No date/status filters | Metadata filters |
| “It answers confidently with no evidence” | Retrieval empty / low recall | Enforce citation requirement + fallback |
What most teams get wrong
They treat RAG as a prompt problem
You can write the perfect prompt. It won’t matter if the evidence is wrong.
The contrarian truth: RAG quality is mostly search engineering, not “prompt engineering.”
They increase k and stuff the context window
More chunks can reduce missed hits, but they also add noise.
Noise creates:
- wrong citations
- diluted evidence
- higher cost
Better retrieval beats bigger context.
They benchmark with toy examples
Your RAG works on the docs you picked. Then it fails on the docs your company actually uses: messy PDFs, tickets, and half-updated SOPs.
Use real questions. Real docs. Real failure cases.
Bottom line
If your RAG system gives wrong answers, assume retrieval is the culprit until proven otherwise.
Measure recall@k, fix chunking + metadata, use hybrid retrieval, rerank, and choose embeddings that match your domain. Then your model has a fair shot.
If you want help building a retrieval eval harness and a production-grade RAG pipeline that’s accurate and cheap, book a call: https://calendar.app.google/fvvhoEcfBzupGyC27