Most retrieval-augmented generation systems we audit in 2026 look the same. Pinecone or pgvector. A sensible chunking strategy. Embeddings from a reputable provider. A prompt that stitches the top-k results into context.
And nearly every one of them produces answers that are subtly wrong. Not hallucinations — the LLM is grounded fine. The retrieval itself is returning the wrong chunks.
The real mistake
In almost every case, the system was never evaluated end-to-end with real questions from real users. It was tested only with hand-picked examples the developer wrote while building it. Those examples pass. The actual questions users ask often don’t.
This isn’t a failure of the vector DB. It’s a failure of the evaluation loop. Without one, there’s no feedback signal that tells you your retrieval quality is 62% when you assumed it was 90%.
What good evaluation looks like
The minimum viable version is boring and manual:
- Collect 50-100 real questions your system will face. Not your questions. Actual user questions. If you haven’t launched, interview five potential users and write down exactly what they’d ask.
- For each question, manually identify which chunks of your source material SHOULD be retrieved to answer it correctly.
- Run the retrieval step alone on each question. Check: did the top-3 results include the correct chunks?
- If the hit rate is below 85%, your problem is retrieval, not the LLM. Fix that before you tune the prompt.
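If you want to automate the last two steps, a minimal harness is enough. This is a sketch, not a framework: `eval_set` and `retrieve` are placeholders for your own labelled questions and your own retrieval function.

```python
def hit_rate(eval_set, retrieve, k=3):
    """Fraction of questions where at least one required chunk appears in the top-k.

    eval_set: list of {"question": str, "relevant_chunk_ids": set} - the chunks
    you manually flagged as required for each question.
    retrieve(question, k): your retrieval step, returning the top-k chunk IDs.
    """
    hits = 0
    for item in eval_set:
        retrieved = set(retrieve(item["question"], k=k))
        if retrieved & item["relevant_chunk_ids"]:
            hits += 1
    return hits / len(eval_set)

# Usage, with your own labelled questions and retrieval function:
#   score = hit_rate(eval_set, retrieve, k=3)
#   print(f"top-3 hit rate: {score:.0%}")   # below 0.85? fix retrieval first
```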
The whole exercise takes a day. Most teams skip it entirely. That’s why most RAG systems in production are quietly worse than they should be.
Where retrieval usually breaks
Three patterns we see repeatedly:
Chunks too big. 2000-token chunks feel “complete” but kill embedding specificity. A single 2000-token chunk might cover five topics, and the embedding averages them into vagueness. Smaller is usually better — 300-500 tokens with 20% overlap is our default.
Chunks too small. Going the other way, 50-token chunks fragment context. The model gets the right chunk back but can’t reason across the boundaries. Test both directions with your actual content.
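For reference, a token-window chunker along these lines is only a few lines of Python. This is a sketch rather than production code; it assumes tiktoken’s cl100k_base encoding, and the sizes are just the defaults mentioned above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, chunk_tokens: int = 400, overlap: float = 0.2) -> list[str]:
    """Split text into fixed-size token windows with fractional overlap.

    400 tokens with 20% overlap means each window starts 320 tokens after
    the previous one, so sentences near a boundary land in two chunks.
    """
    tokens = enc.encode(text)
    step = max(1, int(chunk_tokens * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```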
Wrong distance metric. Cosine similarity is the default, but for some domains (long technical docs especially) it underperforms dot-product or hybrid search. Check whether your vector DB lets you swap it and A/B test on your eval set.
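And if your vector store lets you export raw embeddings, the cosine-versus-dot-product comparison is easy to run offline against the same eval set. A rough numpy sketch (variable names are illustrative):

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3,
          metric: str = "cosine") -> np.ndarray:
    """Return indices of the k highest-scoring chunks under the chosen metric."""
    if metric == "cosine":
        q = query_vec / np.linalg.norm(query_vec)
        d = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
        scores = d @ q
    elif metric == "dot":
        scores = chunk_vecs @ query_vec
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(scores)[::-1][:k]

# Run your hit-rate check once per metric over the same eval set and keep the winner.
```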
A specific build we shipped
A regulated UK services client needed an AI assistant grounded in 18,000 characters of legislation. First version: standard RAG, 1500-char chunks, cosine similarity, Pinecone. Eval hit rate: 73%.
Changes: 400-char chunks with 80-char overlap. Hybrid search (BM25 + embeddings, weighted 60/40). Re-ranking the top 10 with a cross-encoder before sending top-3 to Claude. New hit rate: 91%.
Same LLM. Same prompt. Same content. The entire quality jump came from the retrieval layer and the eval loop that told us where we actually were.
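For the curious, the retrieval step ended up looking roughly like the sketch below. It isn’t the client’s code: the libraries (rank_bm25, sentence-transformers), the cross-encoder model name, and the min-max rescaling are stand-ins we’d reach for by default, and `chunks`, `chunk_vecs`, and `embed()` are placeholders for your own corpus and embedding call.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Assumed inputs: chunks (list[str]), chunk_vecs (unit-normalised embeddings,
# one row per chunk), and embed(text) -> np.ndarray for query embeddings.

bm25 = BM25Okapi([c.lower().split() for c in chunks])
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # any cross-encoder works

def retrieve(question: str, k: int = 3) -> list[str]:
    # Lexical and dense scores, each rescaled to [0, 1] before mixing 60/40.
    lexical = np.asarray(bm25.get_scores(question.lower().split()))
    q = embed(question)
    dense = chunk_vecs @ (q / np.linalg.norm(q))

    def rescale(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    combined = 0.6 * rescale(lexical) + 0.4 * rescale(dense)
    top10 = np.argsort(combined)[::-1][:10]

    # Cross-encoder re-rank of the top 10, keep the best k for the prompt.
    ce_scores = reranker.predict([(question, chunks[i]) for i in top10])
    best = [top10[i] for i in np.argsort(ce_scores)[::-1][:k]]
    return [chunks[i] for i in best]
```

Rescaling before mixing matters: BM25 and similarity scores live on different ranges, and without it the 60/40 weights don’t mean what you think they mean.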
The short version
If you’re building RAG in 2026, the vector DB is 20% of the work. The evaluation loop is 50%. The prompt is the remaining 30%. Most teams invert that split and wonder why their system feels off.
If you’ve built RAG and haven’t evaluated it, it’s probably broken in ways you can’t see yet. Happy to take a look if you want a second opinion.