The Retrieval Problem

Your semantic search retrieves 20 documents for a user query. When you manually review them, only 3-5 are actually relevant. The AI generates an answer from all 20, irrelevant ones included, producing a response that's diluted, unfocused, and sometimes outright wrong.

The problem isn't your embeddings. It's that you're missing a critical second pass: re-ranking.

How Re-Ranking Works

A re-ranking model (also called a cross-encoder) takes each retrieved document and scores it against the original query using full cross-attention. Unlike the embedding model, which encodes queries and documents independently, a re-ranker processes the two texts together, giving it a much richer understanding of relevance.
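To make that concrete, here is a minimal sketch of cross-encoder scoring using the sentence-transformers library. The checkpoint name is one common public re-ranking model, not a requirement, and the query and documents are invented for illustration:

```python
from sentence_transformers import CrossEncoder

# One common public re-ranking checkpoint; any cross-encoder
# model is used the same way.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the statute of limitations for contract disputes?"
docs = [
    "Contract claims must generally be filed within four years.",
    "The cafeteria menu this week includes soup and sandwiches.",
]

# Each (query, document) pair is scored jointly, with full
# cross-attention across both texts.
scores = reranker.predict([(query, doc) for doc in docs])
# The relevant document receives a much higher score than the other.
```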

The pipeline becomes: (1) Retrieve top-K documents via embedding similarity, (2) Re-score each document using the cross-encoder, (3) Pass only the top-N re-ranked documents to the LLM for answer generation.
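A minimal version of that three-step pipeline might look like the sketch below. The `retrieve` and `generate` functions are hypothetical stand-ins for your embedding search and LLM call:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def answer(query, retrieve, generate, k=25, n=5):
    # (1) Wide retrieval via embedding similarity. `retrieve` is a
    # hypothetical stand-in for your vector search; it returns a
    # list of document strings.
    candidates = retrieve(query, top_k=k)

    # (2) Re-score every candidate jointly with the query.
    scores = reranker.predict([(query, doc) for doc in candidates])

    # (3) Keep only the top-N documents and pass them to the LLM.
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    top_docs = [doc for _, doc in ranked[:n]]

    # `generate` is a hypothetical stand-in for your LLM call.
    return generate(query, top_docs)
```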

Why Not Just Use Better Embeddings?

Embedding models are optimized for speed: a search has to scan millions of vectors in milliseconds. That forces a bi-encoder architecture, which trades some accuracy for throughput. Re-rankers don't face this constraint because they only ever see 20-50 candidate documents, not millions, so they can afford to be slower and more accurate.
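The contrast is visible in code. With a bi-encoder, every document is embedded once, offline, and search reduces to cheap vector comparisons; only the cross-encoder pass touches the query and document together. A sketch of the bi-encoder side, again with sentence-transformers (model name and texts are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Contract claims must generally be filed within four years.",
    "The cafeteria menu this week includes soup and sandwiches.",
]

# Documents are embedded independently of any query, so these
# vectors can be precomputed and stored in a vector index.
doc_vecs = embedder.encode(docs)

# At query time only the query is embedded; relevance is then a
# cheap cosine comparison, which is what makes millisecond search
# over millions of documents possible.
query_vec = embedder.encode("statute of limitations for contract disputes")
similarities = util.cos_sim(query_vec, doc_vecs)
```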

Think of it as a funnel: embedding search is the wide mouth that catches everything plausible, and re-ranking is the narrow neck that lets through only what matters.

Production Results

35% accuracy improvement · +80ms added latency · zero other changes needed

In our testing across legal and pharma deployments, adding a re-ranking step improved answer accuracy by 35% with only ~80ms of additional latency. No changes to embeddings, chunking, or the LLM itself were required.

Implementation Tips

  • Retrieve 20-30 documents in the initial search, then re-rank down to 5-8 for the LLM context window.
  • The re-ranking model can run on CPU — no GPU required for this step.
  • Set a minimum relevance threshold on re-ranking scores. If no document passes the threshold, return "I don't have enough information" instead of hallucinating (see the sketch after this list).
  • Log re-ranking scores alongside query results for ongoing quality monitoring.
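
For the threshold and logging tips, a minimal sketch of the fallback logic is below. The cutoff value is illustrative and should be calibrated on your own score distribution, since different cross-encoders emit scores on different scales; `generate` is again a hypothetical LLM call:

```python
MIN_RELEVANCE = 0.3  # illustrative; calibrate on your own score distribution

def answer_or_refuse(query, candidates, reranker, generate, n=5):
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)

    # Log scores alongside the query for ongoing quality monitoring.
    for score, doc in ranked:
        print(f"{query}\t{score:.3f}\t{doc[:60]}")

    # Keep only the top-N documents that clear the relevance bar.
    kept = [doc for score, doc in ranked[:n] if score >= MIN_RELEVANCE]
    if not kept:
        # Refuse rather than hallucinate from irrelevant context.
        return "I don't have enough information to answer that."
    return generate(query, kept)
```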