2026-05-26 · 2 min read

A retriever for agents, not eyes

A search box is built for a human who scans one page of results. A retriever for an agent is a function it calls in a loop. That one difference changes the whole pipeline — and how you prove it got better.

A human issues one query, scans a page, and stops. An agent calls retrieval in a loop — each result changes what it looks for next — until it decides it has enough. So the retriever's job isn't a results page. It's a clean, callable function that returns a summary the model can read directly plus structured pages it doesn't have to scrape.
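A minimal sketch of that shape, under loud assumptions: `Page`, `RetrievalResult`, `retrieve`, and `run_agent` are hypothetical names, the corpus is three toy strings, and word overlap stands in for a real index. The `next_query` callback stands in for the model deciding what to look for next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Page:
    # Structured fields the agent can use without scraping anything.
    url: str
    title: str
    text: str


@dataclass
class RetrievalResult:
    summary: str                                  # short text the model reads directly
    pages: List[Page] = field(default_factory=list)


# Toy in-memory corpus (hypothetical); a real system would sit on an index.
CORPUS = [
    Page("doc://1", "Bi-encoders", "A bi-encoder embeds documents ahead of time."),
    Page("doc://2", "Cross-encoders", "A cross-encoder scores each query-document pair."),
    Page("doc://3", "McNemar", "McNemar's test compares two paired classifiers."),
]


def retrieve(query: str, k: int = 2) -> RetrievalResult:
    """Stand-in retriever: ranks pages by word overlap with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(p.text.lower().split())), p) for p in CORPUS]
    ranked = sorted(scored, key=lambda sp: -sp[0])
    hits = [p for score, p in ranked if score > 0][:k]
    summary = "; ".join(p.title for p in hits) or "no results"
    return RetrievalResult(summary=summary, pages=hits)


def run_agent(first_query: str,
              next_query: Callable[[RetrievalResult], Optional[str]],
              max_steps: int = 5) -> List[RetrievalResult]:
    """Calls retrieve in a loop until next_query returns None ("I have enough")."""
    results: List[RetrievalResult] = []
    query: Optional[str] = first_query
    for _ in range(max_steps):
        result = retrieve(query)
        results.append(result)
        query = next_query(result)   # each result shapes the next query
        if query is None:
            break
    return results
```

The `max_steps` cap is a design choice of this sketch, not something the loop needs in principle: it keeps a confused agent from calling retrieval forever.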

Once you accept that, the architecture writes itself, and it's forced by physics, not taste. You cannot run a heavy model over the whole corpus. A bi-encoder embeds everything ahead of time and pulls the top ~100 candidates in milliseconds; a cross-encoder then re-reads each of those 100 against the query and reorders them.

Fig 1: corpus (millions–billions) → bi-encoder (recall top-100, ~5ms) → cross-encoder (rerank, ~50ms) → top 3–5 to the LLM (what it reasons over). Retrieve wide with a cheap model, score narrow with an expensive one. The funnel is the point.
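The funnel can be sketched in a few lines. Everything here is a toy under stated assumptions: `recall_stage` and `rerank_stage` are hypothetical names, the two-dimensional vectors are made up, the brute-force sort stands in for an approximate nearest-neighbor index, and `overlap_score` stands in for a real cross-encoder.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def recall_stage(query_vec: Vector, doc_vecs: List[Vector], k: int) -> List[int]:
    """Stage 1: cheap similarity against vectors computed ahead of time.
    Brute force here; a real system would use an ANN index."""
    order = sorted(range(len(doc_vecs)), key=lambda i: -dot(query_vec, doc_vecs[i]))
    return order[:k]


def rerank_stage(query: str, docs: List[str], candidates: List[int],
                 pair_score: Callable[[str, str], float], top_n: int) -> List[int]:
    """Stage 2: the expensive scorer sees the raw (query, document) pair,
    but only for the candidates that survived stage 1."""
    order = sorted(candidates, key=lambda i: -pair_score(query, docs[i]))
    return order[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: counts shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = ["apple pie recipe", "apple phone review",
        "banana bread recipe", "car engine repair"]
doc_vecs = [[1.0, 0.2], [0.9, 0.1], [0.3, 0.9], [0.0, 0.1]]  # made-up embeddings

candidates = recall_stage([1.0, 0.0], doc_vecs, k=3)
top = rerank_stage("apple phone", docs, candidates, overlap_score, top_n=2)
print(candidates, top)
```

In this toy, stage 1 ranks the pie recipe above the phone review because the made-up vectors are coarse; stage 2 reads the actual pair and flips them.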

Why not cross-encode everything? The numbers are violent.

<100 ms: bi-encoder recall over 40M docs
>50 hrs: cross-encoding the same 40M for one query
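The arithmetic behind numbers like these is short enough to check by hand. The per-pair cost below is an assumption, not a measurement: about 5 ms per query–document pair, scored one at a time. The function name is hypothetical.

```python
def full_scan_hours(n_docs: int, ms_per_pair: float) -> float:
    """Hours to cross-encode every document against one query, sequentially."""
    return n_docs * ms_per_pair / 1000 / 3600


# Assumed cost: 5 ms per pair, no batching.
print(round(full_scan_hours(40_000_000, 5.0), 1))
```

At that assumed cost the full scan comes to roughly 55.6 hours for a single query. Batching on faster hardware would shrink the per-pair figure, but even a tenfold speedup leaves it at hours against a recall stage measured in milliseconds.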

A bi-encoder compresses every meaning of a document into one vector before it sees your query — so it loses information. A cross-encoder feeds the raw query–document pair through full attention, scoring for this query. That's why rerankers reliably add 5–20% NDCG@10 on top of vector recall. But the real reason it earns its keep is subtler:

Now the part people skip: proving the new retriever is actually better. Two configs, two accuracy numbers, ship the higher one: that's how you fool yourself. Both retrievers faced the same questions, so compare them paired. On a fixed set (I use a public QA set), build a 2×2 table of per-question outcomes and run McNemar's test. Only the off-diagonal cells, the questions where the two retrievers disagree, carry signal.

Fig 2: McNemar's 2×2 table.
a: A correct, B correct (ignored)
b: A correct, B wrong
c: A wrong, B correct
d: A wrong, B wrong (ignored)
The both-right and both-wrong cells are noise; only b and c decide whether the change is real. χ² = (b − c)² / (b + c).
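A sketch of the test on per-question correctness flags, using only the standard library. The function name and the input format (one boolean per question per retriever, in the same order) are assumptions. The statistic is the uncorrected χ² from the figure, with one degree of freedom, for which the p-value is erfc(√(χ²/2)).

```python
import math
from typing import List, Tuple


def mcnemar(a_correct: List[bool], b_correct: List[bool]) -> Tuple[int, int, float, float]:
    """Returns (b, c, chi2, p) for two retrievers scored on the same questions."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both retrievers must be scored on the same questions")
    # Only the discordant pairs matter.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)  # A right, B wrong
    c = sum(1 for x, y in zip(a_correct, b_correct) if not x and y)  # A wrong, B right
    if b + c == 0:
        return b, c, 0.0, 1.0          # no disagreements, nothing to test
    chi2 = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 dof
    return b, c, chi2, p


# Made-up outcomes for 100 questions: 10 only A got, 30 only B got, 60 agreed.
a_flags = [True] * 10 + [False] * 30 + [True] * 50 + [False] * 10
b_flags = [False] * 10 + [True] * 30 + [True] * 50 + [False] * 10
print(mcnemar(a_flags, b_flags))
```

The χ² approximation gets shaky when b + c is small; in that case an exact binomial test on the discordant pairs is usually the safer choice.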

Paired tests matter here precisely because an LLM system gets evaluated once — one expensive run, no cheap retrains to resample. Pairing cancels question difficulty, so a small, consistent win shows up as significant even when the headline accuracy gap looks like nothing. A vibe says "feels better." McNemar says "this many questions flipped your way, and that many flipped against — here's whether to believe it."

A retriever for an agent is a function in a loop, tuned by a paired test. Not a search box, judged by a feeling.