The Reranker We Were Ready to Build — Until the Eval Said No

TLDR: We gated Phase 3 of an Apollo memory build on a real benchmark before writing a single line of reranker code. The gate came back no-go. That was the best possible result.

the plan

We were three phases into a memory-system overhaul for Apollo (my personal AI agent, built on Anthropic's Claude Agent SDK).

Phase 1 shipped clean: a recency cache that warms session starts.

Phase 2 shipped clean: /mem-lint, a vault health checker.

Phase 3 on the list: a rerank tier for /mem-search — the skill Apollo uses to pull the right memories from my Obsidian vault before every response.

The pitch made total sense on paper. /mem-search runs BM25 + vec0 vector search + RRF (reciprocal rank fusion — a way to blend keyword and semantic results) to return a ranked list of files. A reranker sits on top and re-sorts the results by relevance. More precision at position 0. Better answers.
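For anyone who hasn't met RRF before, here is a minimal sketch of the fusion step. This is not the actual /mem-search code: the function name, the file names, and the constant k=60 (a commonly used default) are all assumptions for illustration.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists with reciprocal rank fusion.

    Each file scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical file names, for illustration only.
bm25_hits = ["projects.md", "apollo.md", "travel.md"]
vector_hits = ["apollo.md", "health.md", "projects.md"]

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
```

A file that ranks well in both lists beats one that tops only a single list, which is the whole point of blending keyword and semantic results.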

Right?

what we built before the build

Before touching reranker code, we built an eval harness.

18 real labeled queries — the exact questions Apollo actually fires during a memory lookup. A gold set. An eval script. Both now live in the project's eval directory.

Then we measured the baseline.

The numbers came back:

top-1: 0.722
top-3: 0.944
MRR: 0.826 (mean reciprocal rank — how high the right answer lands on average)
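To make those three metrics concrete, here is a small sketch of how they can be computed from the rank of the gold file for each query. The function names are mine, and the rank list is a hypothetical distribution picked only because it is consistent with the reported numbers. It is not the real per-query data.

```python
def top_k_accuracy(ranks, k):
    """Fraction of queries whose gold file lands at rank <= k (ranks are 1-based)."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank; a query whose gold file is never returned scores 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Hypothetical ranks for 18 queries (NOT the actual results):
# 13 hits at rank 1, 2 at rank 2, 2 at rank 3, 1 at rank 5.
ranks = [1] * 13 + [2] * 2 + [3] * 2 + [5]

top1 = round(top_k_accuracy(ranks, 1), 3)
top3 = round(top_k_accuracy(ranks, 3), 3)
mrr = round(mean_reciprocal_rank(ranks), 3)
print(top1, top3, mrr)
```

With 18 queries, each one is worth about 5.6 points of accuracy, so a single query moving in or out of the top 3 shifts the headline number noticeably.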

That 0.944 stopped me cold.

NINETY-FOUR PERCENT of the time (17 of the 18 queries), the right file was already in the top 3.

why the gate said no

Here's the thing I'd half-forgotten by Phase 3: /mem-search doesn't pick rank 0 and walk away. It synthesizes over the whole returned set.

So here's the core mismatch: a reranker only reorders top-k, but the consumer never reads the ordering — synthesis gets everything and makes its own call.

Moving the right file from position 2 to position 0 doesn't change what the synthesis model sees. If the file was anywhere in the returned set, synthesis already had it.
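The mismatch fits in a few lines. This is a toy sketch, assuming the consumer really is order-insensitive as argued here; the file names, scores, and both functions are hypothetical.

```python
def rerank(files, scores):
    """Toy reranker: re-sort the same files by a (hypothetical) relevance score."""
    return sorted(files, key=lambda f: scores[f], reverse=True)

def synthesis_input(files):
    """What an order-insensitive consumer effectively sees: the set of files."""
    return frozenset(files)

# The right file (apollo.md) starts at position 2.
retrieved = ["travel.md", "health.md", "apollo.md"]
scores = {"apollo.md": 0.9, "travel.md": 0.4, "health.md": 0.1}

reranked = rerank(retrieved, scores)
print(reranked[0])  # the right file is now at position 0
print(synthesis_input(retrieved) == synthesis_input(reranked))  # same input either way
```

A reranker is a permutation of the list. It can't add a file that retrieval missed, and it can't change the set that synthesis receives.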

And the cost? The hot path was already running ~4.3 seconds, bound by the RAG step. A rerank call adds latency on top of that for no gain the actual output would ever show.

Gate result: no-go. Don't build it.

what "failed" here

Nothing crashed. Nothing shipped broken.

The failure was the idea itself — killed before it cost us a day of build time. That's the win.

I used to find this out on the other side of the work. Spend a week on the implementation, benchmark it at the end, discover the baseline was already good enough, feel like an idiot.

Here the gate was the first move, not the last check. 18 labeled queries and a Python script surfaced the decision before any Phase 3 build work started.

The lesson I'm carrying: don't optimize an ordering your downstream never reads. Find the consumer. Trace what it actually sees. Measure that. Then decide if there's something to optimize.

what lives in the eval directory

The harness isn't throwaway.

If corpus recall degrades (if the top=none rate climbs in telemetry, or the vault gets much harder to search), we re-run the eval script against the same gold set and let the numbers reopen the gate.
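A re-check like that could be as small as the sketch below. The baseline values are the ones reported earlier in this post; the function name, the dictionary layout, and the 0.05 tolerance are assumptions, not what the real eval script does.

```python
# Baseline metrics from the original eval run.
BASELINE = {"top1": 0.722, "top3": 0.944, "mrr": 0.826}

def gate_should_reopen(current, baseline=BASELINE, tolerance=0.05):
    """Return the metrics that fell more than `tolerance` below baseline.

    An empty list means the gate stays closed. The tolerance is an
    assumed value; pick one that matches the noise in your own gold set.
    """
    return [
        name
        for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    ]

# Hypothetical later run where top-3 recall has slipped.
later_run = {"top1": 0.70, "top3": 0.80, "mrr": 0.80}
print(gate_should_reopen(later_run))
```

With only 18 queries the tolerance has to be loose, since one query flipping moves top-3 by about 0.056.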

Until then, the gate stays closed.

That's the thing about a good benchmark: it doesn't just kill bad ideas once. It stays in the repo and keeps killing them every time you come back with a new angle and a hopeful look.

P.S. The /mem-lint health checker from Phase 2 turned up a handful of stale cross-links I'd never have found manually. Some phases earn their keep quietly. Phase 3 earned its keep by costing nothing at all.