TLDR: When building an attention gate with an LLM, fail-closed beats fail-open. Precision is the differentiator — not recall.
The Apollo Dashboard (my local personal-OS cockpit — a Python server running at localhost that rolls up my tasks, calendar, and email) has an email triage panel.
I wired Claude Haiku (Anthropic's fastest model) into it via the Claude Agent SDK on a subscription. It runs hourly, drops results into a state/email.json file, and the dashboard reads it cold at page load.
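Here is a minimal sketch of the "reads it cold" half, so it's clear the page load never calls the model. Only the state/email.json path comes from my setup as described; the top-level "emails" key and the function names are assumptions made up for this illustration.

```python
import json
from pathlib import Path

STATE_FILE = Path("state/email.json")  # path from the post; the schema is assumed


def parse_triage(text: str) -> list[dict]:
    """Parse the state file body; return [] for anything malformed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    emails = data.get("emails", [])  # "emails" key is a hypothetical schema
    return emails if isinstance(emails, list) else []


def load_triage(path: Path = STATE_FILE) -> list[dict]:
    """Read the last hourly result from disk; never call the model here."""
    try:
        return parse_triage(path.read_text(encoding="utf-8"))
    except (OSError, UnicodeDecodeError):
        # Missing or unreadable file: show an empty panel, not an error page.
        return []
```

A missing or corrupt file produces an empty list, so a failed hourly run degrades to a blank panel instead of a broken page.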
My first instinct for the default behavior was fail-OPEN: when the model isn't sure, surface the email. Don't miss something important.
That was WRONG.
The Problem With Open
A "needs my reply" view that cries wolf trains you to treat it exactly like your inbox — skeptically, manually, every item.
That's not triage. That's an inbox with extra steps and extra latency.
The whole value of an attention gate is trust at a glance. The moment you can't trust the list, you've lost it.
What I Tried First (Spoiler: Bigger Wasn't Better)
My gut said: use a smarter model. More reasoning power = fewer wrong calls.
The benchmark humbled me fast.
Eager flagship models over-surface. They flag everything that might matter. Recall goes to 1.0 and stays there. Precision tanks.
(I'll be honest: my gold set was only 20 emails, too small to crown a winner statistically. Haiku's precision score swung between identical runs. But the principle survived every round: restraint wins, horsepower doesn't.)
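To make "recall goes to 1.0, precision tanks" concrete, here is a small sketch of the two metrics on a gold set of 20. The email IDs and counts below are invented for illustration and are not my benchmark results.

```python
def precision_recall(flagged: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision and recall for the 'action' view.

    flagged: email IDs the model surfaced as action.
    gold:    email IDs that truly needed a reply.
    """
    true_pos = len(flagged & gold)
    # Convention chosen here: an empty view has made no false alarms.
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical gold set: 5 of 20 emails genuinely need a reply.
gold = {f"e{i}" for i in range(1, 6)}

# An eager model flags 12 emails, including all 5 real ones.
eager = {f"e{i}" for i in range(1, 13)}

# A restrained model flags 4 emails, all of them real.
restrained = {f"e{i}" for i in range(1, 5)}

eager_p, eager_r = precision_recall(eager, gold)
restrained_p, restrained_r = precision_recall(restrained, gold)
```

In this made-up case the eager model scores perfect recall with a precision of 5/12 (about 0.42), so more than half the view is noise. The restrained model misses one email but every item it shows is real. With only 20 emails, a single flipped call moves these numbers a lot, which is why I don't read much into any one run.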
The Fix
The commit message says it:
fix: fail-closed email triage default (precision "needs my reply" view)
If the model is uncertain → don't surface it. Full stop.
I also replaced the blunt needs_attention boolean with a proper 3-bucket system:
- action — needs my reply or a task
- fyi — keep-in-the-loop, no action required
- noise — dropped entirely
Uncertain emails fall to fyi or noise, never into my action view. The Email page became a precision filter. The new /attention area handles fyi — source-agnostic by design, ready for Slack and Notion when those plug in.
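The fail-closed rule can be sketched in a few lines. This assumes the model returns a bucket label plus a confidence score. That shape, the Verdict class, and the 0.8 threshold are all hypothetical stand-ins, not the dashboard's actual code.

```python
from dataclasses import dataclass
from typing import Optional

BUCKETS = ("action", "fyi", "noise")
ACTION_THRESHOLD = 0.8  # hypothetical cutoff, tune against your own gold set


@dataclass
class Verdict:
    bucket: str        # what the model proposed
    confidence: float  # assumed 0.0-1.0 self-reported confidence


def route(verdict: Optional[Verdict]) -> str:
    """Fail closed: only a confident 'action' verdict reaches the action view."""
    if verdict is None or verdict.bucket not in BUCKETS:
        # Missing or malformed model output never lands in 'action'.
        return "fyi"
    if verdict.bucket == "action" and verdict.confidence < ACTION_THRESHOLD:
        # Unsure about action: demote instead of surfacing.
        return "fyi"
    return verdict.bucket
```

The design choice is that every error path (timeout, bad JSON, unknown label, low confidence) resolves to something other than action. A fail-open version would flip those defaults, and that is the version that cries wolf.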
Behavioral verification, because that's the only kind that matters:
- Client email: "approve the terms?" → action ✓
- Engineer: "FYI deployed, no action needed" → fyi ✓
- Stripe receipt + Notion digest → noise, dropped ✓
And the learning loop made it sharper without touching a single prompt. I 👎'd a newsletter sender — their routine sends got demoted on the next triage pass. Their "payment failed, suspends today" sibling broke through anyway because the content screamed action. I 👍'd a client — they surfaced reliably every time.
The system got sharper from my feedback, not from me writing better prompts. The model itself never changed.
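One way such a loop could work is sketched below: thumbs adjust a per-sender bar that an email's action confidence has to clear. This is an illustrative assumption, not a description of the real mechanism. The class name, the step size, and the 0.5 to 0.95 clamp are all invented.

```python
from collections import defaultdict


class SenderFeedback:
    """Hypothetical per-sender bar for entering the action view."""

    def __init__(self, base: float = 0.8, step: float = 0.1) -> None:
        self.base = base
        self.step = step
        self._score = defaultdict(int)  # net thumbs per sender

    def thumbs(self, sender: str, up: bool) -> None:
        self._score[sender] += 1 if up else -1

    def threshold(self, sender: str) -> float:
        # Thumbs-down raises the bar, thumbs-up lowers it.
        raw = self.base - self.step * self._score[sender]
        # Cap below 1.0 so content that screams action can still break through.
        return min(0.95, max(0.5, raw))

    def surfaces(self, sender: str, action_confidence: float) -> bool:
        return action_confidence >= self.threshold(sender)


feedback = SenderFeedback()
feedback.thumbs("news@example.com", up=False)    # thumbs-down on a newsletter
feedback.thumbs("client@example.com", up=True)   # thumbs-up on a client

routine_send = feedback.surfaces("news@example.com", 0.20)     # False
payment_failed = feedback.surfaces("news@example.com", 0.97)   # True
client_note = feedback.surfaces("client@example.com", 0.75)    # True
```

The cap is what lets a demoted sender's "payment failed" email through: feedback shifts the bar, but it never blocks a sender outright.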
Why This Matters
There's a difference between an AI that tries to be comprehensive and one that tries to be right.
Comprehensive AI shows you everything it's unsure about. Precise AI stays quiet until it's confident.
For a view I'm supposed to trust at a glance — without thinking, without scanning — quiet confidence wins every time.
The lesson I'm carrying into every triage-style build from here: design for the failure mode that destroys trust, not the one that misses things. A missed email is recoverable. A list I've stopped believing in is not.