TLDR: Blind retry is almost always wrong. Classify the error first, fix what you can, then retry — and narrate loudly so you know what's happening.

The Setup

I built Apollo (my personal AI assistant) on top of an iMessage bridge.

A Python daemon polls chat.db, sees a new message, hands it to a dispatcher that runs claude -p (Claude's headless CLI), and sends the reply back in the thread.

Dead simple in theory.

Less simple when claude -p fails — which it does.

The Wall

The naive answer is just… retry.

And I tried that. A tight try / except, one more attempt, done.

The problem is every failure mode needs a completely different response — and I only discovered this when I watched blind retry make things worse.

  • Auth failure? Retrying 3× is pointless. The call is dead until you run claude auth. All you're doing is spamming a broken process.
  • Rate-limit hit? Immediate retry just gets you 429'd again. You have to wait first.
  • Mac fell asleep? That's a transient glitch — a plain retry is exactly right.
  • SIGTERM (rc 143)? The OS killed it. Figure out why before you do anything.

Same retry loop, four completely different outcomes.

What Actually Worked

The pattern I landed on: diagnose first, fix if you can, then retry. Never retry blind.

The dispatcher now classifies every failure into one of four buckets:

  1. transient — retry as-is
  2. fixable — apply the known fix, THEN retry (e.g. rate limit → sleep 30s first)
  3. unfixable — stop, narrate, surface it to me (_MAX_ATTEMPTS ends here)
  4. unknown — one cheap probe-retry; if it fails twice, give up

_MAX_ATTEMPTS=2. Original attempt plus one retry. That's it.

The code lives in scanner/src/apollo_chat.py_diagnose_failure, _format_retry_narration, _format_giveup_narration — and every narration fires through the existing ZWJ-prefixed _send_apollo_reply so the listener doesn't loop on its own messages.

The Part I Want to Flag

Rate-limit detection is best-effort — and I mean that literally.

The classifier looks for the substrings "rate limit" or "429" in stderr. That's it. There's no structured error object from the CLI, no stable contract. Anthropic can change that wording tomorrow and the fix never fires.

So why is _MAX_ATTEMPTS still just 2? Because detection is unreliable.

If I'm not sure I classified the error correctly, the last thing I want is an aggressive retry loop hammering the same call six times with misplaced confidence. Conservative budget + best-effort detection is safer than precise detection + high retry count. Unknown errors get one probe retry. That's the safety net for whatever the substring matching misses.

Why This Matters to Me

The original feedback note I left myself says it best: "silence on errors is the worst failure mode."

A system that retries without explaining itself is a system you can't trust. Now when the dispatcher hits a wall, it sends me a message — ⚠️ retrying, ❌ giving up, ✓ recovered — through the same iMessage thread I'm already watching.

That narration isn't just a nice-to-have. It's the whole contract.

If you're building anything with retry semantics — dispatcher, background job, scanner loop — default to the diagnose-fix-narrate shape. Copy _diagnose_failure before you write another bare try / except / retry.

P.S. The unknown bucket has already caught two real failures I hadn't anticipated. That one cheap probe-retry on novel errors has paid for itself.