The One Retry Rule I Wish I'd Learned Before Building Two Systems

TLDR: Centralize your retry/backoff in one shared runner. And never, ever retry a timeout — it just doubles your wait.

The Setup

I was building a morning security scanner — two stages, Sonnet as the collector, Opus (a heavier Claude model) as the judge.

Both shell out to the claude CLI (Anthropic's command-line tool for running model calls from scripts).

Both needed retry logic, because transient failures happen.

So I gave both of them retry loops.

That was my first mistake.

The Problem

Two callers. Two slightly different retry implementations. Two sets of timeout values drifting apart. One had a subtle bug I didn't catch for a week.

The moment I started writing a third caller, I stopped and stared at the screen.

What I Extracted

A single shared _run_claude_cli() function. Every caller goes through it. One place for retry logic, timeout handling, and logging.

Three attempts. Backoff between them.

But here's the part that actually matters.

Not every failure gets a retry.

TimeoutExpired is never retried.

Think about it. If a call was supposed to take 5 minutes and it hit the 300-second watchdog… retrying immediately just doubles your wait. Do that across four attempts and you're looking at a 20-minute wall-clock hang before you even get a failure alert. That's WORSE than just surfacing the error fast.

The rule: classify first, then act. Transient network flake → retry with backoff. Timeout → surface it immediately, no retry.

Evolved in the Daemon

A few months later I applied the same pattern to my CEO daily briefing — a launchd job (macOS's background scheduler) that runs claude every weekday morning to build my day in Things (my task manager).

That one got harder. Four attempts. Growing backoff: BACKOFFS=(0 30 90 180) seconds. Escalating watchdog: TIMEOUTS=(300 300 600 600) — later attempts get more room before the kill signal.

Two lessons built into that design:

Idempotency or bust. Each attempt runs cancel-today first — nukes any existing project before creating a fresh one. Without that guard, four retries would leave four duplicate projects. A retry loop that isn't safe to run twice is a liability, not a feature.
Sometimes the fix is scheduling, not code. Staggering the daemon from 8:00 to 8:10 AM cleared a whole class of cold-start failures. Ten minutes of scheduling beat a month of debugging.

And then there's this: I almost swapped the model entirely. When I reproduced the exact failing invocation at 9:18 AM the briefing ran clean in 82 seconds. The logic was fine. The failures were transient. Don't rip out the component before you've isolated whether the component is even the problem.

Why This Matters

I run systems overnight and over weekends with nobody watching.

When they break, the retry loop IS the on-call engineer.

Getting the classify-then-act shape right — and making sure every retry is safe to run twice — is the difference between automation I trust and automation I'm scared of.

That's worth extracting into a shared runner. Every time.