TLDR: My automated CEO briefing kept dying at exactly 8 AM. The logic was fine — it was the clock. Reproduce off-peak first. If it works, you have a contention problem, not a bug.
The Setup
A few weeks ago I built a daemon that wakes up every weekday morning and delivers a full CEO briefing right into Things (my task manager) before I open my laptop.
launchd (macOS's built-in job scheduler) fires run-daemon.sh, which spins up a headless Claude subprocess with a tight tool allowlist — it can read email via Arcade (my email/MCP provider), pull tasks from Things, and check calendar. Can't send anything. Can't write anything except the one Things project it creates.
The output: a project titled CEO Briefing | Thu, Jun 26, full digest in the notes, action items as to-dos, deep links back into Superhuman (my email client). Beautiful when it worked.
What Broke
It stopped working.
Tuesday 6/23: both attempts timed out. Wednesday 6/24: attempt one emitted no sentinel (the success marker the script writes to confirm the run finished), attempt two timed out. Monday 6/22 had barely squeaked through on retry.
Three mornings. No briefing.
The Wrong Instinct
My first thought was the model.
I'm running deepseek-v4-pro:cloud via Ollama — it won a bake-off in mid-June over kimi-k2.5, which botched live tool orchestration, looped add_todo six times, and hung for 14 minutes during the first live fire. deepseek had been rock solid since.
Still — three failures. The instinct to swap the model was LOUD.
I didn't. And I'm glad.
The Real Diagnosis
Instead, I had Apollo (my AI agent) reproduce the exact daemon invocation off-peak — same prompt, same model, same tool allowlist, but at 9:18 AM instead of 8:00 AM.
It succeeded end-to-end in 82 seconds.
That one test told me everything. The briefing logic was healthy. The model was healthy. The problem was the clock.
8 AM is a contention spike. The Ollama server is cold (or loaded). The model hasn't warmed. Something in the stack is congested right at the top of the hour, and a single fixed 300-second timeout with one retry wasn't enough runway to land clean.
The Fix That Worked
Two changes, both in run-daemon.sh and the launchd plist:
1. Replaced the single retry with a bounded 4-attempt backoff loop:
BACKOFFS=(0 30 90 180)— each attempt waits that many seconds before firing, so retries land progressively further from the spikeTIMEOUTS=(300 300 600 600)— later attempts get more room; the model may just need longer to load on attempt 3- Before every attempt:
cancel-todayruns first, wiping any existing Today CEO project so retries never stack up duplicates - Four failures →
alert_failurefires a Things task so I know the briefing died. Never silent.
2. Staggered the schedule from 8:00 → 8:10 AM (PlistBuddy, plutil -lint, reloaded via launchctl bootout + bootstrap).
Worst-case wall time is now ~35 minutes. Normal morning: 82 seconds, attempt one.
I ran the loop logic through zsh -n and unit-tested the backoff math — success short-circuits with zero backoff; a 3rd-attempt success correctly applies 30+90 seconds of accumulated wait. Clean.
Why This Matters to Me
The tempting fix was to blame the model and swap it out. That would've been wrong — and I'd have thrown away a system that works for one that might not.
The lesson I'm keeping: when a scheduled job fails on schedule, reproduce it off schedule first. If it succeeds off-peak, you don't have a logic bug. You have a timing problem. Fix the timing — backoff, stagger, more runway — and leave the logic alone.
Diagnosis before retry. Every time.
P.S. The "classify before retrying" rule is one I've been drilling into every background system I build — if you're in Apollo, see my self-healing recovery pattern note in memory. The shape: classify (transient / fixable / fatal / unknown), fix what you can fix, then retry. Blind retries are just expensive noise.