TLDR: My automated CEO briefing kept dying at exactly 8 AM. The logic was fine — it was the clock. Reproduce off-peak first. If it works, you have a contention problem, not a bug.

The Setup

A few weeks ago I built a daemon that wakes up every weekday morning and delivers a full CEO briefing right into Things (my task manager) before I open my laptop.

launchd (macOS's built-in job scheduler) fires run-daemon.sh, which spins up a headless Claude subprocess with a tight tool allowlist — it can read email via Arcade (my email/MCP provider), pull tasks from Things, and check calendar. Can't send anything. Can't write anything except the one Things project it creates.

The output: a project titled CEO Briefing | Thu, Jun 26, full digest in the notes, action items as to-dos, deep links back into Superhuman (my email client). Beautiful when it worked.

What Broke

It stopped working.

Tuesday 6/23: both attempts timed out. Wednesday 6/24: attempt one emitted no sentinel (the success marker the script writes to confirm the run finished), attempt two timed out. Monday 6/22 had barely squeaked through on retry.

Three mornings. No briefing.

The Wrong Instinct

My first thought was the model.

I'm running deepseek-v4-pro:cloud via Ollama — it won a bake-off in mid-June over kimi-k2.5, which botched live tool orchestration, looped add_todo six times, and hung for 14 minutes during the first live fire. deepseek had been rock solid since.

Still — three failures. The instinct to swap the model was LOUD.

I didn't. And I'm glad.

The Real Diagnosis

Instead, I had Apollo (my AI agent) reproduce the exact daemon invocation off-peak — same prompt, same model, same tool allowlist, but at 9:18 AM instead of 8:00 AM.

It succeeded end-to-end in 82 seconds.

That one test told me everything. The briefing logic was healthy. The model was healthy. The problem was the clock.

8 AM is a contention spike. The Ollama server is cold (or loaded). The model hasn't warmed. Something in the stack is congested right at the top of the hour, and a single fixed 300-second timeout with one retry wasn't enough runway to land clean.

The Fix That Worked

Two changes, both in run-daemon.sh and the launchd plist:

1. Replaced the single retry with a bounded 4-attempt backoff loop:

  • BACKOFFS=(0 30 90 180) — each attempt waits that many seconds before firing, so retries land progressively further from the spike
  • TIMEOUTS=(300 300 600 600) — later attempts get more room; the model may just need longer to load on attempt 3
  • Before every attempt: cancel-today runs first, wiping any existing Today CEO project so retries never stack up duplicates
  • Four failures → alert_failure fires a Things task so I know the briefing died. Never silent.

2. Staggered the schedule from 8:00 → 8:10 AM (PlistBuddy, plutil -lint, reloaded via launchctl bootout + bootstrap).

Worst-case wall time is now ~35 minutes. Normal morning: 82 seconds, attempt one.

I ran the loop logic through zsh -n and unit-tested the backoff math — success short-circuits with zero backoff; a 3rd-attempt success correctly applies 30+90 seconds of accumulated wait. Clean.

Why This Matters to Me

The tempting fix was to blame the model and swap it out. That would've been wrong — and I'd have thrown away a system that works for one that might not.

The lesson I'm keeping: when a scheduled job fails on schedule, reproduce it off schedule first. If it succeeds off-peak, you don't have a logic bug. You have a timing problem. Fix the timing — backoff, stagger, more runway — and leave the logic alone.

Diagnosis before retry. Every time.

P.S. The "classify before retrying" rule is one I've been drilling into every background system I build — if you're in Apollo, see my self-healing recovery pattern note in memory. The shape: classify (transient / fixable / fatal / unknown), fix what you can fix, then retry. Blind retries are just expensive noise.