TLDR: In any processing loop, "transient" and "terminal" are completely different states. Only terminal states should block re-processing — and they definitely shouldn't change your fetch scope.

the setup

I've been building autonomous processing loops — agents that run on a timer, pull new data, act on it, and disappear without me touching anything.

The one that bit me hardest is a Fathom (my call-recording platform) scanner: it polls for recent transcripts, extracts commitments I made on calls, and turns them into tasks. Runs every few minutes via launchd (macOS's job scheduler). I never see it. It just works — until it doesn't.

the actual state machine

The scanner has two genuine dedup layers:

  • Recording-level: processed_ids["fathom:<recording_id>"] — a call goes in here once it's fully handled. Never touched again.
  • Commitment-level: processed_ids[generate_commitment_dedup_id(counterparty, deliverable)] — if I owe the same thing to the same person across two different calls, it collapses to one task.

And then there's a separate transient mechanism: pending_transcript. Fathom's API sometimes returns a call before the transcript is ready. Those land here with a retry count — up to 12 retries (3 hours total). After that, they graduate to processed_ids and are skipped forever.

Seemed airtight. The two terminal layers handle dedup. The transient layer handles waiting.

what actually broke

On April 29, launchd missed its bootstrap. The scanner just… stopped. No alert. No error. Logs went quiet.

I didn't notice until May 13.

Here's the part that really stings. The state file had last_poll = 2026-04-29 and one stale entry in pending_transcript. And the code had this branch:

if pending:
    after_date = now - timedelta(days=3)
else:
    after_date = state["last_poll"]

The logic was well-intentioned: if something is pending, widen the window so we can retry against fresh API data. Fine idea in isolation. But when the scanner restarted on May 13, it saw that one stale pending entry and widened to 3 days back — May 10.

April 30 through May 9. Gone.

I had to run --backfill 14 manually to fill the gap. The overlap days were fine (the dedup layers caught everything twice). The missed window is just gone — anything I didn't catch in that manual backfill, I lost.

my first instinct was wrong

I immediately wanted to blame the launchd outage. And yes — that's where I focused first. Fix the bootstrap. Add a health check. Done.

But that's treating the symptom. The real bug was that a transient state was silently changing the fetch scope. A pending recording isn't done. It's waiting. And I'd let "waiting" act like a rewind button on the entire polling window.

the rule I'm coding to now

I can't point you to a clean one-commit fix on the Fathom scanner — the immediate remedy was the --backfill 14. But I applied the same hard-won lesson to the precall prep agent (the other loop I was building at the same time): fix(scanner): precall — only 'wrote' outcome blocks re-processing.

That commit name says everything. Previously, intermediate states were blocking re-runs when they had no business doing so. Only a confirmed write — a terminal, verifiable outcome — should close the loop on a record.

The principle: terminal states and transient states must not share responsibilities.

Terminal states (wrote, processed, skipped-after-max-retries) → block re-processing. Transient states (pending, retrying, in-flight) → stay transparent. They're waiting, not done. Don't let them anchor your fetch cursor or widen your window or skip records downstream.

And there's one more thing I love about this design: prune_processed(7 days) sweeps both terminal layers at every scan startup. Even "done" isn't forever. A commitment that hasn't shipped in a week can re-surface as a task — if I re-mentioned it on a call, that's signal I still haven't done it.

State machines that know when to forget are almost as important as the ones that know when to remember.