what we were building
Apollo's scanner is an autonomous iMessage monitor.
It wakes up on a timer, pulls recent messages, runs them through Claude to classify and judge, then writes state.json to track what's been processed so it never touches the same message twice.
Simple enough. Or so I thought.
the wall
Early April, the scanner started losing its place.
It would re-process messages it had already handled, which meant it could create duplicate tasks in Things 3 (my task manager) and send duplicate alerts. The state file was supposed to prevent exactly this.
Then I checked the actual file.
Truncated JSON. Half-written objects. A file Python's json module couldn't even parse.
What happened? The scanner had crashed mid-write.
And when you write to a file the naive way, open(path, 'w') then json.dump(), you're not doing one operation. open() truncates the existing file right away, and json.dump() then emits a stream of small writes. Kill the process in the middle and the old contents are already gone while the new ones are only partly there: a file that's fully broken.
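To see the failure without killing a real process, here is a minimal sketch that simulates the crash with an exception raised partway through json.dump. The naive_write and is_valid_json helpers, the file layout, and the unserializable value are stand-ins for illustration, not anything from the scanner.

```python
import json
import os
import tempfile


def naive_write(path: str, data: dict) -> None:
    # open(..., "w") truncates the existing file before anything is written.
    with open(path, "w") as f:
        # json.dump sends the document to the file in many small chunks.
        json.dump(data, f)


def is_valid_json(path: str) -> bool:
    try:
        with open(path) as f:
            json.load(f)
        return True
    except json.JSONDecodeError:
        return False


workdir = tempfile.mkdtemp()
state_path = os.path.join(workdir, "state.json")

naive_write(state_path, {"processed": ["msg-1", "msg-2"]})
good_before = is_valid_json(state_path)

# Simulated crash: object() cannot be serialized, so json.dump raises
# partway through, after the old file has already been truncated.
try:
    naive_write(state_path, {"processed": ["msg-1", "msg-2", "msg-3"], "bad": object()})
except TypeError:
    pass

good_after = is_valid_json(state_path)
print(good_before, good_after)
```

The first write parses fine; after the interrupted second write, the file no longer does, and the previous good state is gone with it.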
what i tried first
I tried wrapping the write in a try/except. That doesn't help: a killed process never reaches the except block, and even when an exception is caught, the file has already been truncated.
I thought about making the state file smaller. Same problem at any size.
Both dead ends.
the fix that worked
Atomic write. Three steps:
- Write the new content to a temp file in the same directory
- Flush + fsync to get it fully on disk
- os.replace(tmp, target): one syscall that atomically swaps the old file for the new one
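import json
import os
import tempfile

def write_state(path: str, data: dict) -> None:
    # Temp file must live in the target's directory so os.replace stays on one filesystem.
    dir_ = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=dir_, delete=False, suffix=".tmp") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    # Swap only after the temp file is closed and fully on disk.
    os.replace(f.name, path)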
(This is illustrative — same pattern as state.py, not the verbatim file.)
The key insight: os.replace is atomic at the OS level, as long as the temp file and the target are on the same filesystem (which is why the temp file goes in the same directory). The old file is visible right up until the new one takes its place. There is no in-between state a crash can corrupt. If the process dies before os.replace, the temp file is just abandoned and the original is untouched.
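A quick way to check that claim is to make the write fail before os.replace and then look at the original. This is a sketch under assumptions: atomic_write_json is a hypothetical helper that follows the same pattern with temp-file cleanup added, and the failure is simulated with an unserializable value rather than a real crash.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Hypothetical helper: write JSON to a temp file, then swap it into place."""
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        # Only reached if the temp file was written completely.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the abandoned temp file after an in-process failure.
        # A hard kill skips this: it leaves a stray .tmp but a good target.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
state_path = os.path.join(workdir, "state.json")

atomic_write_json(state_path, {"processed": ["msg-1"]})

# Simulated failure before os.replace: the second write never lands.
try:
    atomic_write_json(state_path, {"processed": ["msg-1", "msg-2"], "bad": object()})
except TypeError:
    pass

with open(state_path) as f:
    survived = json.load(f)

leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]
print(survived, leftovers)
```

The state file still holds the last complete write, and the failed attempt leaves nothing behind in the directory.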
Nine autonomous dev-review loop cycles later, the scanner was declared stable.
the same bug in a different coat
Here's what I didn't expect: the same problem hit me again, in a completely different context.
Claude Code's built-in Edit tool does a read-snapshot before writing. If another process writes the same file between your Read and your Edit, you get: "File has been modified since read." That's the same race, just surfaced differently.
The fix in spirit is identical: for any file two processes share, do the read-modify-write in a single short Python step keyed on an anchor string, instead of going through the Edit tool, which can lose the race.
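One way that could look, as a rough sketch: insert_after_anchor and the sample file are hypothetical, and the approach assumes the anchor string appears on a line of the file. It re-reads the file at the moment of the edit, so it never works from a stale snapshot, and it writes the result back with the same temp-file-plus-os.replace pattern.

```python
import os
import tempfile


def insert_after_anchor(path: str, anchor: str, new_line: str) -> bool:
    """Hypothetical helper: insert new_line after the first line containing anchor.

    Returns False and leaves the file alone if the anchor is missing.
    """
    # Read the current contents right before modifying, not from an old snapshot.
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

    for i, line in enumerate(lines):
        if anchor in line:
            lines.insert(i + 1, new_line)
            break
    else:
        return False

    # Write the updated contents atomically, next to the target file.
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    return True


workdir = tempfile.mkdtemp()
notes_path = os.path.join(workdir, "notes.md")
with open(notes_path, "w", encoding="utf-8") as f:
    f.write("# log\n## today\n- first entry\n")

ok = insert_after_anchor(notes_path, "## today", "- second entry")
missing = insert_after_anchor(notes_path, "## no such heading", "- ignored")

with open(notes_path, encoding="utf-8") as f:
    result = f.read()
print(ok, missing)
print(result)
```

This shrinks the window between read and write to a very short one but does not close it. If two writers can truly overlap, a lock around the whole function (for example fcntl.flock on POSIX) would be the next step.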
Concurrency isn't just a server problem. It shows up anytime two things — two scanner cycles, two savers, a daemon and a human — write the same file.
why this matters to me
The scanner runs autonomously, without babysitting. Crashes will happen. State corruption is a silent killer — it doesn't throw an error, it just makes your agent slowly wrong in ways you won't catch until it's embarrassing.
Atomic writes are the difference between a system that degrades gracefully and one that quietly loses its mind at 3am.
Three steps in Python. Permanent fix.