Why I Rebuilt My Lab Pipeline Twice Before It Was Right

TLDR: raw posts leaked real client names; a dual-model denylist was too blunt; I landed on a deterministic scrub_map + Sonnet generalize pass + a per-post git store. Three previously-leaky posts now hit zero surviving entities. The scrub problem and the queue problem were always two separate bugs.

The Setup

I run a technical blog where Apollo, my Claude-based AI agent, re-publishes the things we actually learn together. An automated pipeline takes raw session notes and ships them as posts.

Two problems were always lurking. I just kept pretending they were one.

First, the names. Raw notes are full of real people and businesses. That's how notes work — you don't sanitize in the moment. You fix it before you hit publish.

Second, the queue. I was staging posts in a single lab_queue.json file. Fine, until two writes landed at once and a post evaporated.

What I Tried First

Raw mode → leak. First real batch, a real client name ended up in a post. One instance. That's enough.

So I built a dual-model auto-publish harness — a denylist-based blocker that flagged and auto-held anything it matched. It felt sophisticated.

The problem: it knew what to block. It had no idea what to replace.

So it redacted. [REDACTED]. Which made posts unreadable. Which killed the entire point of publishing them — the technical value is in the specifics. Saying "I used [REDACTED] and it scrubbed my [REDACTED] on the [REDACTED]" teaches nobody anything.

I was solving the wrong layer.

The Fix That Worked

Two fixes. Separate problems, separate answers.

The scrub: a deterministic scrub_map at ~/.config/apollo/lab-pipeline/scrub_map.json (chmod 600, never committed, 19 starter entities) that maps each known entity to a descriptor, not a [REDACTED] block.

The four-category spec:

Business names → category ("an ecommerce business", "a law firm client")
People → relationship ("my wife", "my mentor") or first person ("I")
Repo names → what it does ("an internal CRM", "a client-facing dashboard")
Revenue and dollar figures → stripped entirely

The key design decision: scrub identity, keep the technical substance. Apollo stays named. Platforms, tools, libraries, technical numbers — all survive. The post stays useful. You just can't tell whose codebase it came from.
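A minimal sketch of what the deterministic map stage could look like, assuming a flat entity-to-descriptor mapping. The entries, the dollar-figure pattern, and the scrub function name are all illustrative assumptions; the real scrub_map.json schema isn't shown here.

```python
import re

# Hypothetical entries for illustration only; not the real map.
SCRUB_MAP = {
    "Acme Outfitters": "an ecommerce business",
    "Jane Doe": "my mentor",
    "acme-crm": "an internal CRM",
}

# Assumed pattern for dollar figures such as $900, $4,500 or $4.5k.
DOLLAR_RE = re.compile(r"\$\d[\d,]*(?:\.\d+)?[kKmM]?")


def scrub(text: str, scrub_map: dict[str, str]) -> str:
    """Swap known entities for descriptors, then strip dollar figures."""
    if scrub_map:
        # Longest names first so a full name wins over a shorter prefix.
        entities = sorted(scrub_map, key=len, reverse=True)
        pattern = re.compile(
            r"(?<!\w)(?:" + "|".join(map(re.escape, entities)) + r")(?!\w)",
            re.IGNORECASE,
        )
        lookup = {name.lower(): desc for name, desc in scrub_map.items()}
        # Single pass, so a descriptor is never re-scanned for entities.
        text = pattern.sub(lambda m: lookup[m.group(0).lower()], text)
    text = DOLLAR_RE.sub("", text)
    # Collapse the double spaces left behind by stripped figures.
    return re.sub(r" {2,}", " ", text).strip()


print(scrub("Jane Doe fixed acme-crm for Acme Outfitters using Postgres.", SCRUB_MAP))
```

Tool names like Postgres and Apollo pass through untouched because they are simply not in the map. That is the whole design: only identity gets replaced.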

After the map runs, a Sonnet generalize pass catches what the map missed: unknown repos, revenue figures the map stage's regex didn't match, and narrowing fingerprints (details that narrow down who a post is about). Then a residual-flag check runs (it flags, never blocks), looking for surviving map entities, $ figures, or unknown multi-word proper nouns.
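The residual-flag check could be sketched roughly like this. The function name, the two regexes, and the allowlist parameter are assumptions for illustration; the proper-noun heuristic in particular is deliberately crude and will produce false positives, which is acceptable for a check that only flags.

```python
import re

# Assumed heuristics, not the pipeline's actual patterns.
DOLLAR_RE = re.compile(r"\$\s?\d")
# Two or more capitalized words in a row, e.g. "Northwind Traders".
PROPER_NOUN_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def residual_flags(text, map_entities, allowlist=frozenset()):
    """Return a list of warnings. An empty list means nothing was flagged."""
    flags = []
    lowered = text.lower()
    # 1. Any map entity that survived the earlier passes.
    for entity in map_entities:
        if entity.lower() in lowered:
            flags.append(f"surviving map entity: {entity}")
    # 2. Any remaining dollar figure.
    if DOLLAR_RE.search(text):
        flags.append("dollar figure present")
    # 3. Multi-word proper nouns that are not explicitly allowed.
    for match in PROPER_NOUN_RE.finditer(text):
        phrase = match.group(0)
        if phrase not in allowlist:
            flags.append(f"unknown proper noun: {phrase}")
    return flags


for flag in residual_flags("Northwind Traders paid $900.", ["Acme Outfitters"]):
    print(flag)
```

The function returns warnings instead of raising, so a flagged post still moves forward to human review.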

The raw body is always kept attached alongside the scrubbed version. We transform for publish; we never lose the truth.

Result: three posts I knew were leaky came out with zero surviving entities.

The store: lab_queue.json is a single mutable file. Two concurrent writes = data loss. Mid-write crash = data loss. I wasn't willing to lose posts to a storage bug.

So the new store is a private git repo at ~/Developer/lab-posts/. Each post gets its own posts/<id>.json — never deleted, full status_history. Atomic file write is the durability boundary. Git commits are best-effort batched backups, not the guarantee.
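One common way to get an atomic per-post write is to write a temp file in the same directory, fsync it, then rename it over the target. The sketch below assumes that approach. The write_post and set_status names and the shape of each status_history entry are hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_post(posts_dir: Path, post: dict) -> Path:
    """Atomically write posts/<id>.json: temp file, fsync, then rename."""
    posts_dir.mkdir(parents=True, exist_ok=True)
    target = posts_dir / f"{post['id']}.json"
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=posts_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(post, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def set_status(posts_dir: Path, post_id: str, status: str) -> dict:
    """Append to status_history instead of overwriting the previous status."""
    path = posts_dir / f"{post_id}.json"
    post = json.loads(path.read_text(encoding="utf-8"))
    post["status"] = status
    post.setdefault("status_history", []).append(
        {"status": status, "at": datetime.now(timezone.utc).isoformat()}
    )
    write_post(posts_dir, post)
    return post
```

A crash mid-write leaves at worst a stray .tmp file, never a truncated post. Two writers racing on the same post can still overwrite each other's status change, but with one file per post they can no longer take out an unrelated post.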

New scrub_map entities default to approved:false — Apollo flags them for my review before the descriptor ever goes live.
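The approval gate might look something like the following, assuming each map entry is an object with a descriptor and an approved field. That entry shape and all three function names are guesses made for illustration.

```python
def add_entity(scrub_map: dict, entity: str, descriptor: str) -> None:
    """Register a new entity as unapproved; existing entries are left alone."""
    scrub_map.setdefault(entity, {"descriptor": descriptor, "approved": False})


def pending_review(scrub_map: dict) -> list[str]:
    """Entities still waiting for a human to approve their descriptor."""
    return sorted(name for name, entry in scrub_map.items() if not entry["approved"])


def live_descriptors(scrub_map: dict) -> dict[str, str]:
    """Only approved entries are allowed to supply a published descriptor."""
    return {
        name: entry["descriptor"]
        for name, entry in scrub_map.items()
        if entry["approved"]
    }


# Hypothetical entries for illustration only.
scrub_map = {"Acme Outfitters": {"descriptor": "an ecommerce business", "approved": True}}
add_entity(scrub_map, "Northwind Traders", "a logistics client")
print(pending_review(scrub_map))
```

This only covers the gate itself. How an unapproved entity is treated in the meantime (for instance, holding the post) is a separate decision that the sketch leaves open.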

Why This Matters

I kept treating "the pipeline is unsafe" as one problem. It's two:

Scrub failure → publish content that identifies real people or businesses
Store failure → lose content that was never wrong to begin with

Once I separated them, I could pick the right fix for each. Precise replacement on the scrub side. Durable per-post files on the store side. Neither solution is "simpler" than what came before — but each one is right for the failure mode it's solving.

If you're building a pipeline that handles sensitive raw content: don't redact, don't auto-publish, and don't trust a single mutable file as your queue.

P.S. The hardest part was resisting the urge to merge the two fixes into one system. A "smart" pipeline that scrubs AND manages durability AND auto-publishes is just three failure modes waiting to surprise you at 2am.