Why Content Fingerprinting Beat My Subject-Hash Dedup (By a Mile)

TLDR: Never build a dedup key out of fields your model generates. It'll regenerate them differently every run. Hash the raw content — the one thing that doesn't change.

The Setup

Apollo is my AI scanner/assistant. It turns emails and Slack messages into tasks in Things 3, my task manager, and it runs hourly.

Every scan it reads my inbox, classifies what matters, and creates tasks.

The whole thing is great — until it isn't.

The First Wall

On April 14th, a single Vercel deploy failure on a webinar game sent me 5 emails.

Apollo saw 5 emails. Apollo made 6 tasks.

Okay. I added generate_subject_dedup_id() — normalize the subject line, strip out hex IDs and dates and noise, hash source:business:sender:normalized_subject. Smart enough, I thought.
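For illustration, here is a minimal sketch of what a subject hash like that might look like. The regex patterns, the choice of SHA-256, and the example sender address are assumptions for this sketch, not Apollo's actual code; only the function name and the source:business:sender:normalized_subject key layout come from the description above.

```python
import hashlib
import re

# Hypothetical noise patterns; the real list is not shown in this post.
_ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
_HEX_ID = re.compile(r"\b[0-9a-f]{7,}\b", re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def normalize_subject(subject: str) -> str:
    """Lowercase the subject, strip ISO dates and long hex IDs, collapse whitespace."""
    text = subject.lower()
    text = _ISO_DATE.sub("", text)
    text = _HEX_ID.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def generate_subject_dedup_id(source: str, business: str, sender: str, subject: str) -> str:
    """Hash source:business:sender:normalized_subject into a dedup key."""
    key = f"{source}:{business}:{sender}:{normalize_subject(subject)}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Two alert emails that differ only in a deploy hash collapse to one key.
first = generate_subject_dedup_id(
    "gmail", "webinar", "alerts@example.com",
    "Deployment failed: webinar-game (a1b2c3d4e) 2026-04-14",
)
second = generate_subject_dedup_id(
    "gmail", "webinar", "alerts@example.com",
    "Deployment failed: webinar-game (ff00aa11b) 2026-04-14",
)
print(first == second)
```

Note that every argument except source is something the model hands back, so the key is only stable if the model is.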

Shipped it. Moved on.

The Second Wall (the one that mattered)

April 22nd. One Slack message from a client. Five Things tasks across five hourly scans.

Same message. Five tasks. I stared at this for a minute.

Here's what was happening. The subject-hash approach relied on the LLM to produce a stable raw_id for each message — and Sonnet was generating a different synthetic ID for the same Slack message every single scan:

U06XXXXXXX-[cancer-education-client]-2026-04-22T13:02:00Z
[ecommerce-client]_slack_[cancer-education-client]_U06XXXXXXX_Apr22
slack_[ecommerce-client]_[cancer-education-client]_20260422T130200Z

Three different IDs. One message. Real Slack IDs are just {channel}:{ts} — none of those are real.

And the subject itself? Also regenerated per scan. Different wording each time, so a different normalized subject to hash. The business tag was flipping too: this client got tagged [cancer education] on one scan and [ecommerce] on the next, because she touches both companies.

My hash keyed on all the wrong things. Every field I was hashing was produced by the model, fresh, each run.

The Fix That Worked

I replaced generate_subject_dedup_id with generate_content_dedup_id(source, sender, original_content).

That's it.

Hash the first 500 characters of the raw message body — the thing Sonnet observes, not the thing it synthesizes. Business tag excluded. Subject excluded. The content doesn't change between scans even if everything else does.
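A minimal sketch of that function, under a few assumptions: SHA-256 as the hash, a leading/trailing whitespace strip before truncating, and a colon-joined key. Only the signature and the 500-character cutoff come from the description above.

```python
import hashlib

CONTENT_PREFIX_CHARS = 500  # cutoff mentioned in the post


def generate_content_dedup_id(source: str, sender: str, original_content: str) -> str:
    """Fingerprint a message by what was observed: source, sender, raw body prefix."""
    # Assumption: strip surrounding whitespace so a stray trailing newline
    # does not produce a different key for the same message.
    body = original_content.strip()[:CONTENT_PREFIX_CHARS]
    key = f"{source}:{sender}:{body}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same raw message yields the same key on every scan, whatever subject
# or business tag the model invents, because neither is an input.
scan_1 = generate_content_dedup_id("slack", "U06XXXXXXX", "Can we move the call to Thursday?")
scan_2 = generate_content_dedup_id("slack", "U06XXXXXXX", "Can we move the call to Thursday?\n")
print(scan_1 == scan_2)
```

One tradeoff of the prefix: two messages from the same sender that match for their first 500 characters get the same key, even if they differ later on.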

Then I tightened the collector prompt: raw_id must match real MCP format for the source ({channel}:{ts} for Slack, thread/msg ID for Gmail). If Sonnet doesn't know it, return an empty string — never synthesize.
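A prompt rule is easier to trust with a code-side check behind it. The sketch below is one way that could look, and it is an assumption, not something the post says Apollo does: the function name sanitize_raw_id and both regexes are hypothetical. The Slack pattern assumes a channel ID starting with C, D, or G followed by a ts like 1776862920.000100; the Gmail pattern assumes a lowercase hex ID.

```python
import re

# Assumed ID shapes per source; adjust to whatever your MCP servers really return.
_RAW_ID_PATTERNS = {
    "slack": re.compile(r"[CDG][A-Z0-9]{6,}:\d{10}\.\d{6}"),  # {channel}:{ts}
    "gmail": re.compile(r"[0-9a-f]{10,}"),                    # thread/msg ID
}


def sanitize_raw_id(source: str, raw_id: str) -> str:
    """Return raw_id only if it matches the expected format for source, else ''."""
    pattern = _RAW_ID_PATTERNS.get(source)
    candidate = raw_id.strip()
    if pattern is not None and pattern.fullmatch(candidate):
        return candidate
    # Unknown source or synthesized-looking ID: treat it as missing.
    return ""


# A made-up but well-formed Slack ID passes; a synthesized one is dropped.
print(repr(sanitize_raw_id("slack", "C0123456789:1776862920.000100")))
print(repr(sanitize_raw_id("slack", "U06XXXXXXX-[cancer-education-client]-2026-04-22T13:02:00Z")))
```

With a check like this, a synthesized ID degrades to the empty string and the content fingerprint does the deduplication instead.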

Six new tests in state.py to cover same-content collision and business-tag independence.

One message → one task. Problem closed.

Why This Matters

The dedup key is only as stable as its least stable input.

If any field in your hash comes from a model that regenerates it each run (a subject, a tag, a synthesized ID), your key isn't stable. It's noise wearing a disguise.

Fingerprint what's observed. Never hash what's generated.

P.S. If you're building agents that write to any stateful system (task managers, CRMs, databases), this is the class of bug that'll find you eventually. Build your dedup on the immutable source. Everything else is borrowed entropy.