TLDR: If your sync watermark advances before a batch fully commits, you have a silent data-loss trap. Freeze it until the whole batch is done.
the setup
I'm building a supply chain platform for a health supplement ecommerce business — a scheduled cron that pulls orders, subscriptions, and inventory from Shopify and Recharge (their subscription billing platform) into Supabase (our Postgres backend).
Standard pattern: sync runs, writes a watermark of MAX(shopify_updated_at), next run queries updated_at > watermark.
Seemed fine.
It wasn't.
the bug we didn't see
What happens when a batch partially fails?
Say the cron fetches 40 Shopify orders. Writes succeed for 37 of them. Three hit a transient error. The old code still advanced the watermark to MAX(updated_at) of the full batch — including the three that never made it into the database.
Next run starts there. Those three records now sit behind the watermark — Shopify never sends them again unless someone touches them. They're gone. No error. No alert. Just… gone.
That's the trap. A watermark that lies about completeness is indistinguishable from a healthy one.
the fix: SafeWatermark
I pulled the watermark logic into its own module — src/lib/sync-watermark.ts — with one rule: the watermark only advances when a batch fully completes.
A partial failure doesn't crash the cron and doesn't advance the bookmark. It surfaces as HTTP 207 (Multi-Status — meaning "partial success, check the body") instead of a lying 200 OK. Bounded retry with backoff (src/lib/http-retry.ts, max 5 attempts, jitter) handles transient Shopify/Recharge errors before we even reach the failure path.
I pushed the same principle down to the snapshot and refund layers — migration 042 added a partial unique index on source_ref_id=YYYYMMDD so FIFO depletion is idempotent on cron retry, and migration 041 added UNIQUE NULLS NOT DISTINCT (store, shopify_refund_id, line_item_id) so refund upserts dedupe cleanly.
Idempotent skip-if-exists makes reruns cheap. That's the whole deal.
the twist I didn't expect
After shipping, I wrote a detailed data-repair runbook assuming the old code had corrupted production.
Step 1 of the runbook: diagnose the actual damage before touching anything.
Result? Two of three supposed "sync bugs" had caused zero data damage. The FIFO lot subsystem had never been populated — nothing to corrupt. The watermark? SafeWatermark was already holding. The repair for those two was: close the runbook.
I braced for a disaster that the fix had already prevented.
why this matters to me
Every system I build has some version of a watermark — a "we got this far" bookmark. The lesson I can't unlearn: a checkpoint is only meaningful if it moves AFTER the work succeeds. Move it before, and you've traded real data integrity for the appearance of it.
The silent failure is the dangerous one. Make the partial failure loud.
P.S. If your cron returns
200 OKon a partial batch failure, that's the bug — not the batch.