TLDR: Transient API failures are guaranteed at scale — it's just math. Retry logic isn't a nice-to-have. And don't advance your watermark on a failed batch.

the thing that kept biting me

I'm not going to pretend I learned this cleanly.

I learned it the same way I learn most things — by shipping something, watching it silently fail in production, and then going back and doing the work I should have done first.

how it showed up in the LLM pipeline

We built a batch pipeline during a Shopify storefront build — running structured LLM calls through the Claude API, 12 blocks, 1-3 attempts each.

That's up to 36 API calls per run.

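At a 1-3% transient failure rate per call, the math is blunt: with 36 calls, somewhere between 30% and 67% of runs hit at least one network flake (assuming failures are independent). Run the pipeline a handful of times and you're basically guaranteed to see one.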
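If you want to check the arithmetic yourself, here's a few lines of Python. It assumes every call fails independently at a fixed rate, which is a simplification, and the function name is just mine for illustration.

```python
def p_at_least_one_failure(p: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls fails."""
    return 1.0 - (1.0 - p) ** calls


for rate in (0.01, 0.03):
    per_run = p_at_least_one_failure(rate, 36)           # one run, 36 calls
    per_ten_runs = p_at_least_one_failure(rate, 36 * 10)  # ten runs back to back
    print(f"{rate:.0%} per call: {per_run:.0%} per run, {per_ten_runs:.1%} over ten runs")
```

The per-run number is uncomfortable; the ten-run number is why "it only fails sometimes" never stays true for long.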

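The Anthropic SDK does retry connection errors and timeouts on its own, but by default only twice, with a short backoff. Once those are used up it raises APIConnectionError or APITimeoutError, and nothing in my code caught them. I didn't know that. So when the network hiccupped mid-batch, the whole pipeline just… stopped. Not on every run. Enough runs that I couldn't ignore it.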

The fix ended up being embarrassingly simple:

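import time
from anthropic import APIConnectionError, APITimeoutError

last_exc = None
for attempt in range(3):
    try:
        msg = client.messages.create(..., timeout=180.0)
        break  # success: stop retrying
    except (APIConnectionError, APITimeoutError) as e:
        last_exc = e
        if attempt < 2:
            time.sleep(2 ** attempt)  # wait 1s, then 2s
else:
    raise last_exc  # all three attempts failed: fail loudly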

Three attempts. Exponential sleep. That's it.

I start every new pipeline with this now. Before I write the happy path.

the scarier version — the silent one

The LLM crash was at least visible.

A supply chain sync for an ecommerce business had a worse problem: it failed silently.

We were pulling from Shopify and Recharge (a subscription billing API) on a cron, reconciling, and writing to Supabase. The sync would hit a batch error mid-run and… keep going. The watermark would advance, marking work as "done", even though that batch's records had been quietly orphaned.

No alarm. No 207. No indication anything was wrong. Just data that never made it.

The fix was the SafeWatermark pattern: the watermark ONLY advances when the batch fully completes. A failure surfaces as a 207 partial, the watermark freezes, and the next run picks up from exactly where the last one actually succeeded.
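Here's a minimal Python sketch of the idea, not the production code. SyncResult, run_sync, and write_batch are hypothetical names, and I'm assuming batches arrive in order, each tagged with the position (a timestamp or cursor) where it ends.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class SyncResult:
    status: int      # 200 if every batch landed, 207 if the run stopped early
    watermark: int   # last position known to be fully synced
    errors: List[str] = field(default_factory=list)


def run_sync(
    batches: Iterable[Tuple[int, list]],
    write_batch: Callable[[list], None],
    watermark: int,
) -> SyncResult:
    """Process ordered batches; advance the watermark only after a batch fully succeeds."""
    for end_position, records in batches:
        try:
            write_batch(records)
        except Exception as exc:  # deliberately broad: any failure freezes the watermark
            return SyncResult(207, watermark, [f"batch ending at {end_position}: {exc}"])
        watermark = end_position  # reached only when write_batch did not raise
    return SyncResult(200, watermark)
```

The important part is the ordering: the assignment to watermark sits after the write, so there is no code path where a failed batch moves it. The next run starts from result.watermark and retries the batch that failed.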

We also pulled out a shared http-retry.ts (max 5 attempts, bounded backoff with jitter, handling 429s and 5xx responses) because Recharge rate-limits with a leaky bucket that holds 40 requests and drains at 2 req/s, and it WILL 429 you eventually under load. Better to anticipate it than discover it.
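The real helper is TypeScript. To keep the examples in one language, here's a rough Python sketch of the same shape. The send callable, the 0.5s base, and the 8s cap are assumptions for illustration, not the values from http-retry.ts.

```python
import random
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, body): a stand-in for a real HTTP response


def is_retryable(status: int) -> bool:
    """Retry rate limits (429) and server errors (5xx); everything else is final."""
    return status == 429 or 500 <= status < 600


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def request_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() up to max_attempts times, backing off between retryable failures."""
    response = send()
    for attempt in range(max_attempts - 1):
        if not is_retryable(response[0]):
            break
        sleep(backoff_delay(attempt))
        response = send()
    return response  # either a final answer or the last retryable failure
```

The cap keeps the worst-case wait bounded, and the jitter keeps parallel workers from all retrying against the same bucket at the same instant. The caller still has to check the returned status: after five attempts a 429 is returned, not hidden.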

same lesson, third time

A few weeks later, a daily briefing cron started failing on busy mornings.

Same thing. The daemon had a fixed single retry. On mornings where the model was slow to warm or the network was sluggish, one retry wasn't enough.

Rebuilt it as a 4-attempt loop with BACKOFFS=(0 30 90 180), the seconds to wait before each attempt, and escalating kill timeouts (300, 300, 600, 600). Later attempts get more headroom because late failures usually mean load, not a blip.
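The daemon itself is a shell script. A Python sketch of the same schedule looks roughly like this; run_with_escalation and the example command are hypothetical, and only the two tuples of numbers come from the real thing.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence

BACKOFFS = (0, 30, 90, 180)      # seconds to wait before each attempt
TIMEOUTS = (300, 300, 600, 600)  # seconds before each attempt is killed


def run_with_escalation(
    cmd: Sequence[str],
    backoffs: Sequence[float] = BACKOFFS,
    timeouts: Sequence[float] = TIMEOUTS,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run cmd once per (backoff, timeout) pair; return True on the first clean exit."""
    for wait, limit in zip(backoffs, timeouts):
        sleep(wait)
        try:
            result = subprocess.run(cmd, timeout=limit)
        except subprocess.TimeoutExpired:
            continue  # subprocess.run has already killed the child
        if result.returncode == 0:
            return True
    return False  # every attempt failed or timed out


if __name__ == "__main__":
    # Placeholder command; swap in the real job.
    ok = run_with_escalation([sys.executable, "-c", "print('briefing sent')"])
    print("ok" if ok else "gave up after 4 attempts")
```

Pairing each wait with its own timeout is the whole trick: the schedule lives in two tuples, so tuning it later means editing data, not control flow.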

It's been solid since.

why this keeps mattering to me

Every time I build something that touches a third-party API, I have two choices: add retry logic upfront or rediscover why it's necessary in production.

The second option is more expensive. Always.

Bound your retries. Back off exponentially. Add jitter under rate limits. And whatever you do — don't advance the watermark on a failed batch.

That last one is the one I almost kept skipping.