TLDR: At batch LLM scale, a 1–3% flake rate isn't rare — it's guaranteed. Wrap every client.messages.create() call in a retry loop, and strip empty defaults before you trust structured output.

The Setup

I've been building a Shopify storefront cloning skill — give it a URL, it clones the storefront as bespoke Shopify Liquid (Shopify's templating language) and drops an unpublished preview theme straight into the store.

v0.1 used a fixed pattern library: hero, buy box, FAQ, generic fallback.

It looked completely flat and unstyled.

So I moved to Option B: the LLM generates custom Liquid per block, for real. First real clone was pulled from a wellness ecommerce brand's live site, and landed as a preview theme in that brand's store.

And then it kept breaking.

What I Chased First (the wrong stuff)

Parsing raw JSON out of LLM responses is brittle, so I swapped to tool_use (Claude's tool-calling interface: the model returns arguments for a tool you define against a JSON schema, instead of you hoping it wraps JSON correctly). That helped.

A max_tokens of 4096 was truncating generated sections mid-block, so I bumped it to 8192. That helped too.
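Roughly what those two changes look like together, as a minimal sketch. The tool name emit_section, its fields, and the extract_section helper are hypothetical and not taken from the actual pipeline. The fake response object is only a stand-in so the helper can run without an API call.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Hypothetical tool definition; the name and fields are illustrative only.
SECTION_TOOL = {
    "name": "emit_section",
    "description": "Return one generated Liquid section.",
    "input_schema": {
        "type": "object",
        "properties": {
            "liquid": {"type": "string"},
            "settings": {"type": "object"},
        },
        "required": ["liquid"],
    },
}


def extract_section(msg: Any, tool_name: str = "emit_section") -> Dict[str, Any]:
    """Return the tool arguments from a response, refusing truncated output."""
    # A response that stopped on the token limit is likely cut off mid-block.
    if getattr(msg, "stop_reason", None) == "max_tokens":
        raise ValueError("response hit max_tokens; output is likely truncated")
    for block in msg.content:
        if getattr(block, "type", None) == "tool_use" and block.name == tool_name:
            return block.input
    raise ValueError(f"no tool_use block for {tool_name!r} in response")


# Stand-in for an SDK response object, used here only to exercise the helper.
fake_msg = SimpleNamespace(
    stop_reason="tool_use",
    content=[
        SimpleNamespace(
            type="tool_use",
            name="emit_section",
            input={"liquid": "<div>{{ section.settings.heading }}</div>"},
        )
    ],
)
print(extract_section(fake_msg)["liquid"])
```

In a real call, the definition would be passed as tools=[SECTION_TOOL], with tool_choice={"type": "tool", "name": "emit_section"} to force that tool.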

But the pipeline was still dying mid-run, and the generated sections were full of empty noise.

I'll be honest: I paid the cost of discovering these in sequence instead of front-loading them.

The Real Fix: Retry Math

Here's the number that changed how I think.

A transient APIConnectionError or APITimeoutError hits ~1–3% of calls. Doesn't sound like much.

But this clone pipeline runs 12 blocks × 1–3 attempts each — that's up to 36 LLM calls per run.

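At 2% across 36 calls, the chance of at least one flake is 1 - 0.98^36, or about 52% per run. Over five runs that climbs to roughly 97%. Not sometimes. Run the pipeline a handful of times and a flake is all but guaranteed.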
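The arithmetic is quick to check. This sketch assumes each call fails independently at a fixed rate, which is a simplification since real network failures tend to cluster. The function name is mine.

```python
def p_at_least_one_failure(per_call_rate: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls fails."""
    return 1.0 - (1.0 - per_call_rate) ** calls


# Sweep the 1-3% flake rates over the best case (12 calls) and worst case (36 calls).
for rate in (0.01, 0.02, 0.03):
    for calls in (12, 36):
        p = p_at_least_one_failure(rate, calls)
        print(f"rate={rate:.0%} calls={calls}: {p:.1%} chance of at least one flake")
```

At 2% and 36 calls this comes out a little over 50% per run. Treating five runs as 180 calls pushes it past 97%.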

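The Anthropic SDK does retry connection errors on its own (max_retries defaults to 2), but once those attempts are used up the APIConnectionError still propagates. Nothing in my pipeline caught it, so one dropped connection crashed the whole run with 11 of 12 sections already generated.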

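The fix was a 3-attempt loop with exponential backoff around every client.messages.create():

import time
from anthropic import APIConnectionError, APITimeoutError

msg = None       # stays None if every attempt fails
last_exc = None
for network_attempt in range(3):
    try:
        msg = client.messages.create(..., timeout=180.0)
        break
    except (APIConnectionError, APITimeoutError) as e:
        last_exc = e
        if network_attempt < 2:
            # Back off 1s, then 2s; no point sleeping after the final attempt.
            time.sleep(2 ** network_attempt)
if msg is None:
    attempt_errors.append(f"Network error after 3 attempts: {last_exc}")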

Simple. A few lines around the call. Done.
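If the same loop ends up around many call sites, it can be pulled into a helper. This is a sketch only: call_with_retry and its parameters are hypothetical names, and it is deliberately SDK-agnostic, so the exception types are passed in and the sleep function can be swapped out in tests.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff."""
    last_exc: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on as exc:
            last_exc = exc
            # Sleep 1x, 2x, 4x... the base delay, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Network error after {attempts} attempts") from last_exc
```

In the pipeline, fn would be a lambda wrapping the client.messages.create() call and retry_on would be (APIConnectionError, APITimeoutError). Raising at the end leaves the caller to decide whether to record the error and move to the next block or abort the run.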

The Other Silent Killer: Empty Defaults

tool_use forces the model to fill every field in your schema.

Which sounds great — until you realise it means the model fabricates plausible defaults for fields it has nothing to say about, rather than admitting they're empty.

Those empty-looking-but-populated schema settings were slipping through into sections and producing junk Liquid.

The fix: auto-strip any field that matches the known empty-default pattern before you write the section out. Don't trust "it returned something" — verify it returned something real.
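A minimal sketch of such a strip pass. The actual "known empty-default pattern" isn't spelled out here, so the placeholder strings below are assumptions to swap for whatever your model really emits. Both function names are hypothetical.

```python
from typing import Any, Dict

# Assumed placeholder values; replace with the patterns seen in your own outputs.
EMPTY_STRINGS = {"", "n/a", "none", "null", "tbd"}


def is_empty_default(value: Any) -> bool:
    """True for values that are populated in form but carry no content."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in EMPTY_STRINGS
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False  # numbers and booleans are kept, including 0 and False


def strip_empty_defaults(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of settings with empty-default fields removed, recursively."""
    cleaned: Dict[str, Any] = {}
    for key, value in settings.items():
        if isinstance(value, dict):
            value = strip_empty_defaults(value)
        if not is_empty_default(value):
            cleaned[key] = value
    return cleaned
```

Keeping 0 and False is a deliberate choice in this sketch: a column count of 0 or a disabled toggle may be a real setting, not noise.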

12/12

After those two fixes landed together — network retry with backoff, auto-strip on empty defaults — every section in the batch came out clean.

12/12. First time.

Why This Matters to Me

Rare × many = certain. That's the lesson.

I'd built plenty of scripts that call an LLM once or twice and handled errors with a shrug. The moment you're in a batch pipeline — even a small one — the math flips on you. A 2% flake isn't edge-case anymore. It's scheduled.

Any new script I write with client.messages.create() for structured output now ships with tool_use, max_tokens=8192, and the retry loop. Not when it breaks. From the start.

P.S. The strip step is worth building as a separate pass, not inline during generation — it makes the pipeline easier to debug and lets you re-run just the strip without re-calling the LLM.
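One way to structure that separate pass, as a sketch: assume each raw tool output was saved as a JSON file per section (a layout I'm inventing for illustration), then re-run the strip over those files. restrip_saved_sections and its arguments are hypothetical names. The strip function is passed in so the pass stays independent of whatever rules it applies.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def restrip_saved_sections(
    raw_dir: Path,
    out_dir: Path,
    strip_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> int:
    """Re-run the strip pass over saved raw outputs; return the number of files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for raw_path in sorted(raw_dir.glob("*.json")):
        raw = json.loads(raw_path.read_text(encoding="utf-8"))
        cleaned = strip_fn(raw)
        # Write under the same file name so raw and cleaned outputs line up for diffing.
        (out_dir / raw_path.name).write_text(
            json.dumps(cleaned, indent=2), encoding="utf-8"
        )
        count += 1
    return count
```

Because the raw files are never modified, the strip rules can be changed and re-applied as often as needed without another LLM call.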