The Silent Truncation: What `stop_reason` Taught Me About LLM Token Limits

TL;DR: max_tokens=4096 silently truncates complex structured outputs. The model calls the tool, you get an HTTP 200, and the field is just… empty. Log stop_reason. Always.

The Setup

I was building a Shopify template generator skill: it asks Claude to produce a full Shopify Liquid template (Liquid is Shopify's templating language for storefronts), complete with inline CSS, a schema block, and metadata, all returned as structured tool output in one shot.

It worked great on simple templates.

Then I pushed it toward anything page-level — a full hero section, a featured-products grid — and it started silently breaking.

The Wall

The tool call was succeeding.

HTTP 200. No exception thrown. The tool_use block came back just fine.

But the liquid field? Empty string.

So naturally I went looking in my own code. Prompt phrasing wrong? Schema definition off? Maybe the model was returning something I wasn't parsing correctly?

I spent time in the wrong layer entirely — debugging my code while the problem was happening inside the model, before my code ever ran.

What Was Actually Happening

The model was hitting max_tokens=4096 and stopping mid-generation.

Not with an error. Not with a warning. It just… stopped. Filed the truncated tool call anyway. Handed me an empty field and moved on.

The clue was msg.stop_reason.

When generation completes normally you get end_turn, or tool_use when the model finishes by calling a tool. When the model runs out of room you get max_tokens instead, and the output is whatever managed to fit. In my case, a dense template (2KB of Liquid + 1KB of inline CSS + schema block) was pushing past 4096 output tokens and the liquid field was the last thing being written. It got cut to nothing.
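
To make that concrete, here is a minimal sketch of reading the field. It assumes the response shape of the Anthropic Python SDK (an object with a stop_reason attribute). describe_stop_reason is a hypothetical helper, the table is not an exhaustive list of stop reasons, and the fake message stands in for a real API response.

```python
from types import SimpleNamespace

# Stop reasons relevant to this post (not an exhaustive list).
STOP_REASONS = {
    "end_turn": "model finished its turn normally",
    "tool_use": "model finished by calling a tool",
    "stop_sequence": "model emitted one of your stop sequences",
    "max_tokens": "model hit the max_tokens ceiling; output is truncated",
}


def describe_stop_reason(msg) -> str:
    """Return a human-readable note for msg.stop_reason (hypothetical helper)."""
    reason = msg.stop_reason
    return STOP_REASONS.get(reason, f"unrecognized stop_reason: {reason!r}")


# Stand-in for a real response object; only the attribute we read is faked.
fake_msg = SimpleNamespace(stop_reason="max_tokens")
print(describe_stop_reason(fake_msg))
```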

The model wasn't confused by my prompt. It was out of runway.

The Fix That Actually Worked

Two things, in order:

1. Bump max_tokens to 8192 by default for any template-generating task. Go to 16384 if you're approaching page-level output, provided the model you're calling allows that much output (the ceiling varies by model). The cost difference is negligible compared to the debugging time you save.
2. Log stop_reason on every LLM call, especially in error paths. If stop_reason is max_tokens (or anything other than end_turn or tool_use), that's your signal: not an empty field, not a parse error, not a prompt problem.
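
The two steps above can be sketched roughly as follows. This is an illustration, not the code from my project: pick_max_tokens, log_stop_reason, and the task names are hypothetical, the ceilings are just the numbers from this post, and the fake message mimics the SDK's stop_reason and usage.output_tokens attributes.

```python
import logging
from types import SimpleNamespace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_calls")

# Hypothetical per-task ceilings, using the numbers from this post.
MAX_TOKENS_BY_TASK = {"template": 8192, "page": 16384}


def pick_max_tokens(task: str) -> int:
    """Choose an output ceiling for a task, falling back to 8192."""
    return MAX_TOKENS_BY_TASK.get(task, 8192)


def log_stop_reason(msg, max_tokens: int) -> bool:
    """Log stop_reason for every call; return True if the call completed cleanly."""
    ok = msg.stop_reason in ("end_turn", "tool_use")
    logger.log(
        logging.INFO if ok else logging.WARNING,
        "stop_reason=%s output_tokens=%s max_tokens=%s",
        msg.stop_reason,
        msg.usage.output_tokens,
        max_tokens,
    )
    return ok


# Stand-in for a truncated response: output_tokens equals the ceiling.
budget = pick_max_tokens("page")
fake_msg = SimpleNamespace(
    stop_reason="max_tokens",
    usage=SimpleNamespace(output_tokens=budget),
)
completed = log_stop_reason(fake_msg, budget)
```

Logging output_tokens next to max_tokens makes the truncation obvious at a glance: when the two numbers match, the model ran out of room.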

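While I was in there I also switched from asking the model to return raw JSON in its text reply to using tool_use (Claude's structured tool-calling API). The difference: with raw JSON, I had to dig the object out of free text and recover from malformed output myself. With tool_use, the model fills in arguments against the input_schema I declare, and the API hands them back as an already-parsed object. That isn't a guarantee: the API doesn't retry for you when a field comes back missing or empty, so validation is still your job. But it made the whole pipeline meaningfully more reliable.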
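
For reference, a tool definition and the matching extraction step might look like the sketch below. The name/description/input_schema layout and the tool_use content block (type, name, input) follow the Anthropic Messages API. The tool name emit_template, the css field, and extract_tool_input are hypothetical, and the fake message stands in for a real response.

```python
from types import SimpleNamespace

# Hypothetical tool definition; "liquid" is the field from this post,
# the tool name and the "css" field are made up for illustration.
TEMPLATE_TOOL = {
    "name": "emit_template",
    "description": "Return a generated Shopify Liquid template.",
    "input_schema": {
        "type": "object",
        "properties": {
            "liquid": {"type": "string"},
            "css": {"type": "string"},
        },
        "required": ["liquid", "css"],
    },
}
# In a real call this would be passed as tools=[TEMPLATE_TOOL], optionally with
# tool_choice={"type": "tool", "name": "emit_template"} to force the tool.


def extract_tool_input(msg, tool_name: str) -> dict:
    """Return the parsed arguments of the first matching tool_use block."""
    for block in msg.content:
        if block.type == "tool_use" and block.name == tool_name:
            return block.input
    raise ValueError(f"no tool_use block for {tool_name!r}")


# Stand-in response: a text block followed by a tool_use block.
fake_msg = SimpleNamespace(
    content=[
        SimpleNamespace(type="text", text="Here is the template."),
        SimpleNamespace(
            type="tool_use",
            name="emit_template",
            input={"liquid": "<h1>{{ section.settings.title }}</h1>", "css": "h1{margin:0}"},
        ),
    ]
)
fields = extract_tool_input(fake_msg, "emit_template")
```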

The Callback That Confirmed It

A few days later I was building a field mapping script — an LLM-driven Airtable-to-Supabase migration tool for a law firm client's move to a new practice management system.

The script was asking Claude to produce a dense JSON mapping across a wide schema. It was truncating.

This time I knew immediately. Bumped max_tokens to 16000, done. No debugging spiral. The commit message even says it: fix: bump max_tokens to 16000 on the mapping script.

That's the whole value of learning a pattern. The second time costs you nothing.

Why This Matters to Me

I've made it a global rule in my own tooling: HTTP 200 ≠ success. You have to verify that stop_reason is end_turn (or tool_use, when you expect a tool call) AND that the content field is non-empty before you trust a result.
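
As a sketch, that rule fits in one small gate function. verify_result and UntrustedResult are hypothetical names, the default required field is the liquid field from this post, and the fake messages only mimic the stop_reason attribute of a real response.

```python
from types import SimpleNamespace


class UntrustedResult(RuntimeError):
    """Raised when a 200 response should not be treated as a success."""


def verify_result(msg, fields: dict, required=("liquid",)) -> dict:
    """Apply the 'HTTP 200 is not success' rule before using a result."""
    if msg.stop_reason == "max_tokens":
        raise UntrustedResult("output truncated: raise max_tokens and retry")
    if msg.stop_reason not in ("end_turn", "tool_use"):
        raise UntrustedResult(f"unexpected stop_reason: {msg.stop_reason!r}")
    # A missing key and an empty string are treated the same way.
    empty = [name for name in required if not fields.get(name)]
    if empty:
        raise UntrustedResult(f"empty required fields: {empty}")
    return fields


# A clean tool call passes through unchanged.
ok_msg = SimpleNamespace(stop_reason="tool_use")
checked = verify_result(ok_msg, {"liquid": "<div>{{ product.title }}</div>"})
```

Checking stop_reason first means a truncated call is reported as truncation rather than as an empty field, which is the misdiagnosis that cost me the time in the first place.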

Silent failures are the nastiest kind. They look like your bug when they're actually the model's constraints. The fix — logging stop_reason, setting a real max_tokens ceiling — takes ten minutes. The debugging spiral it prevents can eat a morning.

P.S. A max_tokens of 4096 is a common default in sample code and wrappers (the Anthropic API itself makes you set the value explicitly), and it made sense when outputs were short. Once you're asking models to generate real structured content (templates, mappings, anything with nested schema), that default is a trap. Treat 8192 as your new floor.