TLDR: HTTP 200 from an AI provider is NOT proof the call succeeded. You have to verify stop_reason == end_turn AND non-empty content, every time. Providers fail in completely different ways, and the silent ones will wreck you.

the setup

I run two code-delegate sub-agents inside Apollo (my AI assistant): /minimax, which delegates to MiniMax-M2.7, and /zai, which delegates to GLM-5.1 via Z.AI.

Both ride the same harness — the local claude CLI binary with ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and the model env vars swapped out. Apollo orchestrates. The delegate codes. Apollo reviews.
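
A minimal sketch of what that env swap could look like in Python. The function name, the placeholder values, and the use of ANTHROPIC_MODEL as the model variable are assumptions for illustration, not Apollo's actual config.

```python
import os
from typing import Mapping, Optional


def delegate_env(
    base_url: str,
    auth_token: str,
    model: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Copy the environment and swap in one provider's settings."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_AUTH_TOKEN"] = auth_token
    # Assumption: ANTHROPIC_MODEL is the model variable. The post only says
    # "the model env vars", so adjust to whatever your CLI version reads.
    env["ANTHROPIC_MODEL"] = model
    return env


# Hypothetical placeholder values, not real endpoints, tokens, or model IDs.
zai_env = delegate_env(
    base_url="https://provider.example.invalid/anthropic",
    auth_token="PLACEHOLDER_TOKEN",
    model="PLACEHOLDER_MODEL_ID",
)

# The CLI would then be launched with this environment, for example:
# subprocess.run(["claude", "-p", prompt], env=zai_env, capture_output=True, text=True)
```

Building a fresh dict per delegate keeps the two providers' credentials from leaking into each other or into the parent process.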

The whole point is cheaper inference for bulk coding work, routed intelligently by task context.

It worked beautifully. Until it absolutely did not.

the wall we hit

I was running a larger delegation through /zai — a chunky file context, around 1MB of input.

It came back fast. Four seconds. HTTP 200. No error thrown.

The problem? Nothing had actually been coded.

The response content was empty. The status code said success. The actual stop reason — buried in the payload — was model_context_window_exceeded.

Z.AI's GLM-5.1 had hit its context ceiling and politely returned an empty response, and from the outside everything looked completely fine.

what I tried that didn't work

I assumed the empty output was a bug on my side. Wrong prompt framing, maybe. A tool invocation issue. I chased it down the wrong path for longer than I'd like to admit.

(I have a real gift for debugging the thing that isn't broken.)

The actual mistake? I was routing on 200 vs 4xx/5xx like it was a normal REST API.

Totally reasonable assumption for most services. Catastrophically wrong here.

the fix that finally worked

Two checks. BOTH required, every single call:

  1. stop_reason == end_turn on the final response. Anything else (max_tokens, model_context_window_exceeded, and so on) means the model didn't finish cleanly
  2. Non-empty content — a 200 with empty content is a silent failure, full stop
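
The two checks above can be sketched as a single gate. This assumes an Anthropic-style payload where content is a list of blocks with "type" and "text" keys (or a plain string); the function names are hypothetical.

```python
from typing import Any


def has_text(content: Any) -> bool:
    """True if content carries at least one non-blank piece of text."""
    if isinstance(content, str):
        return bool(content.strip())
    if isinstance(content, list):
        return any(
            isinstance(block, dict)
            and block.get("type") == "text"
            and isinstance(block.get("text"), str)
            and bool(block["text"].strip())
            for block in content
        )
    return False


def call_succeeded(status_code: int, payload: dict) -> bool:
    """A call counts as a success only if BOTH checks pass."""
    if status_code != 200:
        return False
    # Check 1: the model has to report a clean finish.
    if payload.get("stop_reason") != "end_turn":
        return False
    # Check 2: a 200 with empty content is a silent failure.
    return has_text(payload.get("content"))


# The silent-failure shape from the story: 200, empty content, wrong stop_reason.
silent = {"stop_reason": "model_context_window_exceeded", "content": []}
print(call_succeeded(200, silent))
```

Run against the silent-failure payload, the gate returns False even though the status code is 200.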

Here's the thing that made this obvious in hindsight…

MiniMax is completely honest about failure. At 2MB input (~512K tokens), it returns a clean HTTP 400 with "context window exceeds limit". You know exactly what happened.

Z.AI does not. HTTP 200. Empty response. stop_reason: model_context_window_exceeded. Total silence from the status code layer.

Neither API is better. They just have different failure signatures, and you have to know each one's specific tell.
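
One way to normalize the two signatures into a single outcome, as a rough sketch. It matches the 400 case by searching the raw body for "context window", since the exact error JSON shape isn't given here; the enum and function names are hypothetical.

```python
import json
from enum import Enum


class CallOutcome(Enum):
    OK = "ok"
    CONTEXT_EXCEEDED = "context_exceeded"
    OTHER_FAILURE = "other_failure"


def classify(status_code: int, body_text: str) -> CallOutcome:
    """Map a raw HTTP response to one outcome, whichever way it failed."""
    # Loud signature: HTTP 400 whose body mentions the context window.
    if status_code == 400 and "context window" in body_text.lower():
        return CallOutcome.CONTEXT_EXCEEDED
    if status_code != 200:
        return CallOutcome.OTHER_FAILURE

    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return CallOutcome.OTHER_FAILURE
    if not isinstance(payload, dict):
        return CallOutcome.OTHER_FAILURE

    # Silent signature: HTTP 200, but the stop_reason gives it away.
    stop_reason = payload.get("stop_reason")
    if stop_reason == "model_context_window_exceeded":
        return CallOutcome.CONTEXT_EXCEEDED
    if stop_reason == "end_turn" and payload.get("content"):
        return CallOutcome.OK
    return CallOutcome.OTHER_FAILURE
```

The caller then only ever branches on CallOutcome, and each new provider's tell gets added in one place.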

the routing rule this built

Now every probe in Apollo's routing logic distinguishes three states — not two:

  • Probe succeeds, provider OK → use it
  • Probe succeeds, provider exhausted or context-blown → fallback
  • Probe fails entirely → fallback too — NOT primary

That third state is the trap that gets people. The naive if/else only asks "is the provider exhausted?", so a failed probe lands in the else branch right next to "provider OK". When your probe blows up for mysterious reasons, you silently route to… primary. Again. And nothing tells you.

Being explicit about all three states is the whole fix.
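
The three states can be sketched like this, next to the naive version for contrast. It assumes a probe is a callable that returns True when the provider is exhausted or context-blown and raises when the probe itself breaks; all names are hypothetical.

```python
from enum import Enum
from typing import Callable, Optional


class ProbeState(Enum):
    PROVIDER_OK = "provider_ok"
    PROVIDER_EXHAUSTED = "provider_exhausted"
    PROBE_FAILED = "probe_failed"


def run_probe(probe: Callable[[], bool]) -> ProbeState:
    """Run the probe and keep "probe broke" separate from both answers."""
    try:
        exhausted = probe()
    except Exception:
        return ProbeState.PROBE_FAILED
    return ProbeState.PROVIDER_EXHAUSTED if exhausted else ProbeState.PROVIDER_OK


def choose_route(state: ProbeState) -> str:
    """Every state gets an explicit branch; nothing falls through to primary."""
    if state is ProbeState.PROVIDER_OK:
        return "primary"
    if state is ProbeState.PROVIDER_EXHAUSTED:
        return "fallback"
    if state is ProbeState.PROBE_FAILED:
        return "fallback"
    raise ValueError(f"unhandled probe state: {state!r}")


def naive_route(exhausted: Optional[bool]) -> str:
    """The trap: a failed probe shows up as None and lands on primary."""
    return "fallback" if exhausted else "primary"
```

With the naive version, a probe that died and reported None is indistinguishable from a healthy provider. The enum forces the failed probe to be named and routed on purpose.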

why this matters to me

I'm routing real client work through these agents. A silent failure that looks like success doesn't just waste a few seconds — it produces nothing, logs success, and I find the gap three steps later wondering what went wrong.

The rule is dead simple: HTTP 200 ≠ success.

Verify stop_reason. Verify content. Know your provider's failure signature before you trust it with anything real.

Multi-provider AI routing is POWERFUL. But every provider you add is another failure mode you haven't seen yet.