The Qwen3.5 Thinking Budget Trap Nobody Warned Me About

the setup

I've been building a local task ranker for my Apollo Dashboard — a personal cockpit that runs on localhost, reads my Things 3 (my task manager) queue, and ranks everything by priority using a local LLM.

The new backend I was adding: DFlash (block-diffusion speculative decoding, from z-lab), which runs as its OWN OpenAI-compatible MLX (Apple Silicon inference framework) server. Not Ollama. Not an Ollama backend. Its own thing — you pip install dflash-mlx, run dflash serve, and POST to /v1/chat/completions. Important distinction. "Ollama + DFlash" doesn't compose; pick one.

The model: mlx-community/Qwen3.5-9B-4bit with the paired z-lab/Qwen3.5-9B-DFlash draft.

what broke

First inference. Perfect API call. HTTP 200. JSON parses fine.

content: empty string.

finish_reason: "length".

I stared at this for an embarrassingly long time. The model wasn't erroring. It was "succeeding" — it just had nothing to say.

what I tried that didn't work

I know Qwen3 has a soft switch for disabling thinking: add /no_think to the system prompt. I'd used it before with Ollama.

Added it. Re-ran. Same result: empty content, finish_reason: "length".

On this MLX build, the prompt-level switch is completely ignored. The model reads it, nods politely, then thinks the ENTIRE max_tokens budget into reasoning_content anyway. Every. Single. Token. And returns nothing visible.

the fix that worked

The fix lives in the request body, not the prompt:

{
  "chat_template_kwargs": { "enable_thinking": false }
}

That's it. That one key.

Sent it. Got a clean, complete JSON response back. finish_reason: "stop". 24 tasks, ranked properly.

why this matters

Qwen3.5 and Qwen3.6 are reasoning models — thinking is ON by default. If you're using them for non-reasoning tasks (JSON extraction, ranking, classification), they'll blow their entire token budget on hidden chain-of-thought and return you nothing visible. The empty content + finish_reason: "length" combo is the signature. Recognize it.

The prompt soft switch (/no_think) may work in other runtimes. It did not work here. Don't rely on it for MLX builds. Use chat_template_kwargs at the API-body level — that's the only thing that held.

And the payoff once I got there?

DFlash on M4 Max beat Ollama qwen2.5:7b on both speed AND quality. Warm inference ~12.6s / ~63 tok/s vs Ollama's ~17.9s / ~28 tok/s. But here's the part I didn't expect: DFlash ranked 22 of 24 tasks correctly. Ollama 7B got 7.

The quality delta was the real win. 9B > 7B, even without M5's NAX matrix kernels (my M4 Max falls back to slower "steel" kernels — on M5 the gains would be bigger).

Fix the thinking flag first. Then care about speed.

P.S. One more: the serve flag is --draft-model, not --draft. That one got me too.