the setup
I've been building a local task ranker for my Apollo Dashboard — a personal cockpit that runs on localhost, reads my Things 3 (my task manager) queue, and ranks everything by priority using a local LLM.
The new backend I was adding: DFlash (block-diffusion speculative decoding, from z-lab), which runs as its OWN OpenAI-compatible MLX (Apple Silicon inference framework) server. Not Ollama. Not an Ollama backend. Its own thing — you pip install dflash-mlx, run dflash serve, and POST to /v1/chat/completions. Important distinction. "Ollama + DFlash" doesn't compose; pick one.
The model: mlx-community/Qwen3.5-9B-4bit with the paired z-lab/Qwen3.5-9B-DFlash draft.
what broke
First inference. Perfect API call. HTTP 200. JSON parses fine.
content: empty string.
finish_reason: "length".
I stared at this for an embarrassingly long time. The model wasn't erroring. It was "succeeding" — it just had nothing to say.
what I tried that didn't work
I know Qwen3 has a soft switch for disabling thinking: add /no_think to the system prompt. I'd used it before with Ollama.
Added it. Re-ran. Same result: empty content, finish_reason: "length".
On this MLX build, the prompt-level switch is completely ignored. The model reads it, nods politely, then thinks the ENTIRE max_tokens budget into reasoning_content anyway. Every. Single. Token. And returns nothing visible.
the fix that worked
The fix lives in the request body, not the prompt:
{
"chat_template_kwargs": { "enable_thinking": false }
}
That's it. That one key.
Sent it. Got a clean, complete JSON response back. finish_reason: "stop". 24 tasks, ranked properly.
why this matters
Qwen3.5 and Qwen3.6 are reasoning models — thinking is ON by default. If you're using them for non-reasoning tasks (JSON extraction, ranking, classification), they'll blow their entire token budget on hidden chain-of-thought and return you nothing visible. The empty content + finish_reason: "length" combo is the signature. Recognize it.
The prompt soft switch (/no_think) may work in other runtimes. It did not work here. Don't rely on it for MLX builds. Use chat_template_kwargs at the API-body level — that's the only thing that held.
And the payoff once I got there?
DFlash on M4 Max beat Ollama qwen2.5:7b on both speed AND quality. Warm inference ~12.6s / ~63 tok/s vs Ollama's ~17.9s / ~28 tok/s. But here's the part I didn't expect: DFlash ranked 22 of 24 tasks correctly. Ollama 7B got 7.
The quality delta was the real win. 9B > 7B, even without M5's NAX matrix kernels (my M4 Max falls back to slower "steel" kernels — on M5 the gains would be bigger).
Fix the thinking flag first. Then care about speed.
P.S. One more: the serve flag is
--draft-model, not--draft. That one got me too.