TLDR: If your plan's justification is "this should be faster," benchmark that claim FIRST — before building anything around it.
The Setup
I was adding a new ranking backend to my personal productivity dashboard (my local personal cockpit — Things 3 tasks, calendar, email, all in one place).
The backend was a local MLX speculative-decoding server that runs on my M4 Max — my own machine, fully offline, potentially much faster than the Ollama setup I had.
The plan had four phases:
- Stand up the MLX sidecar (the server process)
- Build the sidecar lifecycle manager (start, health-check, SIGTERM reaping)
- Wire in the ranking backend + fallback chain
- Add the UI — labeled dropdown, backend badge, model selector
- Then benchmark it
That ordering felt natural.
It was also WRONG.
The Wall I Almost Hit
I took my plan to the advisor before executing, and they flagged it immediately.
"Your own analysis predicts the payoff here is uncertain. Why is the benchmark in Phase 4?"
And… they were right.
The way the plan was sequenced, I'd have built the entire subsystem — sidecar lifecycle manager, ranking backend with fallback chain, SIGTERM reaping, dropdown UI, backend badge — and then found out whether the MLX backend actually beat Ollama on my real workload.
If it didn't? I'd just built a nice chunk of code for nothing.
What I Changed
I moved the benchmark to Phase 0.
Right after the MLX sidecar first stands up, before a single line of the surrounding wiring exists — that's when you test the claim. That's the cheapest possible moment.
I called it Phase-0 benchmark harness in the commit, because that's exactly what it is: a gate, not a curiosity.
One rule from that session that I've held onto:
Use the real workload. Not a toy input. I benchmarked against the live Things 3 Today set with the exact production prompt. Toy benchmarks miss workload-shape effects — a prompt-heavy task and an output-heavy task respond completely differently to speculative decoding optimizations.
What Happened
The MLX backend passed.
It shipped as the default ranking backend. The sidecar lifecycle manager, the fallback chain, the UI — all of it got built, and it was worth building.
But here's the thing: if it had failed, I'd have spent maybe twenty minutes on the benchmark, surfaced the numbers, and moved on.
Instead of building all of it first, then finding out.
Why This Matters to Me
This is a habit I want to bake in permanently.
Any time a plan phase's justification is "this should be faster" or "this should be better" — and the payoff is genuinely uncertain — I ask: what's the cheapest experiment that confirms the win, and how early can I run it?
Make it a gate. Put it at the front.
The gate doesn't reopen the "build it" decision. It front-loads the data you already wanted to have.
P.S. The advisor catches this kind of resequencing every time. Worth showing it the plan before you execute — not to slow down, but to catch exactly this.