TLDR: My first two model comparisons lied to me — in opposite directions. One made a good model look broken. One made a bad model look like a winner. Here's the methodology that actually tells the truth.

The Setup

I'm building Apollo, my personal AI operating system (daily briefings, task intelligence, email triage, Obsidian memory search). I was migrating agents off Anthropic models to open-source models served through Ollama Cloud, with requests from my local machine proxied to cloud inference.

The obvious question: which model for which job?

I needed a bake-off. I built one — SDK client, triage runners, a blind Opus judge for subjective quality, a hand-labeled gold key for objective scoring. I was proud of it.

And then it lied to me. Twice.

Lie #1: The False Positive

kimi-k2.5:cloud won the CEO daily briefing text bake-off. 8.0 vs. Claude Sonnet's 7.0. Blind Opus judge. Looked clean.

I shipped it.

First 8AM live fire: kimi stubbed the briefing note with "See briefing below" and emitted nothing below it. Called add_todo six times in a loop for the same task. Failed to read Things entirely ("task reader unavailable"). Hung for 14 minutes.

The text quality was genuinely good. But I wasn't deploying a text generator — I was deploying an agent that had to operate tools: read Things, write a real note, exit clean. I never tested that. My eval measured the wrong thing.
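A minimal sketch of what a behavioral check for that kind of agent could look like. This is not the Apollo harness: the `AgentRun` record, the `read_things` tool name, and the thresholds are hypothetical, chosen only to mirror the failures described above (a missing read, a repeated `add_todo`, a stubbed note, a hang).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str    # tool name, e.g. "add_todo"
    args: tuple  # hashable summary of the arguments (hypothetical shape)


@dataclass
class AgentRun:
    # Hypothetical record of one agent run; field names are illustrative.
    tool_calls: list = field(default_factory=list)
    note_body: str = ""
    exited_cleanly: bool = False
    duration_s: float = 0.0


def check_agent_run(run, required_tools, max_duration_s=120.0, min_note_chars=200):
    """Return a list of behavioral failures; an empty list means the run passed."""
    failures = []

    # Every required tool must have been called at least once.
    called = {call.name for call in run.tool_calls}
    for tool in required_tools:
        if tool not in called:
            failures.append(f"never called {tool}")

    # The same tool with the same arguments more than once suggests a loop.
    counts = Counter((call.name, call.args) for call in run.tool_calls)
    for (name, args), n in counts.items():
        if n > 1:
            failures.append(f"duplicate call: {name}{args} x{n}")

    # A very short note body is treated as a stub (threshold is an assumption).
    if len(run.note_body.strip()) < min_note_chars:
        failures.append("note body is a stub")

    if not run.exited_cleanly:
        failures.append("did not exit cleanly")
    if run.duration_s > max_duration_s:
        failures.append("exceeded time budget")
    return failures


# A run shaped like the failed 8AM briefing: six identical add_todo calls,
# no task read, a stub note, and a 14-minute hang.
bad_run = AgentRun(
    tool_calls=[ToolCall("add_todo", ("same task",)) for _ in range(6)],
    note_body="See briefing below",
    exited_cleanly=False,
    duration_s=14 * 60,
)
for failure in check_agent_run(bad_run, required_tools=["read_things", "add_todo"]):
    print(failure)
```

A check like this scores behavior separately from the judge's text score, so a run can earn an 8.0 on prose and still fail the deployment gate.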

Lie #2: The False Negative

Same session, I was testing thinking models for triage and synthesis. They all failed. I concluded OSS couldn't handle the tasks, kept Haiku/Sonnet, moved on.

A week later I found the real culprit: max_turns=1 hardcoded in the harness.

Thinking models spend their single turn on the reasoning block. They never emit the answer. The response comes back empty → silent fallback → "OSS is broken." It wasn't. Setting max_turns=4 flipped both tasks to clean, viable migrations overnight.

Before you blame the model, rule out your harness.
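One way to make that failure loud instead of silent is a guard that refuses to fall back on an empty answer. The sketch below is an assumption-heavy illustration: `call_model` and its return shape (`reasoning` and `answer` keys) are hypothetical stand-ins, not any real SDK's API, and `fake_thinking_model` only imitates a model whose first turn is consumed by reasoning.

```python
class EmptyAnswerError(RuntimeError):
    """Raised when a model run produced no final answer."""


def run_with_guard(call_model, prompt, max_turns=4):
    """Call the model and fail loudly if the final answer is empty.

    `call_model` is a hypothetical callable returning a dict with
    'reasoning' and 'answer' keys; adapt it to whatever client you use.
    """
    result = call_model(prompt, max_turns=max_turns)
    answer = (result.get("answer") or "").strip()
    if not answer:
        reasoning_len = len(result.get("reasoning") or "")
        # Reasoning with no answer points at the turn budget, not the model.
        raise EmptyAnswerError(
            f"empty answer with max_turns={max_turns}; "
            f"{reasoning_len} chars of reasoning were produced"
        )
    return answer


def fake_thinking_model(prompt, max_turns):
    # Stand-in for a thinking model: turn 1 is reasoning only,
    # so an answer appears only when at least 2 turns are allowed.
    reasoning = "step-by-step thoughts about: " + prompt
    answer = "final answer" if max_turns >= 2 else ""
    return {"reasoning": reasoning, "answer": answer}


try:
    run_with_guard(fake_thinking_model, "triage this email", max_turns=1)
except EmptyAnswerError as err:
    print("harness problem:", err)

print(run_with_guard(fake_thinking_model, "triage this email", max_turns=4))
```

The error message reports how much reasoning was produced, which separates "the model thought and ran out of turns" from "the model returned nothing at all".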

What Actually Works

I rebuilt the protocol around a rule I put simply: "bake off, then the advisor reviews the results, not your results, and then you look again."

The methodology that holds up:

  1. One identical prompt to every entrant. Same context, same tools, same number of passes. If conditions differ, it's a supervised reconciliation — not a bake-off.
  2. Blind judges for subjective quality (Opus, no model labels visible). Objective gold-key scoring for measurable tasks (I hand-labeled 11 real commitments from a call transcript).
  3. Eval the actual production behavior. For agents: did it execute tools correctly? Single clean write? Real note body? Clean exit? Text quality is a different score.
  4. Rule out the harness before you condemn the model.
  5. Match the model to the task's error-cost profile. On action-item extraction (11 real commitments, blind): glm-5.1 caught 9 — best recall, uniquely got the hardest items. kimi-k2.6 caught 5 — precision-biased, 0 false positives but missed more than half. If a missed item is the expensive error, ship the high-recall model. If a false item is expensive, ship the conservative one.
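The gold-key scoring in steps 2 and 5 comes down to set arithmetic. Here is a minimal sketch, assuming each commitment has been reduced to a comparable ID. The IDs are placeholders, and which specific items each model caught is invented for illustration; only the counts (11 gold items, 9 caught, 5 caught with 0 false positives) come from the results above.

```python
def score_against_gold(predicted, gold):
    """Score a set of predicted item IDs against a hand-labeled gold set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # real items the model caught
    fp = len(predicted - gold)   # items the model invented
    fn = len(gold - predicted)   # real items the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# 11 hand-labeled commitments, represented by placeholder IDs c1..c11.
gold = {f"c{i}" for i in range(1, 12)}

# High-recall entrant: 9 of 11 caught. The post does not report its false
# positives, so only recall is printed for it.
high_recall = {f"c{i}" for i in range(1, 10)}

# Conservative entrant: 5 of 11 caught, 0 false positives.
conservative = {f"c{i}" for i in range(1, 6)}

a = score_against_gold(high_recall, gold)
b = score_against_gold(conservative, gold)
print(round(a["recall"], 2))                             # 0.82
print(round(b["recall"], 2), b["precision"], b["fn"])    # 0.45 1.0 6
```

Reporting misses and false items as separate numbers, rather than one blended score, is what makes the error-cost decision in step 5 possible.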

Final Routing

  • Chunker: deepseek-v4-flash:cloud (tied qwen3-coder:480b at 7.75, beat claude-sonnet-4-6 at 7.25, and ran 6× faster)
  • CEO daemon: deepseek-v4-pro:cloud — not the text winner, but the one that actually operates the tools
  • Email triage: glm-5:cloud — recall task, right error profile

Why This Matters to Me

I almost shipped two wrong conclusions and called them data. The bake-off format feels rigorous — scorecards, blind judges, numbers. But methodology is the whole game. Identical conditions. Right eval criteria. Harness sanity first.

The models were fine. My tests were the problem.

P.S. The harness code is in the Apollo dashboard's ingest directory — DM me if you want the blind-judge runner pattern.