TLDR: The model that won my general text bake-off quietly dropped 4 real commitments when I aimed it at meeting transcripts. Recall matters more than precision here — and knowing WHY changes which model you pick.

The Setup

I have a scanner that reads every Fathom (my AI meeting recorder) transcript as it comes in and extracts what I committed to on the call.

Things like "I'll send you the intro by Friday" or "Let me research that and get back to you." First-person, outbound, owed to another human.

Apollo (my AI agent running on Claude) turns those into Things 3 (my task manager) tasks before the call is even off my screen.

My Dumb Assumption

I'd already run a general text bake-off across models on Ollama Cloud (my local LLM server). Kimi K2.6 came out on top.

So naturally I assumed: use the bake-off winner for this too.

Wrong.

What the Blind Test Actually Showed

I ran a proper fair bake-off — identical prompt, single-shot, transcript-only, models blinded, advisor judging against an independently-built answer key of 11 real commitments I made on a single call.

Kimi K2.6 came back precision-biased. Tidy, confident, wrong. It dropped items it wasn't sure about.

GLM-5.1 captured everything.

The difference wasn't quality in the abstract. It was error-cost profile. Kimi optimized for "only flag what you're sure of." GLM-5.1 optimized for "don't miss anything."

For this task, missing a commitment to a client is catastrophically worse than a false positive I delete in 10 seconds. GLM-5.1 is the right call — not because it's the best model, but because it fails in the cheaper direction.

The Extraction Target Also Matters

The model choice is only half the problem. How you define what you're extracting changes everything.

A few things I got wrong at first:

  • "I'll research X" IS a commitment. The task isn't "Investigate X." It's "Deliver research findings on X to [Person]." The deliverable is the OUTPUT, not the activity.
  • Counterparty is mandatory. A commitment without who's waiting for it is amnesiac. "Send intro" is useless. "Send intro to my business partner" is actionable.
  • Deadline, even implied. "Before our next call" counts. Without it, the task floats forever.

Miss any of these and you're just making a sloppy to-do list from your own calls.

Why This Matters to Me

I spend a lot of time on calls. I make a lot of small promises that feel obvious in the moment and vanish completely by dinner.

Building this scanner wasn't about being organized. It was about not being the person who forgets. That reputation compounds.

The model bake-off was a reminder that "best" is always relative to the job. Ask which error you can afford — then pick accordingly.


One name slipped through: "Cormican" in the bullet point "Send intro to my business partner Cormican" — replaced with "Send intro to my business partner". Everything else was clean.