The API Won't Tell You When You're Paying Twice

TLDR: Two separate gpt-4.1-mini calls doing the exact same extraction, added weeks apart. The API processed both happily. No error. No warning. Just a bill.

The Setup

I've been building a treatment facility finder for cancer patients, delivered as a chat interface.

The core flow: the user chats, the app collects their contact info (name, email, phone, health situation), then surfaces matching facilities alongside a pre-filled form from HubSpot (my CRM).

Two separate things needed that extracted contact info.

So... I wrote two separate extractions.

What I Actually Built (Without Realizing It)

The first one made sense: before we surface facilities, we need to know if we have a complete lead. So the chat route called gpt-4.1-mini to pull { fullName, email, phone, healthIssue } from the conversation and checked whether complete === true.

Later, I needed that same data on the client side, to pre-fill the HubSpot form.

So I added another gpt-4.1-mini call.

Same model. Same conversation context. Same JSON schema. Just… lower in route.ts, after the main response was assembled.

Why I Didn't Notice

The API returned 200 OK on both.

Both responses were valid, well-formed JSON. Both extractions were correct. Neither one threw an error or took longer than usual.

There is absolutely nothing in an LLM API response that says "hey, you ran this same extraction 150 lines ago."

It's not like a database where a duplicate write might throw a constraint violation, or a cache that returns a HIT. The model has no memory of your last request. It processes every call fresh. And it bills you the same way.

I only found it because I went looking at route.ts during a performance pass and noticed the shape of the second call looked suspiciously familiar.

// FIRST call (~line 80): checks if lead is complete
const extractRes = await openai.chat.completions.create({
  model: "gpt-4.1-mini",
  response_format: { type: "json_object" },
  messages: [ /* full conversation */ ]
});

// ... 150 lines of business logic ...

// SECOND call (~line 240): "extracts lead info for client-side storage"
// (variable name is illustrative; two consts in one scope cannot share a name)
const clientExtractRes = await openai.chat.completions.create({
  model: "gpt-4.1-mini",  // same model
  response_format: { type: "json_object" },
  messages: [ /* same conversation */ ]  // same input
});

Yeah.

The Fix

Four lines added to the first pass, capturing the result while we already have it:

if (hasLeadInfo && extracted.fullName && extracted.email) {
  extractedLeadInfo = { fullName: extracted.fullName, email: extracted.email,
                        phone: extracted.phone, healthIssue: extracted.healthIssue };
}

Then delete the second call entirely.

The fix was 4 lines added, 27 lines removed.
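
For anyone who wants the "extract once, use twice" idea in a more reusable shape, here is a minimal sketch. It is not the code from my route: deriveLeadViews, ExtractedLead, and LeadViews are hypothetical names, and the rule that "complete" means all four fields are non-empty is an assumption for illustration. The point is only that both consumers read from one parsed result.

```typescript
// Fields the extraction is described as returning; any of them may be missing.
interface ExtractedLead {
  fullName?: string;
  email?: string;
  phone?: string;
  healthIssue?: string;
}

interface LeadViews {
  complete: boolean;                 // gate for surfacing facilities
  formPrefill: ExtractedLead | null; // payload for the client-side form
}

// Hypothetical helper: one extraction result in, both consumers out.
function deriveLeadViews(extracted: ExtractedLead): LeadViews {
  const required = [
    extracted.fullName,
    extracted.email,
    extracted.phone,
    extracted.healthIssue,
  ];
  // Assumption: "complete" means every field is a non-empty string.
  const complete = required.every(
    (v) => typeof v === "string" && v.trim() !== ""
  );
  // Same condition as the fix: prefill once name and email exist.
  const formPrefill =
    extracted.fullName && extracted.email ? { ...extracted } : null;
  return { complete, formPrefill };
}
```

Parse the model's JSON once, pass it through a function like this, and there is no second question left to ask the model.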

Why This Happens (and Will Happen to You)

Features get added incrementally. The first extraction was "check completeness." The second was "populate the client." They felt like different jobs, so they got built separately.

The danger is that LLM routes don't have the natural checkpoints other code has. A SQL ORM complains about bad queries. A type system catches shape mismatches at compile time. But two semantically identical LLM calls in the same request? Total silence. Both succeed.

What I do now: before shipping any chat route, I count the LLM calls explicitly. I want to know: how many model calls does one user message trigger? If I can't answer that from memory, I go read the file.
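
Counting by reading the file works, but the count can also be made mechanical. Below is a rough sketch of a per-request counter, assuming you can route every model call through one function. countCalls, LlmCall, and the budget parameter are names I made up for this example; nothing here is part of the OpenAI SDK.

```typescript
// Any async function that sends a request to a model and returns its response.
type LlmCall<Req, Res> = (req: Req) => Promise<Res>;

interface CountedLlm<Req, Res> {
  call: LlmCall<Req, Res>;
  count: () => number;
}

// Hypothetical wrapper: create one per incoming user message.
// `budget` is the number of model calls you expect that message to trigger.
function countCalls<Req, Res>(
  inner: LlmCall<Req, Res>,
  budget: number
): CountedLlm<Req, Res> {
  let calls = 0;
  return {
    call: (req) => {
      calls += 1; // counted when the call is dispatched, not when it resolves
      if (calls > budget) {
        console.warn(
          `LLM call #${calls} exceeds the budget of ${budget} for this request`
        );
      }
      return inner(req);
    },
    count: () => calls,
  };
}
```

With a budget of 1, the second extraction in my route would have logged a warning the first time it ran, instead of waiting for me to stumble on it.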

That number should never come as a surprise, least of all when you first see it on the billing dashboard.

P.S. The same pattern shows up at scale. When you fan out parallel LLM agents across a workload, add a label or trace ID per logical task — so you can verify each agent ran once, not twice. The API won't complain either way.
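
A minimal sketch of that labeling idea, under the assumption that each logical task can be given a stable string ID before its model call is dispatched. TaskLedger and the ID format in the usage note are hypothetical; a real setup would more likely hang this off whatever tracing you already have.

```typescript
// Hypothetical run ledger: counts how many times each task label was executed.
class TaskLedger {
  private runs = new Map<string, number>();

  // Call this immediately before dispatching the model call for a task.
  record(taskId: string): void {
    this.runs.set(taskId, (this.runs.get(taskId) ?? 0) + 1);
  }

  // Task IDs that ran more than once.
  duplicates(): string[] {
    return Array.from(this.runs.entries())
      .filter(([, n]) => n > 1)
      .map(([id]) => id);
  }

  // Expected task IDs that never ran at all.
  missing(expected: string[]): string[] {
    return expected.filter((id) => !this.runs.has(id));
  }
}
```

Recording an ID like "extract-lead:conv-1" before each call and checking duplicates() at the end of the run turns "did anything run twice?" into a question the code can answer.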