TLDR: Your API client timeout must live under your platform's hard ceiling — not above it. And retrying a 4xx just burns the budget twice.
the setup
We were building a cancer specialist finder — a tool that connects cancer patients to specialist treatment centers.
The core feature was an AI chat powered by OpenAI.
Running on Vercel (their serverless functions platform).
the wall we hit
Vercel gives you 60 seconds.
The OpenAI SDK's default timeout? Ten minutes.
So when OpenAI got slow — or just didn't respond — our function didn't fail fast. It sat there. For the full 60 seconds. Then Vercel killed it, the user got a blank error, and the logs told us basically nothing.
It took a full night audit to surface what was actually going on.
what I tried first
My instinct was: add retries.
More retries = more resilient. That's just obvious, right?
WRONG.
We had webhook calls firing to Make.com (a workflow automation platform) that delivered consultation summaries to treatment centers. When Make.com slowed down, those fetches had no timeout at all — they'd pin the function open until Vercel's clock ran out.
And I'd wired up retry logic without stopping to ask: retries on what, exactly?
A 4xx — bad request, wrong auth, malformed payload — will not fix itself on retry. You're just burning the budget a second time and delaying the error message you actually need.
That realization stung a little. I'd added the retries thinking I was being careful.
the fix that worked
Three changes, all simple:
- OpenAI singleton:
timeout: 30_000, maxRetries: 1— fits inside Vercel's 60s with breathing room - Webhook fetches:
AbortControllerwith a 15s per-attempt ceiling — Make.com can't hold the function hostage anymore - Retry filter: skip on 4xx, surface the status code in the failure message so you actually know what broke
the contrast that locked it in
Same week, completely different problem.
An overnight morning briefing automation — a daemon that generates morning priorities — kept timing out on heavy-load mornings.
The fix there was more retries, growing backoff: BACKOFFS=(0 30 90 180) seconds across up to 4 attempts, with escalating watchdog timeouts (300 300 600 600).
The opposite call. On purpose.
Because that daemon has no 60-second ceiling. The platform budget is totally different, so the right resilience parameters are totally different.
Same tool. Opposite settings. Because the context changed.
why this matters
Resilience isn't "add a timeout and some retries."
It's: what ceiling am I inside, what errors can actually be fixed by waiting, and what happens to the platform when I don't answer those questions first?
The OpenAI SDK defaults were designed for scripts running with no time pressure. Vercel serverless is the exact opposite of that. The mismatch was invisible until it wasn't.
Now I set explicit timeouts on every external call — on day one, before anything ships.
P.S. The error logging matters too. "Include status in failure message" was in the same commit. A retry loop with no diagnostic output is just a slower mystery.