Stop Retrying Your Own Mistakes: How I Fixed My Webhook Retry Logic on Cancer Doctor Finder

TLDR: Don't retry 4xx. A 401 won't fix itself after three attempts — it'll just fail slowly instead of fast.

The Setup

I was building the outbound webhook layer for a cancer education business, a platform that matches patients with treatment facilities.

When a patient has a consultation, the app fires a webhook to each recipient facility — transcript, contact info, the works — into Make (our automation platform), which then routes it into their CRM.

Simple enough. Fire and forget, but with a little resilience.

What I Built First

I pulled the webhook logic into a shared helper, src/lib/consultation-webhook.ts, and wrote a fetchWithRetry function: 3 attempts, linear backoff, try again if anything goes wrong.

And "anything" was the problem.

That first cut retried on every failure — 5xx, 4xx, network timeouts, all of it.

// first version — retried everything, no questions asked
for (let attempt = 0; attempt < retries; attempt++) {
  const res = await fetch(url, options);
  if (res.ok) return res;
  // wait, then loop
}

On the surface this looks fine. Retry = resilient, right?

Not really.

The Wall I Hit

Here's what I didn't think through: a 4xx error is YOUR mistake, not theirs.

401 Unauthorized? Your auth header is wrong. Retrying won't fix it.

404 Not Found? You're hitting the wrong URL. Retrying won't fix it.

422 Unprocessable Entity? Your payload is malformed. Retrying. Won't. Fix. It.

So what was actually happening? A misconfigured CONSULTATION_WEBHOOK_URL (or a bad payload shape during development) turned a fast, obvious failure into a slow, masked one. Three attempts, linear backoff in between, 10–15 seconds of delay… and the same error at the end.

Worse — because the function swallowed errors after retry exhaust, the status code disappeared into the void. I couldn't even tell why it failed.

The Fix That Worked

The fix is two lines of actual logic and a mindset shift:

const res = await fetch(url, options);

if (res.status >= 400 && res.status < 500) {
  // don't retry — this is a client error, retrying won't help
  throw new Error(`Webhook failed with ${res.status} — check URL/auth/payload`);
}

if (!res.ok) {
  // 5xx or network error — worth retrying
  continue;
}

Fail fast on 4xx, surface the status, move on.

Retry 5xx and network errors — those are transient, worth a few attempts.

The commit message was fix(webhook): skip retries on 4xx, include status in failure message, which is basically the whole lesson in a line.

Why the Mental Model Matters

There's a frame I've been using more broadly that clicks here: classify the error before you retry anything.

Transient (5xx, timeout, network blip) → retry as-is
Unfixable (4xx, bad auth, wrong URL) → stop, surface loudly

If you treat every failure as transient, you turn clear errors into delayed errors. And delayed errors are much harder to debug — especially in a webhook chain where the failure context is already one hop removed from where you're looking.

This is the fix I should have shipped in v1. Now it's the default shape I reach for anytime I write retry logic.

If you're building outbound webhooks — check what status codes you're actually retrying on. Odds are you're hammering a wall.

P.S. The include status in failure message half of that commit saved me 30 minutes on the very next debugging session. Log the status code. Always.