Why I Stopped Letting Chat Failures Eat My Users' Words

TL;DR: Retry with backoff and draft persistence are the two things I now add to any chat UI that talks to an LLM. The first absorbs transient network failures. The second means users never lose what they typed.

The App

I built a chat interface to help patients (and their families) find cancer treatment facilities — a Next.js app backed by an LLM, with facility cards that surface from a TakeShape CMS (a headless content platform).

The stakes felt high. Someone in the middle of a hard conversation about their cancer care should not have to fight the UI.

The Wall

Transient network blips happen. The LLM API times out. The edge function hiccups.

When that happened in the first version, the chat just… failed. A cryptic error, the message gone, the user staring at a broken state. No recovery path.

That's the worst outcome. Not the failure itself — the silence.

What I Tried First (Cycle 11)

My first fix was honest: I added distinct error states with a warning icon and a retry button.

Good! Users could at least see something went wrong. They weren't just staring at a spinner forever.

But the retry button still made them retype everything. Their draft was gone. The experience said: your words don't matter, start over.

That's still not good enough.

The Fix That Actually Worked

Two commits, back to back:

1. Retry with exponential backoff: instead of surfacing an error on the first failure, the chat silently retries with growing delays. One transient flake doesn't break the conversation.
2. Draft input persistence: the typed message is saved to localStorage before the send attempt. If the request fails (or the tab reloads, or the session times out), the draft is still there when the user comes back.
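
Here is a minimal sketch of the retry half, in TypeScript since the app is Next.js. It is not the code from the actual commit: the names (retryWithBackoff, backoffDelayMs, the option fields) and the doubling schedule with a cap are assumptions chosen for illustration.

```typescript
// Hypothetical helper names; a sketch of the technique, not the app's real code.
type Sleep = (ms: number) => Promise<void>;

interface RetryOptions {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // cap so the delay does not grow without bound
  sleep?: Sleep;       // injectable so tests do not have to wait
}

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay before retry number `retryIndex` (0-based): base * 2^retryIndex, capped.
function backoffDelayMs(
  retryIndex: number,
  baseDelayMs: number,
  maxDelayMs: number
): number {
  return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, retryIndex));
}

async function retryWithBackoff<T>(
  task: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final attempt; surface the error instead.
      if (attempt === options.maxAttempts - 1) break;
      await sleep(backoffDelayMs(attempt, options.baseDelayMs, options.maxDelayMs));
    }
  }
  throw lastError;
}
```

With baseDelayMs set to 500 and maxDelayMs set to 4000, the waits between attempts are 500, 1000, 2000, then 4000 ms. The error only reaches the UI once every attempt has failed, which is the point where the saved draft becomes the safety net.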

Combined: the system recovers automatically on transient errors, and if it can't recover, the user still has their words.
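
The draft half can be sketched the same way. Again this is an illustration under assumptions: the storage key, the function names, and the DraftStore interface are all hypothetical. The interface is a small subset of the Web Storage API, so window.localStorage satisfies it in the browser and a fake can stand in for it in tests.

```typescript
// Subset of the Web Storage API; window.localStorage fits this shape.
interface DraftStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const DRAFT_KEY = "chat:draft"; // hypothetical key name

function saveDraft(store: DraftStore, text: string): void {
  try {
    store.setItem(DRAFT_KEY, text);
  } catch {
    // Storage can be full or blocked; losing persistence should not block sending.
  }
}

function loadDraft(store: DraftStore): string {
  try {
    return store.getItem(DRAFT_KEY) ?? "";
  } catch {
    return "";
  }
}

function clearDraft(store: DraftStore): void {
  try {
    store.removeItem(DRAFT_KEY);
  } catch {
    // Nothing useful to do if removal fails.
  }
}

// Returns true if the message was sent, false if it failed and the draft was kept.
async function sendWithDraft(
  store: DraftStore,
  text: string,
  send: (text: string) => Promise<void>
): Promise<boolean> {
  saveDraft(store, text); // persist BEFORE the network call
  try {
    await send(text);
    clearDraft(store); // drop the draft only once the send has succeeded
    return true;
  } catch {
    return false; // the draft stays in storage for the retry button or next visit
  }
}
```

The ordering is the design choice that matters: write first, send second, clear last. On page load, the input can be initialized from loadDraft so a reload mid-failure still restores the text.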

In the same stretch of work I also landed another commit, "feat(chat): persist conversations across sessions via localStorage", so even full session reloads don't wipe the conversation history.
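
Persisting the whole conversation adds one wrinkle that a single draft string does not have: the stored value is JSON, and it may be stale or corrupted when it is read back. A rough sketch, assuming a simple role/content message shape and a hypothetical storage key (neither is taken from the real commit):

```typescript
// Subset of the Web Storage API; window.localStorage fits this shape.
interface HistoryStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Assumed message shape for illustration only.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

const HISTORY_KEY = "chat:history"; // hypothetical key name

function isChatMessage(value: unknown): value is ChatMessage {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { role?: unknown; content?: unknown };
  return (
    (candidate.role === "user" || candidate.role === "assistant") &&
    typeof candidate.content === "string"
  );
}

function saveHistory(store: HistoryStore, messages: ChatMessage[]): void {
  try {
    store.setItem(HISTORY_KEY, JSON.stringify(messages));
  } catch {
    // Quota exceeded or storage disabled: the in-memory conversation carries on.
  }
}

function loadHistory(store: HistoryStore): ChatMessage[] {
  try {
    const raw = store.getItem(HISTORY_KEY);
    if (raw === null) return [];
    const parsed: unknown = JSON.parse(raw);
    // Keep only entries that still match the expected shape.
    return Array.isArray(parsed) ? parsed.filter(isChatMessage) : [];
  } catch {
    return []; // corrupted JSON should not crash the chat on load
  }
}
```

Validating on read means a bad entry costs the user one message rather than the whole page.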

Why This Keeps Coming Up

I'd hit this exact shape in two other systems around the same time: my iMessage bridge (Apollo's self-healing dispatcher) and my daily briefing daemon, a scheduled job with 4 retry attempts and a growing backoff array, BACKOFFS=(0 30 90 180), in seconds.

Every time, the lesson was the same: classify the error, fix what you can, then retry — don't just show an error and make the human pick up the pieces.
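
The classify-then-retry part of that lesson can be sketched as below. It reuses the BACKOFFS schedule quoted above, but the classification rule is my assumption for illustration (no response, 408, 429, and 5xx count as transient; everything else is fatal), and the function names are hypothetical. The "fix what you can" step depends on the system, so it is left out.

```typescript
type ErrorClass = "retryable" | "fatal";

// Assumed rule: no response, timeouts, rate limits, and 5xx are transient.
function classifyStatus(status: number | undefined): ErrorClass {
  if (status === undefined) return "retryable"; // network failure, no response
  if (status === 408 || status === 429) return "retryable";
  if (status >= 500) return "retryable";
  return "fatal"; // e.g. 400 or 401: retrying the same request will not help
}

// Pull a numeric `status` off a thrown value, if it has one.
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

const BACKOFFS_SECONDS = [0, 30, 90, 180]; // one entry per attempt

const realSleepSeconds = (seconds: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, seconds * 1000));

async function runWithSchedule<T>(
  task: () => Promise<T>,
  sleepSeconds: (seconds: number) => Promise<void> = realSleepSeconds
): Promise<T> {
  let lastError: unknown;
  for (const delay of BACKOFFS_SECONDS) {
    await sleepSeconds(delay); // the first delay is 0, so the first try is immediate
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Stop early on errors that a retry cannot fix.
      if (classifyStatus(statusOf(error)) === "fatal") throw error;
    }
  }
  throw lastError;
}
```

Classifying first keeps the retry loop from hammering a request that can never succeed, and it lets the UI show a specific message straight away when the problem is not transient.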

Silent failures make a system feel broken even when the failure is recoverable.

Why It Matters to Me

The people using this cancer facility finder are not in a relaxed headspace. A failed send that eats their message isn't just a UX annoyance — it's a small betrayal.

Two pieces of code — a retry loop and a localStorage write — are the difference between a system that recovers quietly and one that punishes its users for a network hiccup that wasn't their fault.

Add both. Every time.