When Zero Looks Fine: Observability for Silent Background Failures

Two edits needed:

the an ecommerce business supply-chain app → grammatical fix from prior replacement
Both instances of The precall scanner → functional description (same label each time)

TLDR: If your background process returns nothing and throws no error, you have no idea if it worked. Make noise the default — before you do anything else.

the dashboard that lied

We were deep into an ecommerce business's supply-chain app — inventory, forecasting, webinar scheduling, all in one Supabase-backed Next.js project.

The board had five webinars scheduled.

The dashboard said "Upcoming Webinars: 0."

Looked perfectly fine.

That's the trap. Not a red error. Not an exception. Not a flash of wrong. Just... a zero. A HEALTHY-looking zero that nobody questioned.

what I chased first

My first instinct was the query logic. Wrong table? Stale data? Date filter too aggressive?

I spent a good 20 minutes on root-cause theories while the actual problem sat there invisible.

Turned out a PostgREST (Postgres's auto-generated REST API layer) schema-cache reload — triggered by a completely unrelated migration — had started enforcing a foreign-key ambiguity on webinars→topics that had been silently auto-resolving before. PGRST201 was firing on every request.

But this line was swallowing it whole:

const [{ data: upcomingWebinars }] = await Promise.all([...])

It ignores .error entirely. No throw. No log. Just an empty array and a zero on the screen.

what made it systemic

Once I stopped guessing and started looking for the pattern, I saw it everywhere.

ZAI (my LLM sub-agent for code delegation): returning HTTP 200 with stop_reason: "model_context_window_exceeded" and an empty content[0].text. Looked like a response. Was nothing.
Supabase RLS (row-level security, the access-control layer on the database): supabase-js doesn't throw on a denied read — it returns { data: null, error: null }. Completely indistinguishable from "record not found."
My autonomous calendar prep agent (the 6am script that preps calendar briefings before I wake up): dying inside launchd (macOS's background job runner) and writing the real error to a log file nobody was watching. No alert. No TTS. No iMessage. Just silence that looked exactly like success.

Every single one of these had a "success" face on.

the fix that worked

My own notes from that sprint are blunt about it: "Never propose root-cause fixes on top of hidden failures — you'll be guessing."

That's the real lesson. Restore visibility first. Then diagnose from real error data.

For shell commands and background scripts, this is the pattern I use everywhere now:

LOG=/tmp/scanner-$(date +%Y%m%d-%H%M%S).log
<command> 2>"$LOG"
echo "EXIT=$?"
tail -c 4000 "$LOG"

Exit code. Full stderr log path. Inline tail for quick scan. Three things you can actually use — without re-running anything.

For delegate calls to ZAI or MiniMax (my two LLM sub-agents), the post-call checklist:

HTTP 200 ✓ — necessary, not sufficient
stop_reason: end_turn — not model_context_window_exceeded, not max_tokens truncation
content[].text is non-empty — actually check it, don't assume
Files actually changed (if code was supposed to be written, confirm the diff)

And for Supabase — always destructure { data, error } and check both. error: null with data: null is not a success. It's a question you forgot to ask.

why this matters to me

I'm building systems that run while I sleep. My autonomous calendar prep agent fires at 6am so everything is ready when I wake up. The supply-chain dashboard gets checked by people who trust it.

Silent failure is worse than loud failure because it looks like success. A loud failure gets fixed. A silent failure gets shipped.

You don't debug what looks fine. You build on it. And then six migrations later you're staring at "0 upcoming webinars," wondering what went wrong — when the error was there the whole time, routed neatly to /dev/null.

Make it loud first. Then make it right.