TLDR: The platform's liveness probe and your external uptime monitor need different endpoints. Bolt deep checks onto the wrong one and a dead API key will restart-loop your whole app.

The Silent Outage

My business partner, who is also the client for a RAG-powered cybersecurity app I built for her team at an enterprise client, messaged me that the app wasn't working.

I had no idea.

The backend runs on Railway. A few days earlier I'd rotated the OpenAI and Anthropic keys, and… they never landed in the Railway env. The app had been degraded silently. No alert. No SMS. Nothing.

That stung.

My First Instinct (and Why It's a Trap)

The obvious fix: "add key validation to the /api/health endpoint."

But /api/health is Railway's liveness probe — the healthcheckPath in railway.toml. If that endpoint returns 503, Railway marks the service unhealthy and restart-loops the container. A dead API key would turn a degradation into a full outage. I'd have made it worse.

So: never bolt deep dependency checks onto the platform liveness probe.

The Fix That Worked

I built a second endpoint: /api/health/deep.

  • GET /api/health stays shallow — a single SELECT 1 against Supabase. Railway probes this. Fast, always cheap.
  • GET /api/health/deep validates everything — DB + OpenAI key + Anthropic key.

The trick for validating API keys: both providers expose a GET /v1/models endpoint that costs nothing to call. It returns 200 if your key is valid and 401 if it's dead. Zero tokens spent. Safe to poll every few minutes.

# OpenAI
GET https://api.openai.com/v1/models
Authorization: Bearer sk-…

# Anthropic
GET https://api.anthropic.com/v1/models
x-api-key: sk-ant-…
anthropic-version: 2023-06-01
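
Here's roughly what that check could look like in code. This is a minimal sketch in standard-library Python, which may not match the stack the backend actually uses. The names PROVIDERS, classify_status and check_key are illustrative, and the "unknown" state for anything other than 200 or 401 is my assumption about how to treat ambiguous responses.

```python
import urllib.error
import urllib.request

# URL plus a header builder per provider. Names here are illustrative.
PROVIDERS = {
    "openai": (
        "https://api.openai.com/v1/models",
        lambda key: {"Authorization": f"Bearer {key}"},
    ),
    "anthropic": (
        "https://api.anthropic.com/v1/models",
        lambda key: {"x-api-key": key, "anthropic-version": "2023-06-01"},
    ),
}


def classify_status(status: int) -> str:
    """Map the HTTP status of GET /v1/models to a key state."""
    if status == 200:
        return "valid"
    if status == 401:
        return "invalid"
    # Assumption: 429, 5xx and anything else say nothing about the key itself.
    return "unknown"


def check_key(provider: str, key: str, timeout: float = 5.0) -> str:
    """Return 'valid', 'invalid', 'missing' or 'unknown' for one provider key."""
    if not key:
        return "missing"  # no request is sent for an empty key
    url, make_headers = PROVIDERS[provider]
    request = urllib.request.Request(url, headers=make_headers(key), method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is on the exception.
        return classify_status(err.code)
    except OSError:
        # Covers URLError (DNS, refused connection) and socket timeouts.
        return "unknown"
```

Keeping "unknown" separate from "invalid" is what makes it possible to treat a provider blip differently from a dead key later on.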

I also tiered the failures — and this matters a lot:

  • DB down or key invalid/missing → 503 (pages me, because I can fix that)
  • Transient provider 5xx or timeout → 200 + degraded warning (so an OpenAI blip at 2am doesn't wake me up)
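
The tiering above can be sketched as one pure function. This is an illustration in Python, not the code from the app: the function name tier_response, the key state strings and the shape of the JSON body are all assumptions.

```python
from typing import Dict, Tuple


def tier_response(db_ok: bool, key_states: Dict[str, str]) -> Tuple[int, dict]:
    """Turn raw check results into (http_status, body).

    key_states maps a provider name to one of:
    'valid', 'invalid', 'missing', or 'unknown' (provider 5xx or timeout).
    """
    hard_failures = []
    if not db_ok:
        hard_failures.append("database unreachable")
    for name, state in key_states.items():
        if state in ("invalid", "missing"):
            hard_failures.append(f"{name} key {state}")

    # Transient provider trouble is reported but does not fail the check.
    warnings = [
        f"{name} could not be verified"
        for name, state in key_states.items()
        if state == "unknown"
    ]

    if hard_failures:
        # Something I can fix: return 503 so the monitor alerts.
        return 503, {"status": "unhealthy", "failures": hard_failures, "warnings": warnings}
    if warnings:
        # Provider blip: stay green for the monitor, but leave a trace.
        return 200, {"status": "degraded", "warnings": warnings}
    return 200, {"status": "ok"}
```

Because the function takes plain values and makes no network calls, the paging rules can be unit-tested without touching a database or a provider.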

The endpoint is token-gated: pass HEALTH_CHECK_TOKEN as a header OR a ?token= query param. Fail-closed (403) if the env var is unset.
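
A minimal sketch of that gate in Python, under a few assumptions: the header name x-health-token is hypothetical (the header isn't named above), a wrong token is treated the same as a missing one (403), and the header and query mappings are whatever the web framework hands you.

```python
import hmac
import os
from typing import Mapping, Optional

TOKEN_HEADER = "x-health-token"  # hypothetical header name


def authorize(
    headers: Mapping[str, str],
    query: Mapping[str, str],
    expected: Optional[str],
) -> int:
    """Return 200 if the request may proceed, 403 otherwise."""
    if not expected:
        # Fail closed: an unset or empty env var locks the endpoint.
        return 403
    supplied = headers.get(TOKEN_HEADER) or query.get("token") or ""
    # Constant-time comparison so the check does not leak timing information.
    if hmac.compare_digest(supplied.encode(), expected.encode()):
        return 200
    return 403  # assumption: a wrong token gets the same 403


def authorize_from_env(headers: Mapping[str, str], query: Mapping[str, str]) -> int:
    """Same check, reading the expected token from HEALTH_CHECK_TOKEN."""
    return authorize(headers, query, os.environ.get("HEALTH_CHECK_TOKEN"))
```

The early return on an empty expected value is the fail-closed part: without it, an empty supplied token would compare equal to an empty expected token and the endpoint would be open.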

Wiring It to Better Stack

I set up two monitors in Better Stack (an uptime monitor with a free tier):

  • Monitor 1: …/api/health/deep?token=<token> — catches dead keys + outages
  • Monitor 2: https://<your-app>.vercel.app — frontend heartbeat

3-minute check interval. SMS + email on failure.

Critical: the monitor has to run out of band — an external SaaS, not a launchd job on my Mac. My machine isn't running 24/7. The whole point is to catch overnight outages when I'm not looking.

Why This Matters to Me

A client found my outage before I did.

That's the kind of thing you only let happen once. The fix was maybe two hours of work. The lesson is permanent: a liveness probe is not a health check. One restarts your app. The other tells you if your app is actually useful.

P.S. — Better Stack's free tier gives you 10 monitors at 3-min checks. There's zero reason to fly blind.