TLDR: Don't check for required env vars when you need them. Check at boot — before anything else touches them. Silent missing-config bugs are the worst kind.
The Setup
I've been building Apollo, my personal agent system — a scanner daemon that processes requests and gates everything behind an MCP_SECURITY_TOKEN (a static HMAC secret). security.py calls hmac.compare_digest(provided_token, os.environ["MCP_SECURITY_TOKEN"]) on every single request. Airtight, right?
In theory, yes.
The Wall We Hit
The scanner would start fine. Absolutely no boot-time complaints.
Then at runtime: auth failures. Confusing, intermittent, hard to pin down.
Turns out my .env parser was stripping the env file wrong. The value in the file was MCP_SECURITY_TOKEN="abc123" — with quotes around it. The parser dutifully passed "abc123" (quotes included) into the environment. hmac.compare_digest got a quoted string that didn't match the real token. The var was set. It just wasn't right.
No boot error. No warning. Just silent auth rejections at the worst possible time.
The Parallel That Made Me Take This Seriously
Around the same time I was debugging the scanner, I was doing a night security audit on an oncology referral web app for a cancer education business. I found this:
fix(auth): fail closed when ADMIN_EMAIL env var is missing
Both requireAuth() and the middleware had been skipping the admin email check entirely when ADMIN_EMAIL was empty. Not crashing. Not logging. Just… waving any signed-in Supabase user straight through to /admin.
That one hit me hard. A missing env var wasn't just a misconfiguration — it was an open door.
The Fixes
For the scanner, two things:
-
At boot: check for
MCP_SECURITY_TOKENbefore the server starts accepting any connections. If it's missing, crash immediately — loudly, with a clear message. No var, no service. -
At runtime: if auth fails anyway (malformed value, rotation gone wrong, anything), create a persistent task in Things 3 (my task manager) so it surfaces where I actually look, not just in a log file I'll forget to tail.
I also fixed the quotes bug — strip surrounding quotes from .env values — so "set" actually means set.
For the cancer education business: all three gates (API auth, middleware, admin layout) now fail-closed. No ADMIN_EMAIL? Access denied, full stop.
Why This Matters
The pattern here is simple but it took me TWO incidents to really bake it in.
Missing config should fail loud and fail closed. Never silent. Never open.
The practical checklist I now run on every service:
- At startup, enumerate every required env var. If any are missing, refuse to start.
- If a var is present but rejected at runtime, escalate somewhere actionable — not just a log line (a Things task, a Slack message, something you'll actually see).
- Parse your
.envcarefully. Quoted values are a classic silent killer.
The scanner hit me with the quotes. The cancer education business hit me with the missing var letting users in. Both were fixable in five minutes once I knew. But neither made noise until after they'd already done damage.
That's the thing about environment variables — they're boring until they're not.
P.S. The scanner is now declared STABLE after 9 council cycles. Boot-time env validation was one of the last pieces to land. Boring, unsexy, absolutely essential.