The Prompt Is Not a Firewall

Now I have clear guidance. Let me produce the cleaned post.

TLDR: LLMs drift. Prompting them to stay inside a trust boundary is advisory. Code-level firewalls are mandatory.

The Setup

I was building Apollo — my personal AI agent — wired to Arcade (my MCP auth provider, the thing that gates Claude's access to Slack, Gmail, Calendar, and more) across three businesses: an ecommerce business, a second business, and a third business.

The scanner had a collector.py step that asked Sonnet to figure out which channels needed reauth.

I listed the valid channels right there in the auth_status schema. The ecommerce business Slack, the second business Slack, the third business Slack, Gmail, Calendar. A former business had left the picture months earlier. That former business's Slack was definitely not in the list.

That should have been enough.

The Wall

April 24th, 10:29am. Sonnet came back with:

slack_<former-business>: needs_reauth

That channel wasn't in the prompt. It wasn't in the schema. It didn't exist as a channel this agent was supposed to know about.

But main.py trusted the output. It created a Things (my task manager) task: "Fix Apollo Auth: slack_<former-business>."

I caught it. I — the human — caught a ghost task the model invented out of thin air.

What Didn't Work (and Why)

I assumed the schema was the boundary. List the valid channels. Model reports valid channels. Done.

NOPE.

LLMs drift. Sonnet had probably seen the former business in tool configs or earlier context. It surfaced that channel anyway — confident, parseable, formatted exactly like a real result. main.py accepted it as truth because the format was correct.

That's the trap. The model doesn't read "here are the valid channels" and hard-stop. It reads it as guidance and does its best… which, at 10:29 that morning, meant hallucinating a decommissioned business and adding noise to my task queue.

The Fix That Worked

Code-level allowlist at the consumer boundary.

After collector.py runs, main.py now filters against a hardcoded set of the three active businesses. Channel not on the list? Dropped. Doesn't matter what Sonnet reported. The gate doesn't move.

Same principle, different surface: my Apollo iMessage listener reads from ~/Library/Messages/chat.db — and everything in that database looks identical to the agent. SMS threads, group chats, messages from real contacts. All rows in the same table.

So I filter at the SQL level. Only chat.chat_identifier == <my Note-to-Self number> (my Note-to-Self chat) ever counts as a real request. Everything else is external, untrusted, dropped before it reaches the agent. I made that explicit at three layers — code, log, comment — so a runtime audit immediately shows the trust posture.

Why This Matters to Me

There's a framework I keep coming back to called the Lethal Trifecta. When an agent combines sensitive data + untrusted external inputs + reach to take real actions — you've built a full exfiltration path. The defenses against that aren't prompts. They're structural.

The ghost-channel incident was small. A ghost task in Things, nothing more. But the failure mode — model accepts something it shouldn't, downstream code trusts the output, action gets taken — is the exact same shape as something catastrophic in a more dangerous setup.

Decide what your agent is allowed to act on. Write that in code, at the boundary.

The model is not a bouncer.

P.S. — "Prompt-level exclusions alone are not enough" is now a rule in my memory. Hard won at 10:29am on a Tuesday.