Why Prompts Alone Can't Guard an AI Agent's Side Effects

TLDR: When an AI agent has side effects — creating tasks, sending messages, triggering anything — a prompt saying "don't do X" is not a hard stop. You need a code-level gate too. Learned this the fun way.

The Setup

Apollo (my personal AI operating system) runs a background scanner that checks whether my clients' integrations are still authenticated — Gmail, Slack, Google Calendar, Signal. When something needs attention, it creates a task in Things 3, my task manager.

Each client has a different stack. An ecommerce business has Gmail, Slack, Calendar. A Signal-only client? Signal only.

Mostly this works exactly as designed.

The Wall

Then I saw it in Things: slack_[client]: needs_reauth.

I stared at that for a second.

That client doesn't have Slack.

Not "hasn't connected it yet." Doesn't. Use. Slack. Full stop.

But Sonnet (the Claude model powering my scanner) had gone exploring — found a slack_[client] key somewhere in the auth status output, called it a failure, and my code dutifully spun up a reauth task for a channel that does not exist.

And here's the beautiful, terrible part: left unchecked, this loops. Scanner runs. Phantom auth error fires. Task created. I clear the task. Scanner runs again. Phantom fires again...

INFINITE. REAUTH. LOOP. For a service that was never connected.

What I Tried First

My first move was the obvious one: fix the prompt.

I added a ## OUT OF SCOPE — NEVER ATTEMPT section to collector.py and spelled it out explicitly:

A Signal-only client → Signal only. No Slack, Gmail, Calendar, Notion, Drive.
Notion → the ecommerce business only.
A law firm client uses Google Chat and another client uses Microsoft Teams; neither is accessible via this MCP (the tool-connection layer my scanner uses).

And honestly? That helps. The model mostly respects it.

But "mostly" is not a guarantee when the output of a model is what decides whether a real side effect fires. I've trusted prompts too many times and regretted it.

The Fix That Actually Worked

I added a hard IN_SCOPE allowlist in main.py.

IN_SCOPE = {
    "gmail_CLIENT_A", "gmail_CLIENT_B", "gmail_CLIENT_C", "gmail_CLIENT_D",
    "slack_CLIENT_A", "slack_CLIENT_B", "slack_CLIENT_C",
    "calendar_CLIENT_A", "calendar_CLIENT_B", "calendar_CLIENT_C", "calendar_CLIENT_D",
    "signal",
}

Before any auth_status key becomes a Things task, it has to clear that set. Anything not in IN_SCOPE gets logged as "Ignored out-of-scope auth keys" and goes absolutely nowhere.
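
Here's a minimal sketch of that gate. It assumes auth_status is a dict mapping keys to status strings and that needs_reauth is the status that triggers a task. The function name filter_in_scope, the trimmed IN_SCOPE set, and the logger name are illustrative, not the actual contents of main.py.

```python
import logging

logger = logging.getLogger("scanner")  # hypothetical logger name

# Trimmed-down stand-in for the real allowlist.
IN_SCOPE = {"gmail_CLIENT_A", "slack_CLIENT_A", "signal"}


def filter_in_scope(auth_status, in_scope=IN_SCOPE):
    """Split model-reported auth keys into allowed and ignored.

    auth_status is assumed to look like {"slack_CLIENT_A": "needs_reauth"}.
    """
    allowed = {k: v for k, v in auth_status.items() if k in in_scope}
    ignored = sorted(k for k in auth_status if k not in in_scope)
    if ignored:
        logger.warning("Ignored out-of-scope auth keys: %s", ignored)
    return allowed, ignored


reported = {
    "gmail_CLIENT_A": "ok",
    "slack_CLIENT_D": "needs_reauth",  # phantom key the model invented
    "signal": "needs_reauth",
}
allowed, ignored = filter_in_scope(reported)

# Only keys that cleared the allowlist can ever become tasks.
tasks = [key for key, status in allowed.items() if status == "needs_reauth"]
```

The point of the shape: the model's output is treated as untrusted input, and the set membership check is the only path to a side effect.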

Belt and suspenders.

The prompt tells the model what to look at. The code decides what's allowed to cause a side effect. Those are two separate jobs — and only one of them should be handed to a language model.

Why This Matters to Me

I'm building more and more systems where AI agents take action, not just generate text. Create a task. Trigger a webhook. Send a message.

The more consequential those actions are, the more I need to stop relying on "the prompt said not to" as my only gate.

Prompt instructions shape behavior under normal conditions. A code-level allowlist enforces hard limits when behavior drifts — and given enough time, it WILL drift.
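
The same pattern extends past Things tasks. A rough sketch, assuming each side effect is a plain callable whose first argument is a target key. The names gated, ALLOWED_TARGETS, and send_message are hypothetical and not part of Apollo.

```python
from functools import wraps

ALLOWED_TARGETS = {"slack_CLIENT_A", "signal"}  # illustrative allowlist


def gated(allowed):
    """Wrap a side-effecting function so it only runs for allowlisted targets."""
    def decorator(action):
        @wraps(action)
        def wrapper(target, *args, **kwargs):
            if target not in allowed:
                return None  # refuse; a real system would also log this
            return action(target, *args, **kwargs)
        return wrapper
    return decorator


sent = []


@gated(ALLOWED_TARGETS)
def send_message(target, text):
    sent.append((target, text))  # stand-in for a real API call
    return True


send_message("signal", "reauth needed")
send_message("slack_CLIENT_D", "reauth needed")  # blocked: not in the allowlist
```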

If your agent has side effects, add the allowlist. It's twenty lines and it'll save you from a morning of phantom tasks.

P.S. The same commit also fixed an unescaped f-string brace bug in the prompt builder that was raising a NameError. So yes — two ways the scanner was broken at once. Fun morning.
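
For anyone who hasn't hit this one: inside an f-string, single braces are evaluated as Python expressions, so a literal placeholder like {client} in prompt text raises a NameError unless a variable of that name happens to exist. Doubling the braces escapes them. The sketch below is a guess at the shape of the bug, not the actual prompt builder; build_prompt, build_prompt_buggy, and the prompt wording are made up.

```python
def build_prompt(scope_keys):
    keys = ", ".join(sorted(scope_keys))
    # Doubled braces render as literal { and }; only {keys} is interpolated.
    return f"Report status as slack_{{client}}: <status>. In-scope keys: {keys}"


def build_prompt_buggy(scope_keys):
    keys = ", ".join(sorted(scope_keys))
    # Single braces make Python evaluate `client`, which is not defined here,
    # so calling this function raises NameError.
    return f"Report status as slack_{client}: <status>. In-scope keys: {keys}"
```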