How I Learned to Block the Call, Not Just Filter the Output

TL;DR: Telling an LLM not to use a tool is not the same as preventing it from using the tool. Enforce scope at two layers, prompt AND harness, or the first layer will eventually fail you.

The Setup

Apollo's CEO briefing daemon runs every morning, fully headless.

It reads my real inbox, pulls from Fathom (my meeting-notes platform) and from my MCP tool provider for calendar and comms, and surfaces what I need to act on before 7am.

The problem: the same business-tools MCP server that feeds it read-only context also exposes send_email, send_slack_message, and a handful of other write channels for three different companies.

Reads my private inbox. Ingests untrusted email. Has write tools that can send messages out. That's the lethal trifecta, and it runs autonomously.

My first instinct was the obvious one: write a tight prompt. List the exact tools the agent should use, forbid everything else. Done. Shipped.

What Broke (and proved the instinct wrong)

A few weeks earlier I had the same instinct with Apollo's auth scanner — a separate agent that polls for expired credentials across those same MCP channels.

The collector.py prompt was explicit: here are the valid channels, do not report on anything else. Sonnet drifted anyway. On 2026-04-24 it surfaced slack_[client]: needs_reauth — that client wasn't in the prompt at all. main.py took the output at face value and created a "Fix Apollo Auth: slack_[client]" task in Things 3, my task manager.

I only caught it because the task looked weird. One noisy phantom task is low stakes. An autonomous agent with a send button is not.

The Two-Layer Fix

For the scanner, I added a code-level allowlist at the consumer boundary: filter the LLM's output against the known channels before it turns into actions. The model can still drift, but the scanner no longer acts on the drift.
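Here is a minimal sketch of that consumer-side allowlist in Python. It assumes the scanner emits one "channel: status" line per finding. The channel names, the line format, and the function names (VALID_CHANNELS, parse_finding, filter_findings) are all hypothetical and are not taken from collector.py or main.py. The important property is that the list of valid channels lives in code the model never writes, and anything outside that list gets logged instead of becoming a task.

```python
"""Hypothetical sketch: a code-level allowlist at the consumer boundary.
Findings for channels outside VALID_CHANNELS are never turned into tasks."""

from __future__ import annotations

# Placeholder channel names; the real list lives in your own config,
# never in the model's output.
VALID_CHANNELS = frozenset({"gmail_work", "slack_acme", "calendar_main"})


def parse_finding(line: str) -> tuple[str, str] | None:
    """Parse one 'channel: status' line, e.g. 'slack_acme: needs_reauth'."""
    channel, sep, status = line.partition(":")
    if not sep:
        return None
    return channel.strip(), status.strip()


def filter_findings(raw_output: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Split model output into accepted findings and rejected (drifted) lines."""
    accepted: list[tuple[str, str]] = []
    rejected: list[str] = []
    for line in raw_output.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        finding = parse_finding(line)
        if finding is not None and finding[0] in VALID_CHANNELS:
            accepted.append(finding)
        else:
            rejected.append(line)  # unknown channel or malformed: log, never act
    return accepted, rejected
```

Only the accepted list would go to whatever creates the Things 3 tasks. A line like "slack_[client]: needs_reauth" ends up in rejected.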

For the briefing daemon I went a step further: block the call entirely at the harness layer.

The key flags when invoking claude -p:

Drop --dangerously-skip-permissions — that flag equals bypassPermissions, which makes any allowlist you write completely irrelevant. Everything auto-approves. I had this on "for convenience." It had to go.
--permission-mode dontAsk — deny-by-default. In headless -p, a non-allowed tool aborts cleanly. No prompt, no hang, no question. Just: nope.
--allowedTools and --disallowedTools — pass each as a single comma-joined string, not space-separated variadic args. Variadic form risks --allowedTools swallowing --disallowedTools entirely, silently voiding your deny block. This one bit me before I read the docs carefully.
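Put together, an invocation might look like this sketch. It only builds the argument list. run_briefing shows how the list could be executed with subprocess, but nothing calls it here because it needs the claude CLI on the PATH. The flags come from the list above. The read-tool names in ALLOWED_TOOLS are placeholders I made up, so substitute whatever your MCP server actually exposes.

```python
"""Hypothetical sketch: assemble a headless `claude -p` invocation with
deny-by-default permissions. Read-tool names are placeholders."""

from __future__ import annotations

import shlex
import subprocess

# Placeholder read-only tools; substitute the real names your server exposes.
ALLOWED_TOOLS = [
    "mcp__business-tools__list_emails",
    "mcp__business-tools__get_calendar",
]
# The whole write surface, blocked with one glob.
DISALLOWED_TOOLS = ["mcp__business-tools__send_*"]


def build_briefing_command(prompt: str) -> list[str]:
    """Return the argv list; note there is no --dangerously-skip-permissions."""
    return [
        "claude",
        "-p",
        prompt,
        "--permission-mode",
        "dontAsk",  # deny-by-default in headless mode
        # One comma-joined string per flag, not space-separated variadic args.
        "--allowedTools",
        ",".join(ALLOWED_TOOLS),
        "--disallowedTools",
        ",".join(DISALLOWED_TOOLS),
    ]


def run_briefing(prompt: str) -> str:
    """Execute the briefing; requires the claude CLI to be installed."""
    result = subprocess.run(
        build_briefing_command(prompt), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(build_briefing_command("Write this morning's briefing.")))
```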

MCP tool names follow the pattern mcp__<server>__<tool> — e.g. mcp__business-tools__send_email. Globs work after the server prefix (mcp__business-tools__send_*), which is how I block the whole write surface in one line.
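To make the naming and glob semantics concrete, here is a simplified model using Python's fnmatch. It illustrates the behaviour I rely on: deny patterns win, and anything that isn't allowed is refused. It is not Claude Code's actual matcher. split_tool_name and is_permitted are hypothetical helpers, and list_emails is a placeholder tool name.

```python
"""Simplified, hypothetical model of deny-by-default tool matching.
Illustrates glob semantics with fnmatch; not Claude Code's own matcher."""

from __future__ import annotations

from fnmatch import fnmatchcase
from typing import Iterable


def split_tool_name(name: str) -> tuple[str, str]:
    """Split 'mcp__<server>__<tool>' into (server, tool)."""
    parts = name.split("__", 2)
    if len(parts) != 3 or parts[0] != "mcp":
        raise ValueError(f"not an MCP tool name: {name!r}")
    return parts[1], parts[2]


def is_permitted(name: str, allowed: Iterable[str], denied: Iterable[str]) -> bool:
    """Deny patterns win; otherwise the tool must match an allow pattern."""
    if any(fnmatchcase(name, pattern) for pattern in denied):
        return False
    # dontAsk-style default: anything not explicitly allowed is refused.
    return any(fnmatchcase(name, pattern) for pattern in allowed)


if __name__ == "__main__":
    allowed = ["mcp__business-tools__*"]  # deliberately broad, to show deny precedence
    denied = ["mcp__business-tools__send_*"]
    for tool in ("mcp__business-tools__list_emails", "mcp__business-tools__send_email"):
        print(tool, "->", "allowed" if is_permitted(tool, allowed, denied) else "denied")
```

In this model, a deny match short-circuits before the allowlist is consulted. Widening the allowlist later (say, to mcp__business-tools__*) still cannot re-enable the send tools.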

Empirically verified: the briefing ran the next morning, covered everything it needed, and the send tools never fired. dontAsk denied them cleanly.

Why This Matters

The real lesson isn't about flags.

It's that prompt-only is soft — you're relying on model compliance, and models drift. And harness-only is blind — the model won't self-correct if it doesn't know the rule.

Both layers together: a model that tries to stay in scope AND a guardrail that makes drift a no-op.
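One way to keep both layers honest is to generate them from the same list, so the prompt framing and the --allowedTools value cannot diverge. This is a minimal sketch, and the ToolScope structure and tool names are hypothetical. The framing string would go into the prompt, and the flag string would go into the claude -p invocation.

```python
"""Hypothetical sketch: derive the prompt framing and the --allowedTools value
from one list, so the two layers cannot drift apart."""

from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolScope:
    name: str     # full MCP tool name: mcp__<server>__<tool>
    purpose: str  # one-line description the model sees in its prompt


# Placeholder scope; these tool names are made up for illustration.
BRIEFING_SCOPE = (
    ToolScope("mcp__business-tools__list_emails", "read today's inbox"),
    ToolScope("mcp__business-tools__get_calendar", "read today's meetings"),
)


def prompt_framing(scope: tuple[ToolScope, ...]) -> str:
    """Layer 1: tell the model which tools it has and what each is for."""
    lines = ["You may use only these tools:"]
    lines += [f"- {tool.name}: {tool.purpose}" for tool in scope]
    lines.append("Do not call any other tool, even if it appears to be available.")
    return "\n".join(lines)


def allowed_tools_flag(scope: tuple[ToolScope, ...]) -> str:
    """Layer 2: the same list as one comma-joined --allowedTools value."""
    return ",".join(tool.name for tool in scope)
```

With this setup, adding a tool takes one edit, and both the model's picture of its scope and the harness's enforcement pick up the change together.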

If you're shipping any LLM agent where the tool surface is wider than the agent's intended scope — and in practice, it almost always is — add the code-level block from day one. Don't wait for the incident to prove the pattern.

P.S. The two-part rule I keep coming back to: allowlist without framing and the agent doesn't know the tool exists. Framing without allowlist and you get "tool not permitted." Both, always.