When Your LLM Guardrail Is Too Good at Its Job

TLDR: A HARD_RULES block I wrote to protect financial data was killing aggregate financial questions too. The fix wasn't removing the rule — it was scoping it precisely.

The Setup

I've been building an internal docs chat that lives on top of a canvas of business processes.

Events. Orders. Registration flows. Zoom calls. Support workflows. The whole map.

The point of the chat is simple: someone asks a question about how we run the business and gets a real answer. Fast.

So naturally, the first thing I did was write guardrails — HARD_RULES in the system prompt, hard-coded constraints to keep the model from surfacing things it shouldn't.

Good instinct. Terrible execution.

What I Over-Tightened

Rule 2 was a broad block on financial data.

My reasoning? Solid. I've learned the hard way — on more than one system — that you cannot trust a model to self-police when sensitive data is sitting right there in context. If the number is in the prompt, the model will eventually find a way to say it out loud.

So I wrote a rule. A blunt one.

...and then someone asked "what were total event registrations and orders this month?"

Stonewalled.

Completely legitimate business question. Totally reasonable thing to want from a docs chat. And my guardrail just killed it.

The Fix

One commit. One line in the rule.

narrow HARD_RULES rule 2 — allow aggregate financial data

That's it.

Individual record? Blocked. What a customer paid for ticket #4892? Not happening.

Aggregate summary? Totally fine. Total revenue for the month? Total orders for an event? The system should answer those without blinking.

The distinction was obvious in hindsight. But when you're in "lock it ALL down" mode, you write blunt rules. And blunt rules are indiscriminate.

Why I Keep Walking Into This

Here's the honest part.

My default instinct on LLM guardrails is always tighten, never trust the model. That's documented. That's proven. I've got scanner agents where prompt-only enforcement drifted and the code-level filter saved us. I believe in defense in depth.

But there's a difference between tight and blunt.

Tight means "blocks the bad thing, only the bad thing."

Blunt means "blocks the whole category because you couldn't be bothered to define what you actually care about."

A rule that says "no financial data" treats show me customer payment details and how many orders did we run last quarter? as the SAME query. They are absolutely not the same query. One is a privacy violation waiting to happen. The other is a Tuesday morning ops check.

If you can't express the distinction in the rule itself, you haven't done the thinking yet.

Why This Matters to Me

I ship chat systems into businesses that depend on them for real decisions.

A chat that stonewalls reasonable questions isn't a safety feature — it's a broken product. The guardrails have to be precise enough to block what's dangerous without strangling what's useful. That's harder than writing a broad rule. It takes five more minutes and an actual opinion about where the line is.

Write the line. Don't just draw a wall.