The Guardrails Teams Deploy, and Where They Crack
Ask a team how they secure their LLM feature and you usually get a list of guardrails. An input filter. An output classifier. A system prompt with firm rules. Maybe a human who signs off on anything risky. This is defense in depth, and it is the right instinct. The trouble is that most of these controls are probabilistic, and attackers do not attack them one at a time.
Here is the honest map of what teams deploy and where each piece tends to give.
Input filters and prompt classifiers
The idea: catch known-bad prompts before they reach the model. Where it cracks: the input space is all of language plus every encoding the model can still read. Paraphrase it, translate it, base64 it, space it out, swap in homoglyphs, or just phrase the request in a way the classifier has not seen. Filters move the bar. They do not set a boundary.
Output filters and moderation
The idea: scan the model’s response and block the bad ones. Where it cracks: get the content out in a shape the filter misses (encoded, split across turns, described instead of stated), or skip the text entirely and cause the harmful action through a tool call before any output is generated. Output filtering assumes the damage is in the words. Often it is in the side effects.
Spotlighting and delimiting
The idea: mark user content clearly so the model treats it as data, not instruction. Where it cracks: the markers are themselves just text. Anti-spotlighting payloads break out of the wrapper, close the delimiters, or carry instructions that survive being quoted. It helps, but it is a convention the attacker can write too.
System-prompt hardening
The idea: tell the model “never reveal this, only answer about X.” Where it cracks: those rules live in the same context window as the attack, with no special authority. Injection competes with them on equal footing, and the longer the conversation runs, the more the rules drift.
Structured output and function schemas
The idea: constrain the model to a fixed set of functions with typed arguments. This one is genuinely stronger, because it shrinks what the model can express. Where it cracks: in what the allowed functions can do, and in how their arguments get validated downstream. The model is contained. The function might not be.
Human in the loop
The idea: a person approves consequential actions. For high-impact, irreversible steps this is the strongest control on the list. Where it cracks: approval fatigue, and approvals that are too broad (“allow this tool” instead of “allow this action”). A human who clicks approve on everything is not a control.
Egress and network control
Underused, and high value. If a tool cannot reach the open internet, a successful injection has nowhere to send what it stole. This is one of the few controls that limits impact instead of trying to predict intent. Use it.
The pattern
Notice what most of these have in common. They try to guess intent from text, and they only fail independently if you are lucky. Real attackers chain. They encode to beat the input filter, wrap to beat the spotlighting, escalate over several turns to beat the single-shot classifier, and aim at the one tool whose function actually does something. Defense in depth only holds if the layers fail for different reasons, and a lot of AI guardrails fail for the same reason: each one is a model judging text.
So we test them the way attackers beat them: one layer at a time, then all at once. That is what AiDx is built for, and it is the subject of the next post.