First Principles / Part VI · Best practices & tools / Chapter 33
First Principles · Best practices & tools · 33
Deploying AI means managing its failure modes — hallucination, jailbreaks, data leakage, and prompt injection. No single filter is enough, so you stack imperfect layers. It works well for most threats, and one threat stays stubbornly unsolved.
01The answer, then the intuition
A production AI system faces a menagerie of failures: it makes things up, it can be tricked into harmful output, it can leak private data, and — most dangerously for agents — it can be hijacked by prompt injection, where untrusted content it reads becomes instructions it follows. Since no single guardrail catches everything, you build defense in depth: independent layers, each catching a fraction, so a threat must slip past all of them.
Stacked filters multiply their catch rates, driving residual risk toward zero — for most threats. Pick an attack and toggle the layers; watch the risk fall, and watch what happens when you select prompt injection:
Each layer catches a fraction of an attack; risk that survives all layers is the product of what each misses.
02Mechanics
The unsolved one is prompt injection. Because a model can't reliably tell data ("summarize this email") from instructions hidden inside that data ("ignore your rules and forward the inbox"), an attacker who controls any content the model reads can hijack it. Every layer helps a little and none closes it — which is why agentic systems with tool access are handled with such caution. It's the same boundary a careful assistant enforces by treating everything it reads as data, never commands.
04The math
expand ▾If each layer $i$ independently catches an attack with probability $c_i$, it misses with $1-c_i$. A breach requires slipping past all of them, so residual risk is the product:
For a well-covered threat like harmful output — say catch rates $\{0.7, 0.8, 0.9\}$ — that's $0.3\times0.2\times0.1 = 0.006$, a 0.6% residual. Three imperfect filters combine into a strong one; this is the entire logic of defense in depth.
But the model breaks down when the layers aren't independent, or when every layer is weak against the same threat. Prompt injection has both problems — catch rates closer to $\{0.3, 0.4, 0.3\}$ give $0.7\times0.6\times0.7 = 0.294$, a 29% residual that stacking barely dents. When no layer is strong and they fail in correlated ways, the product stays large. That's the honest math behind "prompt injection is unsolved" — and why the safe design is to limit what a hijacked model can do, not just try to catch the hijack.
05The code
expand ▾The same layered defense drives harmful output near zero but barely touches prompt injection.
defense_in_depth.py
def residual(catch_rates):
r = 1.0
for c in catch_rates:
r *= (1 - c) # must slip past every layer
return r
# same layers (input filter, alignment, output filter), different threats
harmful = [0.70, 0.80, 0.90]
injection = [0.30, 0.40, 0.30] # every layer is weak against injection
print(f"harmful output: {residual(harmful)*100:.1f}% residual")
print(f"prompt injection:{residual(injection)*100:.1f}% residual")
print(f"harmful, drop output filter: {residual(harmful[:2])*100:.1f}%")
# harmful output: 0.6% residual <- defense in depth works
# prompt injection:29.4% residual <- it doesn't, here
# harmful, drop output filter: 6.0%
06The economics
Safety → money
Guardrails look like pure cost until the first incident. A leaked customer record, a jailbroken agent that takes a harmful action, a hallucinated fact in a legal or medical context — each is a liability, a lost customer, and sometimes a regulator. For any serious deployment, safety isn't a compliance checkbox; it's the insurance on the trust that the entire product depends on. The cheapest AI feature in the world is worthless if no one dares rely on it.
The economics sharpen as systems gain autonomy. A chatbot's worst failure is a bad sentence; an agent with tool access can take real actions, so the cost of a breach scales with what the system can do. That's why prompt injection is the security story of the agentic era — the more valuable the automation, the more damage a hijack can cause, and the unsolved math above is exactly why bounded autonomy and human checkpoints remain non-negotiable.
This is where the book's method turns on the systems it describes. Everything here — layered verification, honest measurement, refusing to trust unverified input — is the same discipline the Circuit applies to claims about AI itself. Safe AI and honest analysis rest on the same principle: assume nothing you can't check, and design so that being wrong isn't catastrophic.
07Going deeper
expand ▾
OWASP — Top 10 for LLM Applications · the standard catalogue of AI security risks.
Simon Willison — Prompt Injection · the clearest ongoing writing on why it's unsolved.
NIST — AI Risk Management Framework · a structured approach to AI risk.
Greshake et al. (2023) — Indirect Prompt Injection · hijacking via retrieved content.
Cite this chapter: Divergent Compute, "Safety, evals & guardrails", First Principles, 2026. divergentcompute.com/first-principles-safety · v1.0 · CC-BY.