Concepts & Theory
Threat model

Threat model

The shift agents introduce

A classic web app trusts its own code and treats requests as untrusted. An AI agent adds a new, stranger actor inside the trust boundary: a model that takes untrusted input and emits both text and actions. The model is not malicious — but it is steerable by whoever controls its input. That makes three new surfaces hostile.

flowchart TB Att[Attacker-controlled prompt] --> Model Model -->|chooses| ToolArgs[Tool arguments] Model -->|emits| Output[Text / structured output] Model -->|triggers| Destructive[Destructive actions] ToolArgs -.IDOR.-> AppDB[(Your data)] Output -.XSS / exfil.-> UI[Your UI / logs] Destructive -.unauthorized.-> Money[Irreversible effects]

Assets

Other users’ data — reachable via confused-deputy tool calls.
Your UI and downstream systems — where untrusted output is rendered, parsed, logged.
Irreversible side effects — refunds, deletions, outbound messages.
The audit trail itself — forensic evidence that must survive tampering and erasure pressure.

Adversary capabilities (assumed)

Full control of the prompt text, including Unicode tricks (homoglyphs, zero-width, case), and the ability to iterate.
Ability to steer tool-argument selection and output content.
No ability to execute code in your app or write to your database directly. (If they can, this package is not your problem.)

Postures (and which control enforces them)

Surface	Posture	Control
Tool arguments	re-scope owner keys + schema-validate; reject unknown	A
Prompts	normalize → screen → refuse pre-model → append-only audit	B
Output	escape/allowlist HTML, defang markdown, validate structure, redact PII	C
Destructive actions	human-gated approval, fail-closed	D

Fail-safe principle

Every failure path fails closed: a PCRE error blocks, an unavailable approval system denies, a missing config value resolves to the safer state. The reasoning is in untrusted-input posture.

This package defends the deterministic, server-side boundary. It does not make the model itself robust, and it cannot stop an attack that lives entirely in semantics the patterns don’t capture — which is exactly why the append-only audit matters: even an undetected attempt is recorded for forensics.

Last updated: Edit this page

Threat model

The shift agents introduce

Assets

Adversary capabilities (assumed)

Postures (and which control enforces them)

Fail-safe principle

onThisPage