What is an Agent Failure Boundary?

Name: Wyatt by Plarix
Author: Plarix

AFB01 through AFB04 define the four distinct ways an AI agent system can fail. Here is what each one means and why the distinction matters for anyone building agents in production.

When you deploy an AI agent — one that can call tools, read files, make API requests, send messages — you are deploying something that can take real actions in the world. The question is not whether your agent has the capability to do damage. It almost certainly does. The question is whether anything is stopping it when it should not act.

Agent Failure Boundaries (AFBs) are the answer to that question. They are a security taxonomy developed by Plarix that identifies the four distinct points in an agentic execution loop where failure becomes observable, exploitable, and consequential. The loop is simple: Context → Model → Agent → Act. Each boundary marks a transition where something can go wrong.

AFB01Context Poisoning

Every agent receives context: user input, retrieved documents, tool outputs, memory. Context Poisoning happens when that context has been corrupted, forged, or manipulated before the model sees it.

The attack is direct. An attacker embeds a hidden instruction in a document your agent retrieves — a release note, a support ticket, a web page. The model reads it. The hidden instruction fires. The agent behaves according to attacker intent rather than user intent.

A real example: an agent instructed to summarize customer emails encounters a message containing . The model ingests it as context. Context Poisoning.

The attack surface grows with every external source the agent is allowed to read. An agent that retrieves from the web, a database, or a document store is one poisoned input away from compromise. Full AFB01 definition →

AFB02Model Boundary Compromise

The model sits at the center of every agent. It receives inputs and produces outputs that the agent acts on. Model Boundary Compromise describes failures at that input/output boundary itself: system prompt extraction, model inversion, or manipulation of the pipeline in ways that subvert the model's intended constraints.

This is the subtlest AFB. The attack is not in the agent's tools or in external data — it targets the interface between your system and the model. An attacker who can extract your system prompt learns your constraints. An attacker who can manipulate what the model receives has bypassed every instruction you put there.

It maps directly to OWASP LLM01 (Prompt Injection), LLM02 (Insecure Output Handling), and supply-chain risk categories. Full AFB02 definition →

AFB03Instruction Hijack

If AFB01 is about poisoning what the model reads, Instruction Hijack is about what happens after. It describes the moment when model output — rather than reflecting user intent — reflects attacker intent instead, and the agent executes it.

The distinction from AFB01 matters. Context Poisoning corrupts the input. Instruction Hijack is the downstream consequence where compromised model output becomes executable agent instructions. Your agent treats model output as ground truth. Instruction Hijack exploits exactly that trust.

A practical example: an agent retrieves a customer record containing injected instructions. The model processes them and outputs {"action": "delete_account", "target": "all"}. The agent, seeing valid-looking model output, executes it.

Full AFB03 definition →

AFB04Unauthorized Action

This is the most operationally visible failure boundary — and the most common. Unauthorized Action occurs when the agent executes something it was never authorized to do, because no policy layer exists to stop it.

Give an agent access to file deletion. Give it a system prompt that says "only delete temporary files." Ask it to clean up the project. In the absence of a runtime enforcement layer, the model decides what counts as temporary. If it decides wrong — or is manipulated — it deletes the wrong files. There is no gate between the model's decision and execution.

This is what Wyscan detects statically: reachable operations from tool registrations that have no authorization gate. This is what Wyatt prevents at runtime: the execution of any tool call that does not have explicit policy permission. Full AFB04 definition →

The Authorization Gap

Underlying all four boundaries — but most directly relevant to AFB04 — is the Authorization Gap: the space between what a system prompt says and what runtime actually enforces.

Every major agent framework ships without enforcement. System prompts are instructions to the model. They are not policy. They cannot prevent anything. The Authorization Gap is the distance between "the model was told not to" and "the model was prevented from." That gap, in every production agent system today, is wide open.

Wyatt closes it by intercepting every tool call before execution and evaluating it against a declarative policy. No policy match, no execution.

What the AFB Taxonomy Is For

The AFBs are not a checklist. They are a map of the execution loop — Context → Model → Agent → Act — with the failure points marked at each transition. AFB01 is a failure at the context input. AFB02 is a failure at the model boundary. AFB03 is a failure at the model output. AFB04 is a failure at execution.

Securing an agent system means closing each transition. The taxonomy gives security engineers and developers a shared language to describe exactly where a system is exposed — and what it would take to close it.

If you are building agents in production, start with AFB04 and Wyscan. The exposure is likely already in your codebase.

View full glossary →← All posts