Agents Are in Production. Your Controls Probably Aren't.
Back to all posts

Agents Are in Production. Your Controls Probably Aren't.

10 min read
#agentic-ai #ai-red-team #ai-security #owasp

“We didn’t think anyone would actually wire an MCP server to a CRM with write access in production. Then we found three of them in the same engagement.”

  • Paraphrased from a recent agentic AI security review

A Year Ago This Was Theory. Now It Isn’t.

In April 2025, anyone trying to threat-model an AI agent had mostly speculation to work with. There were taxonomies, frameworks, hand-waves about prompt injection. The serious red-team work was happening, but most of it wasn’t published, and almost none of it was specific.

A year later, that’s changed. Microsoft’s AI Red Team released an updated taxonomy of agentic AI failure modes in April 2026 that cataloged seven failure classes they hadn’t predicted in their original 2025 version. OWASP shipped its Top 10 for Agentic Applications. Cloud Security Alliance published a 62-page red-teaming guide. MITRE released SAFE-AI to map agentic risk to NIST 800-53.

The frameworks are now converging on the same picture, from different starting points. Which means we’ve moved past the part where security teams can claim the threat surface is too new to act on. It isn’t. The evidence base from 2025-2026 red-team engagements is specific, repeatable, and increasingly hard to ignore.

This post is about what to do with that evidence, depending on where your team actually sits.


The Failure Modes That Matter Right Now

The shortlist, in plain English:

  • Agentic supply chain compromise - natural-language payloads hidden in plugin manifests, MCP tool descriptions, and prompt templates. Behavior changes without anyone touching a binary.
  • Goal hijacking - an attacker bends the agent’s objective. The agent thinks it’s doing its job. The user thinks they’re getting honest output.
  • Inter-agent trust escalation - confused-deputy at the natural-language layer. Sub-agent claims a role, orchestrator believes it.
  • CUA visual attacks - adversarial pixels, fake approval buttons, low-contrast banners. Computer-use agents are blind to things humans would catch.
  • Session context contamination - early-session bias accumulates. No single step looks anomalous. The compound effect is the attack.
  • MCP/plugin abuse - tool-description poisoning, cross-server instruction override, protocol-trust abuse. The fastest-growing category I’m seeing.
  • Capability/architecture disclosure - the agent leaks its tool list, schemas, or memory structure. This is the reconnaissance step that turns generic prompt injection into precision attacks.

You’ll notice none of these are model-level problems. They’re system-level. Which is why your existing red-team approach - if it’s still pointed at the model - is missing them.


Triage by Where You Actually Are

Listing failure modes by severity produces a roadmap nobody can execute. The honest cut is by maturity. The startup shipping its first MCP-enabled agent has a different problem than the bank running orchestrated agent networks. Here’s how I’d sequence it.

Startup or First Production Agent: Get the Floor Right

The goal isn’t comprehensive coverage. It’s not failing on the basics that already have public exploits.

What bites first:

  • MCP/plugin abuse. If your agent connects to any MCP server you didn’t write, you have this risk today. Tool descriptions are natural-language code the model treats as instructions. A poisoned description isn’t a vulnerability the model can defend against - it’s working as designed.
  • Capability/architecture disclosure. Cheap to exploit, cheap to fix. If your agent answers “what tools do you have?” honestly, you’ve handed an attacker the schema for every follow-on payload.
  • Excessive agency. The boring one that keeps delivering incidents.

What to ship this quarter:

  • Pin and vet every external MCP server, plugin, and tool description. Treat the tool registry like any other dependency: version-pinned, change-monitored, signature-verified where possible. OWASP ASI04 (Supply Chain) covers this directly.
  • Default-refuse on introspection. Decline tool-list, schema, and system-prompt disclosure requests - whether they come from a user, a retrieved document, or a peer agent.
  • Minimize privileged surface. An agent that can’t perform high-impact actions is less valuable to leak. Smallest permission set wins.
  • Lock down human-in-the-loop for irreversible actions. Tiered approval. More friction for cross-tenant, financial, or destructive operations. Yes, it slows things down. That’s the point.

This is the floor. If you’re not doing these, everything below is academic.

Mid-Market: Build for the Compound Attacks

Once you have multiple agents in production, or agents that share memory across sessions, attacks stop being single-step and start being chains. The patterns showing up in recent red-team work aren’t novel exploits - they’re combinations of known weaknesses that compound into something nobody had thought to test for.

What bites next:

  • Goal hijacking. I see teams underestimate this one. An attacker doesn’t need to take over the agent. They just need to bend the objective. “Recommend the product that best matches the user’s needs” becomes “recommend the product that has secretly higher commission for the attacker.”
  • Session context contamination. Each individual step looks fine. The fifteenth step is the one where the agent does something it shouldn’t, and forensics finds the root cause was a single document retrieved on step three.
  • Memory poisoning and HitL bypass. The original sins. Still under-mitigated. The 2025 email-assistant case study showed 80%+ attack success after a single system prompt change. That hasn’t gotten easier to defend against.

What to add at this stage:

  • Context-provenance tracking. Every token in the agent’s context carries a source tag - trusted system prompt, user turn, retrieved document, tool response, peer agent. Policy decides which classes can influence which actions. This is the single most impactful architectural change available right now, and it is not something you can bolt on later.
  • Deterministic HitL invocation. The agent doesn’t decide when to ask for approval. The system decides, based on action class. Compound actions get decomposed - if the agent asks for one approval to do five things, the approval shows all five.
  • Semantic summarization of approval prompts. Don’t display the agent’s own description of what it’s about to do. Summarize from the underlying tool calls. This kills description laundering - where an agent re-explains a dangerous action in benign terms because that’s what the upstream prompt told it to do.
  • Memory integrity checks. Validate memory reads against declared sources. Watch ratio shifts - if 12% of memory writes used to come from external content and now it’s 31%, that’s a signal.
  • Map your controls to OWASP ASI01-ASI04. Gives your coverage external scaffolding and makes audit conversations dramatically easier.

Enterprise: Get Ahead of the Network Effects

For teams running agent ecosystems - multiple business units, orchestrated workflows, agents that delegate to other agents - the failure modes that matter are the ones that emerge from the network, not from any single agent. The “societies of agents” framing is no longer a future problem at this scale.

What bites here:

  • Inter-agent trust escalation. Sub-agent says “I’m the orchestrator’s deputy, grant me admin scope.” The orchestrator has no cryptographic way to verify the claim. Privileges expand. This is the agent-network equivalent of an internal AD trust failure, except the trust signal is natural language.
  • Agentic supply chain compromise. Your agents consume tool definitions, prompt templates, skill repositories, and MCP servers from dozens of internal and external sources. Snyk’s ToxicSkills research documented 1,467 vulnerable agent skills in public registries. At enterprise scale, you have this in your internal registry too. You just haven’t found it yet.
  • CUA visual attacks. Only relevant if you’re shipping computer-use agents - but if you are, the screenshot-loop attack surface is real and not covered by existing controls.

What to architect for:

  • Zero-trust inter-agent identity. Every agent gets an attestable credential at provisioning. Every inter-agent message carries a verifiable identity claim. Orchestrators verify the credential chain before making privilege decisions, not after. MITRE SAFE-AI maps this directly to NIST SP 800-53 Rev 5 access-control families, which gives you a way to talk to enterprise risk teams in language they already speak.
  • Agentic SBOM. A software bill of materials that includes tool dependencies, not just code dependencies. Pin versions. Monitor changes. Even “patch” version bumps can change natural-language tool behavior.
  • Adversarial session hardening at scale. Bounded session contexts. Session-integrity monitoring. Policies that limit tool calls once a session has accepted external content. A session is a security boundary, not a scratchpad.
  • Network-level anomaly detection. Not just per-agent. Look at cross-agent flows: is one agent delegating more than baseline? Are coalitions of agents forming that didn’t exist last week? You want telemetry for this before you need it.

What the Frameworks Underplay

This is where my experience and the published material diverge.

HitL UX is still failing in ways the guidance underplays. The principle that “UX design is a security control” is right. It’s also describing an industry state that hasn’t shipped. In actual production agents I’ve reviewed, approval prompts are still bare strings. There is no tiered approval. There is no semantic summarization. The “trust this folder” pattern that TrustFall exploited in May 2026 is the rule, not the exception. The gap between best practice and field reality is the largest single risk I see.

Vendor selection is the silent control. The frameworks talk about plugins, MCP servers, and tool descriptions as artifacts to harden. In practice, the most consequential control is which vendors you let into the agent’s tool inventory in the first place. I’ve watched organizations spend three weeks debating prompt-injection filters while signing a contract with an MCP vendor whose entire change log is “improvements.” Get vendor selection right and half the supply-chain failure modes become tractable.

Internal agents are not less risky than external ones. A pattern I keep encountering: teams apply rigorous controls to customer-facing agents and basically zero controls to the internal IT agent, the data-science agent, the engineering-productivity agent. The blast radius of an internal agent with access to source code, CI/CD, and credentials is usually larger than the chatbot.

The next wave is agent skill marketplaces. This is where I expect the next class of attacks - the equivalent of npm slopsquatting, but for natural-language capabilities. If your security program doesn’t have a story for vetting skills before they enter your ecosystem, you’re going to have one within six months.


Three Things I’m Changing in My Own Program

  1. Context-provenance tracking is now a standing item on my agent design-review checklist. If it’s missing, that’s a finding.
  2. Capability/architecture disclosure is a Sev-2 by default. Reconnaissance is what makes everything else cheaper.
  3. “We have human-in-the-loop” is no longer an acceptable control answer without follow-up. Deterministic invocation? Compound-action decomposition? Tiered approval? If the answer to any of those is no, HitL is a check-the-box control, not a real one.

A year ago, agentic AI security was an argument about which threats were real. That argument is mostly over. The teams that act on the answers are going to be in dramatically better shape than the ones still waiting for more evidence.


Sources

  1. Taxonomy of Failure Modes in Agentic AI Systems, v2.0 - Microsoft AI Red Team, April 2026
  2. Taxonomy of Failure Mode in Agentic AI Systems (v1.0) - Microsoft AI Red Team, April 2025
  3. OWASP Top 10 for Agentic Applications - OWASP Gen AI Security Project
  4. Agentic AI Red Teaming Guide - Cloud Security Alliance
  5. MITRE SAFE-AI Full Report - MITRE
  6. NIST AI 600-1, Artificial Intelligence Risk Management Framework - NIST
  7. Announcing the CoSAI Principles for Secure-by-Design Agentic Systems - Coalition for Secure AI
  8. ToxicSkills: Malicious AI Agent Skills - Snyk, February 2026
  9. TrustFall Attack Reveals AI Supply Chain Threat - NetworkUstad, May 2026