Skip to content

AgentThreatBench: OWASP Agentic Top 10 benchmark for indirect prompt injection and memory poisoning #1668

@vgudur-dev

Description

@vgudur-dev

Proposal: Add AgentThreatBench to the OpenAI Evals registry

AgentThreatBench is an evaluation suite that operationalizes the OWASP Top 10 for Agentic Applications (2026) into executable benchmark tasks. It was recently merged into the official UK AI Safety Institute's inspect_evals repository.

What it evaluates

Unlike existing evals that test chatbot-style safety, AgentThreatBench specifically targets agentic threat vectors — attacks that arrive through tool outputs, memory stores, and environment data rather than user prompts:

Task OWASP Category Attack Vector
Memory Poisoning ASI06 Adversarial entries in RAG/memory store
Autonomy Hijack ASI01 Indirect prompt injection via read_inbox tool output
Data Exfiltration ASI01 Malicious payload in lookup_customer tool response

Dual-metric scoring

Each task scores on both:

  • Utility: Did the agent complete the legitimate task?
  • Security: Did the agent resist the attack?

An agent only "passes" if it scores 1.0 on both metrics — capturing the security/utility tradeoff that standard evals miss.

How to run it today

pip install inspect_evals

# Memory poisoning against GPT-4o
inspect eval inspect_evals/agent_threat_bench_memory_poison --model openai/gpt-4o

# Autonomy hijack against GPT-4o
inspect eval inspect_evals/agent_threat_bench_autonomy_hijack --model openai/gpt-4o

Resources

Would love to discuss adding this to the OpenAI Evals registry or referencing it in the evals documentation as a resource for agentic security evaluation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions