Proposal: Add AgentThreatBench to the OpenAI Evals registry
AgentThreatBench is an evaluation suite that operationalizes the OWASP Top 10 for Agentic Applications (2026) into executable benchmark tasks. It was recently merged into the official UK AI Safety Institute's inspect_evals repository.
What it evaluates
Unlike existing evals that test chatbot-style safety, AgentThreatBench specifically targets agentic threat vectors — attacks that arrive through tool outputs, memory stores, and environment data rather than user prompts:
| Task |
OWASP Category |
Attack Vector |
| Memory Poisoning |
ASI06 |
Adversarial entries in RAG/memory store |
| Autonomy Hijack |
ASI01 |
Indirect prompt injection via read_inbox tool output |
| Data Exfiltration |
ASI01 |
Malicious payload in lookup_customer tool response |
Dual-metric scoring
Each task scores on both:
- Utility: Did the agent complete the legitimate task?
- Security: Did the agent resist the attack?
An agent only "passes" if it scores 1.0 on both metrics — capturing the security/utility tradeoff that standard evals miss.
How to run it today
pip install inspect_evals
# Memory poisoning against GPT-4o
inspect eval inspect_evals/agent_threat_bench_memory_poison --model openai/gpt-4o
# Autonomy hijack against GPT-4o
inspect eval inspect_evals/agent_threat_bench_autonomy_hijack --model openai/gpt-4o
Resources
Would love to discuss adding this to the OpenAI Evals registry or referencing it in the evals documentation as a resource for agentic security evaluation.
Proposal: Add AgentThreatBench to the OpenAI Evals registry
AgentThreatBench is an evaluation suite that operationalizes the OWASP Top 10 for Agentic Applications (2026) into executable benchmark tasks. It was recently merged into the official UK AI Safety Institute's
inspect_evalsrepository.What it evaluates
Unlike existing evals that test chatbot-style safety, AgentThreatBench specifically targets agentic threat vectors — attacks that arrive through tool outputs, memory stores, and environment data rather than user prompts:
read_inboxtool outputlookup_customertool responseDual-metric scoring
Each task scores on both:
An agent only "passes" if it scores 1.0 on both metrics — capturing the security/utility tradeoff that standard evals miss.
How to run it today
Resources
Would love to discuss adding this to the OpenAI Evals registry or referencing it in the evals documentation as a resource for agentic security evaluation.