A test runner for agentskills.io-style AI agent skills
Updated Jun 17, 2026 - TypeScript
SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).
AI agent evolving strategies through automated self-play overnight. Generic framework with GEPA-inspired feedback loop and Elo tracking.
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"
Create your self-hosted, open-source Operator model.
Isomorphic agent tooling: author once, run anywhere. Build, lint, route, fan out, eval, trace, guard, contract, and ledger AI-agent workflows across Cursor, Claude Code, Codex, and OpenCode.
A CLI tool for evaluating coding agent plugins with a multi-tier approach.
LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.
Deterministic evaluation for LLM agents that reason over state. Scores agents on access, time, causality, and verified absence, not just whether the final answer sounds plausible.
Public Agent Anvil leaderboard submissions and generated index
Build a private evaluation dataset to optimize your organization's token costs.
BondLens AI | Explainable bond analysis agent: AkShare real-time bond data, tool calling, LLM guardrails, red-team evaluation, Docker/CI. Explainable fixed-income analysis agent with live data, evals, and safeguards.
Legal Action Boundary Eval (LABE): public proxy eval for legal AI workflows at the action boundary
Reproducible evaluation harness for hidden coordination variables in multi-agent LLM systems.
Alpha benchmark for repo continuation intelligence
Portable evaluation bundles for agents and agent-shaped workflows: bounded, reproducible, regression-aware proof surfaces for quality claims.
Long-context quality probes and KV-cache research on local GPUs: retrieval is not utilization.
The node-level tracing library for agentic software.
A Multi-Agent System for Cross-Checking Phishing URLs.
Agent evaluation sketches for banking due diligence and research: classification, context verification, pruning, and test-driven workflows.