agent-evals

Here are 53 public repositories matching this topic...

darkrishabh / agent-skills-eval

A test runner for agentskills.io-style AI agent skills

cli yaml typescript ai-agents jsonl llm-evaluation llm-evals agent-evals agent-skills openai-compatible agentskills

Updated Jun 17, 2026
TypeScript

HumphreySun98 / repoagentbench

SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).

benchmark developer-tools ai-agents aider llm-eval coding-agents agent-evals swe-bench gemini-3-1-pro claude-opus-4-7 gpt-5-5

Updated Apr 30, 2026
Python

MrTsepa / autoevolve

AI agent evolving strategies through automated self-play overnight. Generic framework with GEPA-inspired feedback loop and Elo tracking.

python genetic-algorithm evolutionary-algorithms game-ai autonomous-agents ai-agents self-play prompt-optimization llm-agents agent-evals

Updated Jun 18, 2026
Python

The-Swarm-Corporation / StatisticalModelEvaluator

An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"

ai ml multiagent agents llms evals llm-evals agent-evals multi-agent-eval

Updated Oct 6, 2025
Python

iMeanAI / open-source-operator

Create your self-hosted, open-source Operator model.

training-infra agent-evals gui-agent browseruse native-agent-model

Updated Apr 10, 2025
Python

Agent-Pattern-Labs / iso

Isomorphic agent tooling: author once, run anywhere. Build, lint, route, fan out, eval, trace, guard, contract, and ledger AI-agent workflows across Cursor, Claude Code, Codex, and OpenCode.

Updated Jun 5, 2026
TypeScript

bitwise-media-group / evolve

A CLI tool for evaluating coding-agent plugins with a multi-tier approach.

gemini cursor agent-evals claude-code

Updated Jun 23, 2026
Go

shubchat / loab

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.

multi-agent lending ai-safety llm-agents llm-benchmarking agent-evals tool-use-ai

Updated Mar 31, 2026
Python

tenurehq / GroundEval

Deterministic evaluation for LLM agents that reason over state. Scores agents on access, time, causality, and verified absence, not just whether the final answer sounds plausible.

python benchmarking evaluation-framework ai-agents llm-evaluation llm-as-judge llm-as-a-judge agent-evals agent-evaluation agent-eval

Updated Jun 23, 2026
Python

agent-axiom / agent-anvil-leaderboard

Public Agent Anvil leaderboard submissions and generated index

leaderboard openai ai-agents agent-evals

Updated Jun 8, 2026
Python

s1liconcow / repogauge

Build a private evaluation dataset to optimize your organization's token costs.

token-cost agent-evals swe-bench

Updated Apr 26, 2026
Python

Phoenix0531-sudo / BondLens

BondLens AI | Explainable fixed-income (bond) analysis agent: live bond data via AkShare, tool calling, LLM guardrails, red-team evals, and Docker/CI.