agent-eval

Here are 22 public repositories matching this topic...

zozo123 / meta-harness-on-islo

Meta-harness optimization loop wired onto Islo sandboxes. POC: 0/5→5/5 in four proposer steps. Built on islo.dev.

harbor llm-agents agent-eval meta-harness islo harness-optimization

Updated May 5, 2026
HTML

tenurehq / GroundEval

Deterministic evaluation for LLM agents that reason over state. Scores agents on access, time, causality, and verified absence, not just whether the final answer sounds plausible.

python benchmarking evaluation-framework ai-agents llm-evaluation llm-as-judge llm-as-a-judge agent-evals agent-evaluation agent-eval

Updated Jun 23, 2026
Python

linny006 / agent-eval-harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

Updated Jun 23, 2026
Python

0-co / company

AI-operated company. Building agent-friend: universal tool adapter for AI agents. @tool → OpenAI, Claude, Gemini, MCP. Live 24/7 on Twitch.

python twitch structured-logging interactive-cli exponential-backoff human-in-the-loop zero-dependencies open-startup ai-agent autonomous-ai building-in-public llm-tools agent-security mcp-security personal-ai-agent agent-eval agent-friend

Updated Mar 26, 2026
Python

zozo123 / meta-harness-on-islo-page

Project page for Meta-harness on Islo (POC). https://zozo123.github.io/meta-harness-on-islo-page/

project-page agent-eval meta-harness islo

Updated May 5, 2026
JavaScript

arthursoares / openclaw-llm-bench

A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-judge, tier-based leaderboard.

gpt reasoning claude llm-eval ollama llm-as-judge llm-benchmark openclaw agent-eval

Updated Apr 11, 2026
Python

gojiplus / understudy

Scenario Testing for AI Agents

simulation evaluation agentic agent-evaluation google-adk agent-eval

Updated Jun 16, 2026
Python

tushariitr-19 / assay

Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks.

testing cli golang mcp evaluation adk ai-agents llm-eval model-context-protocol agent-eval

Updated Jun 17, 2026
Go

fitchmultz / agent-eval

Transcript-first evaluation tool for comparing coding-agent sessions across Codex, Claude Code, and Pi.

typescript evaluation pi transcripts codex coding-agents claude-code agent-eval

Updated May 29, 2026
TypeScript

jeremylongshore / intent-eval-lab

Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).

mcp skill-discovery opentelemetry ai-evaluation gemini-cli claude-code plugin-testing cross-cli agent-eval invocation-rate

Updated Jun 21, 2026
Python

hamzaplojovic / godel-rwkv

A 96K-parameter RWKV-7 model that detects non-termination (the Gödel sentence analog) zero-shot across SKI combinatory logic, lambda calculus, and Turing machines.

machine-learning mlx llm rwkv claude-code swe-bench agent-eval

Updated Jun 15, 2026
Python

workloftai / auto-rubrics

Auto-generate evaluation rubrics from agent audit-log trajectories (PhoneWorld pattern applied to action logs)

evaluation rubrics llm llm-as-judge agent-eval

Updated Jun 23, 2026
Python

zendodx / evalkit-framework

🚀 An open-source, Java-based framework for automated AI evaluation

java framework java-8 eval eval-framework ai-eval agent-eval

Updated Jun 11, 2026
Java

ttxs69 / awesome-coding-agent-eval

A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents.

benchmark leaderboard evaluation awesome-list codex ai-agent llm aider claude-code coding-agent swe-bench agent-eval ai-coding-agent-benchmark coding-agent-benchmark

Updated Jun 8, 2026

rogerchappel / ledgerpet

Local-first synthetic finance anomaly trainer for agent evals.

cli synthetic-data local-first agent-eval finance-ops

Updated Jun 19, 2026
JavaScript

mizcausevic-dev / agent-eval-arena

Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.

express typescript platform-engineering regression-detection ml-ops ai-platform ai-governance llm-eval agent-eval ci-gate

Updated Jun 22, 2026
TypeScript

Viprasol-Tech / agentcheck

Regression testing for AI agents — snapshot tool-calls, diff in CI, fail on regressions. A GitHub Action. By Viprasol Tech.

testing typescript ci snapshot-testing regression-testing ai-agents github-action llm llmops agent-eval

Updated Jun 7, 2026
TypeScript

hermes-labs-ai / agent-convergence-scorer

agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.

cli benchmark consistency evaluation similarity multi-agent convergence reproducibility agents jaccard divergence llm llm-evaluation ai-reliability eval-harness agent-eval

Updated Jun 7, 2026
Python
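
For illustration only: two of the lexical measures named in the agent-convergence-scorer description, exact-match rate and Jaccard token overlap, might be sketched as below. The function names, the lowercase whitespace tokenization, and the averaging over all pairs of runs are assumptions made for this sketch; they are not taken from the repository's actual API.

```python
from itertools import combinations


def tokenize(text):
    """Split on whitespace and lowercase (an assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap of the token sets of two strings."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty outputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def exact_match_rate(outputs):
    """Fraction of output pairs that are string-identical."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_pairwise_jaccard(outputs):
    """Average Jaccard overlap across all pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs from three runs of the same prompt.
runs = ["the cat sat", "the cat sat", "the dog sat"]
print(exact_match_rate(runs))
print(mean_pairwise_jaccard(runs))
```

Both measures compare surface tokens only, so two paraphrases with the same meaning would still score low, which matches the "lexical, not semantic" caveat in the description.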

pingwest-ai / agent-eval

Open-source evaluation of general-purpose AI agents on real-world tasks: the same prompt for every agent, objectively verifiable outcomes, and fully public scoring rubrics. By PingWest (硅星人).

benchmark evaluation ai-agents llm llm-evaluation deep-research agent-eval

Updated Jun 13, 2026

stevenchouai / agent-scorecard

Trace-first evaluation harness for deciding whether AI agents deserve more tokens, permissions, and trust

python evaluation roi ai-agents proof-chain agent-eval

Updated May 16, 2026
Python
