Meta-harness optimization loop wired onto Islo sandboxes. POC: 0/5→5/5 in four proposer steps. Built on islo.dev.
Updated May 5, 2026 - HTML
Deterministic evaluation for LLM agents that reason over state. Scores agents on access, time, causality, and verified absence, not just whether the final answer sounds plausible.
Live, open-source benchmark for comparing AI coding agents on real GitHub issues
AI-operated company. Building agent-friend: universal tool adapter for AI agents. @tool → OpenAI, Claude, Gemini, MCP. Live 24/7 on Twitch.
Project page for Meta-harness on Islo (POC). https://zozo123.github.io/meta-harness-on-islo-page/
A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-judge, tier-based leaderboard.
Scenario Testing for AI Agents
Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks.
Transcript-first evaluation tool for comparing coding-agent sessions across Codex, Claude Code, and Pi.
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).
96K param RWKV-7 that detects non-termination (the Gödel sentence analog) zero-shot across SKI combinatory logic, lambda calculus, and Turing machines
Auto-generate evaluation rubrics from agent audit-log trajectories (PhoneWorld pattern applied to action logs)
🚀 An open-source, Java-based AI automation evaluation framework
A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents.
Local-first synthetic finance anomaly trainer for agent evals.
Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.
Regression testing for AI agents — snapshot tool-calls, diff in CI, fail on regressions. A GitHub Action. By Viprasol Tech.
agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.
Open-source evaluation of general-purpose AI agents on real-world tasks with verifiable outcomes: same prompt for every agent, objective result checking, and fully public scoring rubrics. By PingWest / 硅星人.
Trace-first evaluation harness for deciding whether AI agents deserve more tokens, permissions, and trust
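The lexical metrics named in the agent-convergence-scorer entry above (exact-match rate and Jaccard token overlap) can be illustrated with a minimal sketch. This is not that project's API: the function names are hypothetical, lowercase whitespace tokenization is an assumption, and computing both metrics over all pairs of runs is one plausible reading of the description.

```python
from itertools import combinations


def tokens(text: str) -> set[str]:
    # Assumption: lowercase whitespace tokenization; a real tool may differ.
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    # Jaccard overlap = |intersection| / |union| of the two token sets.
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty outputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def exact_match_rate(runs: list[str]) -> float:
    # Fraction of run pairs whose outputs are string-identical (hypothetical definition).
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_jaccard(runs: list[str]) -> float:
    # Average pairwise Jaccard overlap across all runs.
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Both scores are purely lexical, so two outputs that say the same thing in different words score low. That matches the "lexical, not semantic" caveat in the entry.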