Goal
An automatic system that, after a benchmark run, digs into the failures and suggests harness (ironclaw) improvements — automating the manual loop of reading a failing task's trajectory + ironclaw debug logs, root-causing via codegraph, and proposing a minimal fix as a draft PR.
Where it stands
The core analyzer landed in #98 (nearai-bench analyze <run-uuid>):
- Phase 0 (done): select failures → cluster by debug-log signature (
dispatch_failure_kind=…, ScopeMismatch, HostUnavailable{stage:…}) → budgeted, injection-delimited context bundle. Validated: clustering reproduced the real root-cause decomposition of a known-failing pinchbench run.
- Phase 1 (done): headless Claude Code driver (
--apply-fixes) — runs claude -p with codegraph over ironclaw, a narrow allow-list + the analyze-failure SKILL, opens ≤1 minimal draft PR per cluster, writes analysis.json. Validated live: the guardrail correctly declined to open a PR on a non-localizable quality cluster.
The worked examples it automates (the harness fixes a human produced this session): ironclaw #4790 (multi-tool-call), #4827 (http empty-body), #4837 (final-answer nudge).
Remaining phases (child issues)
Suggested order: #103 (unblocks CI fixes) → #100 → #102 → #104 → #101 → #105.
Plan reference: the design + phased rollout is documented in the analyzer PR (#98).
Goal
An automatic system that, after a benchmark run, digs into the failures and suggests harness (ironclaw) improvements — automating the manual loop of reading a failing task's trajectory + ironclaw debug logs, root-causing via codegraph, and proposing a minimal fix as a draft PR.
Where it stands
The core analyzer landed in #98 (
nearai-bench analyze <run-uuid>):dispatch_failure_kind=…,ScopeMismatch,HostUnavailable{stage:…}) → budgeted, injection-delimited context bundle. Validated: clustering reproduced the real root-cause decomposition of a known-failing pinchbench run.--apply-fixes) — runsclaude -pwith codegraph over ironclaw, a narrow allow-list + theanalyze-failureSKILL, opens ≤1 minimal draft PR per cluster, writesanalysis.json. Validated live: the guardrail correctly declined to open a PR on a non-localizable quality cluster.The worked examples it automates (the harness fixes a human produced this session): ironclaw #4790 (multi-tool-call), #4827 (http empty-body), #4837 (final-answer nudge).
Remaining phases (child issues)
/benchmarkruns with regressions/benchmarkself-validation loop on fix PRsanalysis.jsonin the PR comment + viewer--llm-clusterSuggested order: #103 (unblocks CI fixes) → #100 → #102 → #104 → #101 → #105.
Plan reference: the design + phased rollout is documented in the analyzer PR (#98).