Epic: auto-dig-into-benchmarks → suggest harness improvements (analyzer rollout) #99

## Goal
An automatic system that, after a benchmark run, **digs into the failures and suggests harness (ironclaw) improvements** — automating the manual loop of reading a failing task's trajectory + ironclaw debug logs, root-causing via codegraph, and proposing a minimal fix as a draft PR.

## Where it stands
The core analyzer landed in **#98** (`nearai-bench analyze <run-uuid>`):
- **Phase 0 (done):** select failures → cluster by debug-log signature (`dispatch_failure_kind=…`, `ScopeMismatch`, `HostUnavailable{stage:…}`) → budgeted, injection-delimited context bundle. Validated: clustering reproduced the real root-cause decomposition of a known-failing pinchbench run.
- **Phase 1 (done):** headless Claude Code driver (`--apply-fixes`) — runs `claude -p` with codegraph over ironclaw, a narrow allow-list + the `analyze-failure` SKILL, opens ≤1 minimal draft PR per cluster, writes `analysis.json`. Validated live: the guardrail correctly declined to open a PR on a non-localizable quality cluster.
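
As an illustration of the Phase 0 steps above, here is a minimal Python sketch of signature clustering followed by a budgeted, delimited context bundle. It is not the code from #98: the regexes, the `unclassified` fallback, the `<untrusted_log>` delimiter, and the function names `signature`, `cluster_failures`, and `build_bundle` are all assumptions made for this example, modeled only on the signature shapes named in the Phase 0 bullet.

```python
import re
from collections import defaultdict

# Hypothetical patterns, modeled on the signature examples in Phase 0.
# The real analyzer's patterns may differ.
PATTERNS = [
    ("dispatch_failure_kind", re.compile(r"dispatch_failure_kind=(\w+)")),
    ("HostUnavailable", re.compile(r"HostUnavailable\s*\{\s*stage:\s*(\w+)")),
    ("ScopeMismatch", re.compile(r"\bScopeMismatch\b")),
]

CLOSE_TAG = "</untrusted_log>"


def signature(log_text):
    """Return a normalized signature for one failing task's debug log."""
    for name, pattern in PATTERNS:
        match = pattern.search(log_text)
        if match:
            # Patterns with a capture group contribute a detail suffix.
            detail = match.group(1) if match.groups() else ""
            return f"{name}:{detail}" if detail else name
    return "unclassified"


def cluster_failures(failures):
    """Group task ids by signature. `failures` maps task id -> log text."""
    clusters = defaultdict(list)
    for task_id, log_text in failures.items():
        clusters[signature(log_text)].append(task_id)
    return dict(clusters)


def build_bundle(sig, logs, budget_chars=2000):
    """Build a context bundle of at most `budget_chars` characters.

    Each log excerpt is wrapped in delimiters so a downstream prompt can
    treat it as untrusted data rather than instructions. Assumes the
    header itself fits within the budget.
    """
    header = f"Cluster signature: {sig}\n"
    parts = [header]
    remaining = budget_chars - len(header)
    for task_id, text in logs:
        open_tag = f"<untrusted_log task={task_id!r}>\n"
        close_tag = f"\n{CLOSE_TAG}\n"
        room = remaining - len(open_tag) - len(close_tag)
        if room <= 0:
            break
        # Strip any embedded closing delimiter so a log cannot end its own block.
        excerpt = text.replace(CLOSE_TAG, "")[:room]
        parts.append(open_tag + excerpt + close_tag)
        remaining -= len(open_tag) + len(excerpt) + len(close_tag)
    return "".join(parts)
```

In this sketch the budget is counted in characters for simplicity; a token-based budget would follow the same shape. Keeping the signature deterministic (regex first, fallback bucket last) makes cluster membership reproducible across runs, which is what allows a clustering result to be compared against a known root-cause decomposition.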

The worked examples it automates (the harness fixes a human produced this session): ironclaw #4790 (multi-tool-call), #4827 (http empty-body), #4837 (final-answer nudge).

## Remaining phases (child issues)
- [ ] #100 — CI trigger: auto-run on `/benchmark` runs with regressions
- [ ] #101 — CI trigger: nightly run on the baseline
- [ ] #102 — `/benchmark` self-validation loop on fix PRs
- [ ] #103 — Provision codegraph index + ANTHROPIC_API_KEY in CI (infra prerequisite)
- [ ] #104 — Surface `analysis.json` in the PR comment + viewer
- [ ] #105 — Hardening: cost caps, prompt-injection bounds, optional `--llm-cluster`

Suggested order: #103 (unblocks CI fixes) → #100 → #102 → #104 → #101 → #105.

Plan reference: the design and phased rollout are documented in the analyzer PR (#98).
