Epic: auto-dig-into-benchmarks → suggest harness improvements (analyzer rollout) #99

Description

@pranavraja99

Goal

An automatic system that, after a benchmark run, digs into the failures and suggests harness (ironclaw) improvements — automating the manual loop of reading a failing task's trajectory + ironclaw debug logs, root-causing via codegraph, and proposing a minimal fix as a draft PR.

Where it stands

The core analyzer landed in #98 (nearai-bench analyze <run-uuid>):

  • Phase 0 (done): select failures → cluster by debug-log signature (dispatch_failure_kind=…, ScopeMismatch, HostUnavailable{stage:…}) → budgeted, injection-delimited context bundle. Validated: clustering reproduced the real root-cause decomposition of a known-failing pinchbench run.
  • Phase 1 (done): headless Claude Code driver (--apply-fixes) — runs claude -p with codegraph over ironclaw, a narrow allow-list + the analyze-failure SKILL, opens ≤1 minimal draft PR per cluster, writes analysis.json. Validated live: the guardrail correctly declined to open a PR on a non-localizable quality cluster.
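
For readers who have not seen #98, Phase 0 can be pictured roughly as below. This is a minimal sketch, not the analyzer's actual code: the function names (`signature`, `cluster_failures`, `build_bundle`), the regexes, the `<untrusted-log>` delimiter, and the character-based budget are all assumptions made for illustration. Only the three signature families come from the description above.

```python
import re
from collections import defaultdict

# Hypothetical patterns modelled on the signature examples named in this issue.
SIGNATURE_PATTERNS = [
    ("dispatch_failure_kind", re.compile(r"dispatch_failure_kind=(\w+)")),
    ("ScopeMismatch", re.compile(r"\bScopeMismatch\b")),
    ("HostUnavailable", re.compile(r"HostUnavailable\s*\{\s*stage:\s*(\w+)")),
]

DELIM_CLOSE = "</untrusted-log>"  # assumed delimiter, not the analyzer's real one


def signature(debug_log: str) -> str:
    """Return a cluster key for the first pattern (in list order) found in the log."""
    for name, pattern in SIGNATURE_PATTERNS:
        match = pattern.search(debug_log)
        if match:
            detail = match.group(1) if pattern.groups else ""
            return f"{name}:{detail}" if detail else name
    return "unclassified"


def cluster_failures(logs_by_task: dict) -> dict:
    """Group failing task ids by the signature of their debug log."""
    clusters = defaultdict(list)
    for task_id, log in sorted(logs_by_task.items()):
        clusters[signature(log)].append(task_id)
    return dict(clusters)


def build_bundle(sig: str, task_ids: list, logs_by_task: dict,
                 budget_chars: int = 4000) -> str:
    """Wrap log excerpts in explicit delimiters, stopping once the budget is spent."""
    parts = [f"Cluster signature: {sig}"]
    remaining = budget_chars
    for task_id in task_ids:
        if remaining <= 0:
            break
        # Strip any embedded closing delimiter so log text cannot end the block early.
        cleaned = logs_by_task[task_id].replace(DELIM_CLOSE, "[delimiter removed]")
        excerpt = cleaned[:remaining]
        remaining -= len(excerpt)
        parts.append(f"<untrusted-log task={task_id!r}>\n{excerpt}\n{DELIM_CLOSE}")
    return "\n".join(parts)
```

The delimiters mark log text as data rather than instructions for the downstream model, and the budget caps how much of each cluster reaches the prompt. A real implementation would likely budget in tokens and pick excerpts around the matching line rather than truncating from the start.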

The worked examples it automates are harness fixes that a human produced by hand this session: ironclaw #4790 (multi-tool-call), #4827 (http empty-body), #4837 (final-answer nudge).

Remaining phases (child issues)

Suggested order: #103 (unblocks CI fixes) → #100 → #102 → #104 → #101 → #105.

Plan reference: the design and phased rollout are documented in the analyzer PR (#98).