Daily benchmark-eval bugs — 2026-06-25
Suites analyzed
- pinchbench (126 non-pass): Of 149 tasks in this run, 126 did not pass, and the failures are dominated by infrastructure, not ironclaw. 92 tasks scored 0.00 with 'reborn turn timed out', and every one of them logs repeated 'error sending request for url (https://cloud-api.near.ai/v1/chat/completions)' retries. The NEAR AI cloud endpoint serving DeepSeek-V4-Flash was connection-flaky for the whole run, so the heavy log/CSV/meeting tasks accumulated retry latency and exceeded the bench's 600s per-turn timeout (benchmarks/src/reborn_runner.rs:345), which discards the partial work and records an empty 0.00. The next-largest bucket is benign model-quality margin misses (15 'partial' tasks judged 0.85–0.95). A handful of integration tasks fail structurally because the bench environment lacks their declared prerequisites (gh/gws CLIs, Gmail OAuth, image-gen provider); that is bad_eval/seeding, not a model or harness fault. The only genuine ironclaw harness defect surfaced is builtin.http.save failing closed above a 10 MB response cap, but it is low-impact (2 tasks, both of which also timed out). Note: unlike the prior run, the timeouts here are clearly provider-driven (infra), so the prior run's 'no loop wall-clock budget' harness framing does not apply to this run.
ironclaw rev(s): a76ecb5c8b6451d73a08a6412a2f02fbfc60c542
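The provider-driven reading of the timeouts rests on a simple log signal: a task that timed out and also logged repeated request-send errors. A minimal Rust sketch of that rule follows. The function name, the enum, and the retry threshold are hypothetical and exist only for illustration; only the two marker strings come from the run's logs, and this is not how the bench-taxonomy workflow is known to classify failures.

```rust
/// Marker strings quoted in the report; everything else here is hypothetical.
const TIMEOUT_MARKER: &str = "reborn turn timed out";
const SEND_ERROR_MARKER: &str = "error sending request for url";

/// Hypothetical label for why a timed-out task ran out of time.
#[derive(Debug, PartialEq)]
pub enum TimeoutCause {
    /// The log shows repeated provider request failures before the timeout.
    Provider,
    /// The task timed out without enough send errors to blame the provider.
    Unexplained,
}

/// Returns `None` when the log has no timeout marker at all.
/// `min_send_errors` is an assumed threshold for "repeated" send errors.
pub fn classify_timeout(log: &str, min_send_errors: usize) -> Option<TimeoutCause> {
    if !log.contains(TIMEOUT_MARKER) {
        return None;
    }
    // Count log lines that mention a failed request to the provider.
    let send_errors = log
        .lines()
        .filter(|line| line.contains(SEND_ERROR_MARKER))
        .count();
    Some(if send_errors >= min_send_errors {
        TimeoutCause::Provider
    } else {
        TimeoutCause::Unexplained
    })
}
```

With a threshold of 2, a log that contains two send-error lines followed by the timeout marker is labelled `Provider`, while a log with only the timeout marker is labelled `Unexplained` and would still need manual root-causing.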
🧪 Benchmark bugs (bad eval / label / seeding)
pinchbench · Integration tasks require un-provisioned prerequisites — 6 tasks
Either provision the declared prerequisites in the bench runner (auth profiles + gh/gws/@juppytt/fws install + network allowlist for the required hosts) before running these tasks, or filter them out of the scored set / mark them skipped when prerequisites are absent, so they don't count as model/harness failures. This mirrors the documented limitation in suites' pinchbench notes.
task/email/triage · task/image/gen · task/gh/issue/triage · task/gws/task/management · task/workflow · task/daily/summary
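The "mark them skipped when prerequisites are absent" option can be sketched as a gate that runs before each task. This is a minimal Rust illustration under stated assumptions: `TaskSpec`, `Gate`, `gate`, and the prerequisite labels are hypothetical names, not pinchbench or ironclaw types, and the mapping from task to prerequisite is assumed rather than taken from the suite's manifests.

```rust
use std::collections::HashSet;

/// Hypothetical pre-run decision for a single task.
#[derive(Debug, PartialEq)]
pub enum Gate {
    /// All declared prerequisites are provisioned; run and score the task.
    Run,
    /// At least one prerequisite is absent; report the task as skipped.
    Skip { missing: Vec<String> },
}

/// Hypothetical task description: an id plus the prerequisites it declares.
pub struct TaskSpec<'a> {
    pub id: &'a str,
    pub requires: &'a [&'a str],
}

/// Compares a task's declared prerequisites against what the runner provisioned.
pub fn gate(task: &TaskSpec, provisioned: &HashSet<&str>) -> Gate {
    // Collect every declared prerequisite that the environment lacks.
    let missing: Vec<String> = task
        .requires
        .iter()
        .filter(|req| !provisioned.contains(**req))
        .map(|req| req.to_string())
        .collect();
    if missing.is_empty() {
        Gate::Run
    } else {
        Gate::Skip { missing }
    }
}
```

A skipped task would then be left out of the scored set instead of being recorded as 0.00, which keeps the non-pass count focused on model and harness behaviour.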
Generated by the bench-taxonomy workflow. Categories are dynamic — discovered from each run's failures and root-caused via codegraph/source.