Daily benchmark-eval bugs — 2026-06-25 #182

# Daily benchmark-eval bugs — 2026-06-25

## Suites analyzed
- [pinchbench (126 non-pass)](https://nearai.github.io/benchmarks/#/runs/ironclaw-reborn/pinchbench/8e3ca3d1-b39a-4f37-bee8-8bef6250bcd1): Of 149 tasks, 126 did not pass, and the failures are dominated by infrastructure rather than ironclaw.
  - Provider timeouts (92 tasks): each scored 0.00 with 'reborn turn timed out', and every one logs repeated `error sending request for url (https://cloud-api.near.ai/v1/chat/completions)` retries. The NEAR AI cloud endpoint serving DeepSeek-V4-Flash was connection-flaky for the whole run, so the heavy log/CSV/meeting tasks accumulated retry latency until they exceeded the bench's 600s per-turn timeout (benchmarks/src/reborn_runner.rs:345). On timeout the bench discards the partial work and records an empty 0.00.
  - Model-quality margin misses (15 tasks): benign 'partial' results, judged 0.85–0.95.
  - Missing prerequisites (a handful of integration tasks): these fail structurally because the bench environment lacks their declared prerequisites (gh/gws CLIs, Gmail OAuth, image-gen provider). That is bad_eval/seeding, not a model or harness fault.
  - Harness defect (2 tasks): the only genuine ironclaw harness defect surfaced is builtin.http.save fail-closing when a response exceeds its 10 MB cap. It is low-impact, since both affected tasks also timed out.
  - Note: unlike the prior run, the timeouts here are clearly provider-driven (infra), so the prior run's 'no loop wall-clock budget' harness framing does not apply to this run.

_ironclaw rev(s): a76ecb5c8b6451d73a08a6412a2f02fbfc60c542_
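
The infra-versus-harness call above rests on two log signals: the 'reborn turn timed out' failure reason and the repeated `error sending request for url` retries. A minimal sketch of that triage rule follows, assuming it can be reduced to substring checks. The `Bucket` enum, the `classify` function, and the "at least two transport errors" threshold are hypothetical illustrations, not the bench-taxonomy workflow's actual logic.

```rust
/// Hypothetical triage buckets; the names are illustrative only and do not
/// come from the bench-taxonomy workflow.
#[derive(Debug, PartialEq, Eq)]
pub enum Bucket {
    /// Timed out while the provider endpoint was failing requests.
    Infra,
    /// Timed out with no sign of transport trouble; look at the harness.
    HarnessTimeout,
    /// Not a timeout at all (partial score, missing prerequisite, ...).
    Other,
}

// Both marker strings are quoted from the run summary above.
const TIMEOUT_MARKER: &str = "reborn turn timed out";
const TRANSPORT_MARKER: &str = "error sending request for url";

/// Classify one failed task from its failure reason and raw log text.
/// Assumption: "repeated" retries means two or more matching log lines.
pub fn classify(failure_reason: &str, log: &str) -> Bucket {
    if !failure_reason.contains(TIMEOUT_MARKER) {
        return Bucket::Other;
    }
    let transport_errors = log
        .lines()
        .filter(|line| line.contains(TRANSPORT_MARKER))
        .count();
    if transport_errors >= 2 {
        Bucket::Infra
    } else {
        Bucket::HarnessTimeout
    }
}
```

Under this rule, a timed-out task whose log shows repeated transport errors lands in `Bucket::Infra`, which matches how the 92 zero-score tasks are attributed in this run.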

## 🧪 Benchmark bugs (bad eval / label / seeding)

### pinchbench · Integration tasks require un-provisioned prerequisites — 6 tasks
Either provision the declared prerequisites in the bench runner before running these tasks (auth profiles, `gh`/`gws`/`@juppytt/fws` installs, and a network allowlist for the required hosts), or exclude the tasks from the scored set (or mark them skipped) when prerequisites are absent, so that they do not count as model or harness failures. This mirrors the limitation already documented in the suites' pinchbench notes.

[task/email/triage](https://nearai.github.io/benchmarks/#/runs/ironclaw-reborn/pinchbench/8e3ca3d1-b39a-4f37-bee8-8bef6250bcd1/fail/task_email_triage) · [task/image/gen](https://nearai.github.io/benchmarks/#/runs/ironclaw-reborn/pinchbench/8e3ca3d1-b39a-4f37-bee8-8bef6250bcd1/fail/task_image_gen) · [task/gh/issue/triage](https://nearai.github.io/benchmarks/#/runs/ironclaw-reborn/pinchbench/8e3ca3d1-b39a-4f37-bee8-8bef6250bcd1/fail/task_gh_issue_triage) · [task/gws/task/management](https://nearai.github.io/benchmarks/#/runs/ironclaw-reborn/pinchbench/8e3ca3d1-b39a-4f37-bee8-8bef6250bcd1/fail/task_gws_task_management) · [task/workflow](https://nearai.github.io/benchmarks/#/runs/ironclaw-reborn/pinchbench/8e3ca3d1-b39a-4f37-bee8-8bef6250bcd1/fail/task_workflow) · [task/daily/summary](https://nearai.github.io/benchmarks/#/runs/ironclaw-reborn/pinchbench/8e3ca3d1-b39a-4f37-bee8-8bef6250bcd1/fail/task_daily_summary)
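
The "mark them skipped" option could be sketched as a gate that runs before each task. This is only an illustration under stated assumptions: the `Disposition` enum, the `gate` function, and the prerequisite labels such as `"gmail-oauth"` are hypothetical, and how pinchbench actually declares prerequisites is not shown in this report.

```rust
use std::collections::BTreeSet;

/// Hypothetical outcome of the pre-run gate.
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    /// Every declared prerequisite is provisioned; score the task normally.
    Run,
    /// At least one prerequisite is absent; record a skip, not a failure.
    Skipped { missing: Vec<String> },
}

/// Compare a task's declared prerequisites against what the bench
/// environment provides. Missing items are returned in declaration order.
pub fn gate(declared: &[&str], provisioned: &BTreeSet<&str>) -> Disposition {
    let missing: Vec<String> = declared
        .iter()
        .filter(|name| !provisioned.contains(*name))
        .map(|name| (*name).to_string())
        .collect();
    if missing.is_empty() {
        Disposition::Run
    } else {
        Disposition::Skipped { missing }
    }
}
```

A task that declares `gh` and `gmail-oauth` in an environment that only provisions `gh` would come back as `Skipped` with `gmail-oauth` listed as missing, so the report can separate it from genuine model or harness failures.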

---
_Generated by the bench-taxonomy workflow. Categories are dynamic — discovered from each run's failures and root-caused via codegraph/source._