Skip to content

Releases: xmpuspus/ai-workflow-benchmark

v1.4.0 — real trace grading

30 May 12:50
49fe01c

Choose a tag to compare

Audit-driven trust-fix release.

Fixed

  • Trace grader was vacuous: the runner only emitted token spans, so all 4 rubrics scored 100 on every run. It now translates Claude Code tool events into FILE_EDIT/read/SHELL_COMMAND spans (correlating Bash exit codes), with path relativization through symlinked workspaces. Validated against a real fast-check run.
  • -j N was a silent no-op without --parallel; now enables parallel mode, and crashed parallel tasks are recorded as FAILs instead of vanishing.
  • no_out_of_scope_edits honors tests/ directory entries in files_to_examine.

Added

  • Baselines carry per-run trace_grade + submission readiness/trace_summary (null for span-less traces, no fake 100s). Published claude-code-custom-1.4.0-fast-check.json ships real discriminating grades.
  • docs/SECURITY.md: shell-exec trust boundary + scoped per-task Docker isolation.

Changed

  • Aider is a real adapter; runtime deps exact-pinned; invariant guard tests for storefront drift.

274 tests, ruff clean.

v1.3.0 — Storefront and Trust

24 May 02:53

Choose a tag to compare

Highlights (full entry in CHANGELOG.md)

Five-phase storefront + trust release.

Added

  • `--format json` on `awb gap`, `awb compare`, `awb stability`, `awb leaderboard --readiness`
  • `awb leaderboard --readiness --explain` Rich Panel with 7 sub-scores per tool
  • `awb validate` one-line summary (`-v` for per-file)
  • `is_stub` adapter attribute + stub fail-fast at startup
  • Real `aider` adapter implementation (still `is_stub = True` pending validation)
  • `RunEnvironment` provenance: `python_version`, `awb_version`, `adapter_version`, `pip_freeze_hash`
  • `RunError` typed exception capture in `RunOutcome.error`
  • All 100 task YAMLs backfilled with `provenance`, `contamination_risk`, `label`
  • Visual contract in `awb/commands/_shared.py`
  • `CITATION.cff` + `codemeta.json`
  • `METHODOLOGY.md` Related Work (HAL, SWE-bench, SWE-bench Pro, LiveCodeBench, METR, Aider Polyglot, Cohen)
  • GitHub Pages leaderboard workflow

Fixed

  • JSONL append race under `--parallel` (`fcntl.LOCK_EX` + regression test)
  • Git operations had no timeout (now `asyncio.wait_for` + per-op timeout kwarg)
  • `pi` adapter blocked the event loop (now `asyncio.to_thread`)
  • Doc count drift across README/METHODOLOGY/ADAPTER/ARCHITECTURE
  • Misquoted Stack Overflow stat + broken SWE-bench Pro link

Changed

  • Capability profile renders Unicode block bars (█░) plus `(n, conf=...)` annotation
  • `awb compare` collapsed from 9 columns to 6 with delta colors
  • README hero hook tightened; em-dash cadence trimmed
  • Softened "first to" framing; concedes HAL and Artificial Analysis as adjacent prior art

Removed

  • 29 hand-painted Pillow-synthesized demo GIFs (3.2 MB) replaced by one real `vhs`-recorded `demos/hero.gif` (297 KB)

Tests

  • 246 passing (was 240 in v1.2)