Releases · xmpuspus/ai-workflow-benchmark · GitHub

30 May 12:50

xmpuspus

v1.4.0 — real trace grading Latest

Latest

Audit-driven trust-fix release.

Fixed

Trace grader was vacuous: the runner only emitted token spans, so all 4 rubrics scored 100 on every run. It now translates Claude Code tool events into FILE_EDIT/read/SHELL_COMMAND spans (correlating Bash exit codes), with path relativization through symlinked workspaces. Validated against a real fast-check run.
-j N was a silent no-op without --parallel; now enables parallel mode, and crashed parallel tasks are recorded as FAILs instead of vanishing.
no_out_of_scope_edits honors tests/ directory entries in files_to_examine.

Added

Baselines carry per-run trace_grade + submission readiness/trace_summary (null for span-less traces, no fake 100s). Published claude-code-custom-1.4.0-fast-check.json ships real discriminating grades.
docs/SECURITY.md: shell-exec trust boundary + scoped per-task Docker isolation.

Changed

Aider is a real adapter; runtime deps exact-pinned; invariant guard tests for storefront drift.

274 tests, ruff clean.

Assets 2

24 May 02:53

xmpuspus

v1.3.0 — Storefront and Trust

Highlights (full entry in CHANGELOG.md)

Five-phase storefront + trust release.

Added

`--format json` on `awb gap`, `awb compare`, `awb stability`, `awb leaderboard --readiness`
`awb leaderboard --readiness --explain` Rich Panel with 7 sub-scores per tool
`awb validate` one-line summary (`-v` for per-file)
`is_stub` adapter attribute + stub fail-fast at startup
Real `aider` adapter implementation (still `is_stub = True` pending validation)
`RunEnvironment` provenance: `python_version`, `awb_version`, `adapter_version`, `pip_freeze_hash`
`RunError` typed exception capture in `RunOutcome.error`
All 100 task YAMLs backfilled with `provenance`, `contamination_risk`, `label`
Visual contract in `awb/commands/_shared.py`
`CITATION.cff` + `codemeta.json`
`METHODOLOGY.md` Related Work (HAL, SWE-bench, SWE-bench Pro, LiveCodeBench, METR, Aider Polyglot, Cohen)
GitHub Pages leaderboard workflow

Fixed

JSONL append race under `--parallel` (`fcntl.LOCK_EX` + regression test)
Git operations had no timeout (now `asyncio.wait_for` + per-op timeout kwarg)
`pi` adapter blocked the event loop (now `asyncio.to_thread`)
Doc count drift across README/METHODOLOGY/ADAPTER/ARCHITECTURE
Misquoted Stack Overflow stat + broken SWE-bench Pro link

Changed

Capability profile renders Unicode block bars (█░) plus `(n, conf=...)` annotation
`awb compare` collapsed from 9 columns to 6 with delta colors
README hero hook tightened; em-dash cadence trimmed
Softened "first to" framing; concedes HAL and Artificial Analysis as adjacent prior art

Removed

29 hand-painted Pillow-synthesized demo GIFs (3.2 MB) replaced by one real `vhs`-recorded `demos/hero.gif` (297 KB)

Tests

246 passing (was 240 in v1.2)

Assets 2