Releases: xmpuspus/ai-workflow-benchmark
Releases · xmpuspus/ai-workflow-benchmark
v1.4.0 — real trace grading
Audit-driven trust-fix release.
Fixed
- Trace grader was vacuous: the runner only emitted token spans, so all 4 rubrics scored 100 on every run. It now translates Claude Code tool events into FILE_EDIT/read/SHELL_COMMAND spans (correlating Bash exit codes), with path relativization through symlinked workspaces. Validated against a real fast-check run.
-j Nwas a silent no-op without--parallel; now enables parallel mode, and crashed parallel tasks are recorded as FAILs instead of vanishing.no_out_of_scope_editshonorstests/directory entries infiles_to_examine.
Added
- Baselines carry per-run
trace_grade+ submissionreadiness/trace_summary(null for span-less traces, no fake 100s). Publishedclaude-code-custom-1.4.0-fast-check.jsonships real discriminating grades. docs/SECURITY.md: shell-exec trust boundary + scoped per-task Docker isolation.
Changed
- Aider is a real adapter; runtime deps exact-pinned; invariant guard tests for storefront drift.
274 tests, ruff clean.
v1.3.0 — Storefront and Trust
Highlights (full entry in CHANGELOG.md)
Five-phase storefront + trust release.
Added
- `--format json` on `awb gap`, `awb compare`, `awb stability`, `awb leaderboard --readiness`
- `awb leaderboard --readiness --explain` Rich Panel with 7 sub-scores per tool
- `awb validate` one-line summary (`-v` for per-file)
- `is_stub` adapter attribute + stub fail-fast at startup
- Real `aider` adapter implementation (still `is_stub = True` pending validation)
- `RunEnvironment` provenance: `python_version`, `awb_version`, `adapter_version`, `pip_freeze_hash`
- `RunError` typed exception capture in `RunOutcome.error`
- All 100 task YAMLs backfilled with `provenance`, `contamination_risk`, `label`
- Visual contract in `awb/commands/_shared.py`
- `CITATION.cff` + `codemeta.json`
- `METHODOLOGY.md` Related Work (HAL, SWE-bench, SWE-bench Pro, LiveCodeBench, METR, Aider Polyglot, Cohen)
- GitHub Pages leaderboard workflow
Fixed
- JSONL append race under `--parallel` (`fcntl.LOCK_EX` + regression test)
- Git operations had no timeout (now `asyncio.wait_for` + per-op timeout kwarg)
- `pi` adapter blocked the event loop (now `asyncio.to_thread`)
- Doc count drift across README/METHODOLOGY/ADAPTER/ARCHITECTURE
- Misquoted Stack Overflow stat + broken SWE-bench Pro link
Changed
- Capability profile renders Unicode block bars (█░) plus `(n, conf=...)` annotation
- `awb compare` collapsed from 9 columns to 6 with delta colors
- README hero hook tightened; em-dash cadence trimmed
- Softened "first to" framing; concedes HAL and Artificial Analysis as adjacent prior art
Removed
- 29 hand-painted Pillow-synthesized demo GIFs (3.2 MB) replaced by one real `vhs`-recorded `demos/hero.gif` (297 KB)
Tests
- 246 passing (was 240 in v1.2)