48 changes: 48 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,53 @@
# Changelog

## 1.4.0 (2026-05-30)

Trust-fix release from a fresh product audit. The headline differentiator —
deterministic trace grading — was scoring 100 on every run because the runner
never emitted the spans the rubrics needed. This release makes the grader
actually grade (validated against a real fast-check run), fixes two silent
data-loss bugs in the runner, and tightens the storefront.

### Fixed

- **Trace grader was vacuous in production.** The runner only emitted
`LLM_REQUEST` spans (plus a legacy top-level `tool_use` path real Claude Code
never produces), so all four trace-grade rubrics fell through to their
trivial-pass branches and scored 100 on every run. A new `TraceTranslator`
walks the nested `tool_use` blocks in `assistant` content, emits
`FILE_EDIT` / read `TOOL_USE` / `SHELL_COMMAND` spans, and correlates Bash
exit codes from `tool_result` events. File paths are relativized against the
workspace so `no_out_of_scope_edits` matches `files_to_examine`.
- **`-j N` was a silent no-op** without `--parallel`. `-j>1` now enables
parallel mode on its own; `--parallel` alone fans out to 4; default stays
sequential (`-j 1`).
- **Parallel-task crashes vanished from results.** A task that raised on the
parallel path was only logged; it is now recorded as a FAIL with a
traceback, and stub/usage errors abort the run like the sequential path.
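
The translation step described in the first fix can be sketched roughly as follows. This is an illustration only, not the `TraceTranslator` shipped in this release. The event shape (an `assistant` event whose `message.content` holds `tool_use` blocks, with `tool_result` blocks keyed by `tool_use_id`), the tool names `Edit` / `Write` / `Read` / `Bash`, the `Span` dataclass, and the `exit_code` field with its `is_error` fallback are all assumptions made for the example.

```python
import posixpath
from dataclasses import dataclass, field

# Assumed tool names; the real adapter may recognise a different set.
EDIT_TOOLS = {"Edit", "Write"}
READ_TOOLS = {"Read"}
SHELL_TOOLS = {"Bash"}


@dataclass
class Span:
    """Hypothetical span record: a kind, the originating tool_use id, attributes."""
    kind: str
    tool_use_id: str
    attrs: dict = field(default_factory=dict)


def _blocks(event: dict) -> list:
    """Return the content blocks of an event, or [] if content is not a list."""
    content = event.get("message", {}).get("content", [])
    return content if isinstance(content, list) else []


def translate(events: list[dict], workspace: str) -> list[Span]:
    """Turn nested tool_use blocks into spans and attach Bash exit codes."""
    spans: list[Span] = []
    shell_by_id: dict[str, Span] = {}
    for event in events:
        for block in _blocks(event):
            if not isinstance(block, dict):
                continue
            kind = block.get("type")
            if event.get("type") == "assistant" and kind == "tool_use":
                name = block.get("name")
                args = block.get("input", {})
                block_id = block.get("id", "")
                if name in EDIT_TOOLS or name in READ_TOOLS:
                    path = args.get("file_path")
                    if not path:
                        continue
                    # Relativize so paths compare against files_to_examine.
                    rel = posixpath.relpath(path, workspace)
                    span_kind = "FILE_EDIT" if name in EDIT_TOOLS else "TOOL_USE"
                    spans.append(Span(span_kind, block_id, {"file.path": rel}))
                elif name in SHELL_TOOLS:
                    span = Span("SHELL_COMMAND", block_id,
                                {"command": args.get("command", "")})
                    spans.append(span)
                    shell_by_id[block_id] = span
            elif kind == "tool_result":
                # Correlate the result back to the shell span that produced it.
                span = shell_by_id.get(block.get("tool_use_id"))
                if span is not None:
                    fallback = 1 if block.get("is_error") else 0
                    span.attrs["exit_code"] = block.get("exit_code", fallback)
    return spans
```

Keeping the shell spans in a dictionary keyed by `tool_use_id` lets a later `tool_result` update the matching span in place, which is one simple way to do the correlation the changelog entry describes.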

### Added

- **Baseline export carries trust columns**: per-run `trace_grade` (null when a
trace has no gradeable spans, so non-streaming tools don't get a fake 100)
and a submission-level `readiness` block + `trace_summary`. The published
`claude-code-custom-1.4.0-fast-check.json` now ships real, discriminating
trace grades (`no_out_of_scope_edits` ranges 17-100 across the 8 tasks).
- **`grade_trace_or_none`** distinguishes a span-less trace from a genuinely
perfect one. `readiness_from_results` is shared by the leaderboard and export.
- **Shell-execution trust boundary documented** in `docs/SECURITY.md`, with
per-task Docker isolation scoped as the next step toward community submissions.
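
The span-less versus perfect distinction can be illustrated with a small sketch. The names `grade_or_none` and `ran_verification_after_change`, the dictionary span shape, and the set of gradeable kinds are assumptions for this example; the real `grade_trace_or_none` may be organised differently.

```python
from typing import Callable, Optional

# Span kinds the rubrics can reason about (assumed set for this sketch).
GRADEABLE_KINDS = {"FILE_EDIT", "TOOL_USE", "SHELL_COMMAND"}


def ran_verification_after_change(spans: list[dict]) -> float:
    """Toy rubric: 100 if a shell command follows the last file edit, else 0."""
    edit_positions = [i for i, s in enumerate(spans) if s["kind"] == "FILE_EDIT"]
    if not edit_positions:
        return 100.0  # nothing was edited, so there is nothing to verify
    after_last_edit = spans[edit_positions[-1] + 1:]
    verified = any(s["kind"] == "SHELL_COMMAND" for s in after_last_edit)
    return 100.0 if verified else 0.0


def grade_or_none(
    spans: list[dict], rubric: Callable[[list[dict]], float]
) -> Optional[float]:
    """Return None when there is nothing to grade instead of a trivial pass."""
    gradeable = [s for s in spans if s.get("kind") in GRADEABLE_KINDS]
    if not gradeable:
        return None
    return rubric(gradeable)
```

Returning `None` rather than a number keeps a trace that only contains `LLM_REQUEST` spans out of any average, so a non-streaming tool is reported as ungraded instead of perfect.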

### Changed

- **Aider is a real adapter** (`is_stub = False`); it gates on the binary being
installed. Aider's `--no-stream` means no trace spans, so its trace columns
report `null`.
- **Runtime dependencies are exact-pinned** for reproducible installs.
- **Trace `file.path` is relativized through symlinked workspaces** (macOS
`/tmp` -> `/private/tmp`), and `no_out_of_scope_edits` honors directory
entries (`tests/`) in `files_to_examine` instead of exact-set membership.
- README refreshed to v1.4.0 (lead, install pin, baseline reference).
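
A rough sketch of the two path behaviours in the third item above, assuming that directory entries in `files_to_examine` are written with a trailing slash and that both sides are resolved with `os.path.realpath` before comparison. The helper names `relativize` and `in_scope` are hypothetical.

```python
import os
from pathlib import Path


def relativize(path: str, workspace: str) -> str:
    """Resolve symlinks on both sides, then express path relative to workspace."""
    real_path = os.path.realpath(path)
    real_workspace = os.path.realpath(workspace)
    return Path(os.path.relpath(real_path, real_workspace)).as_posix()


def in_scope(rel_path: str, files_to_examine: list[str]) -> bool:
    """Exact match for file entries, prefix match for entries ending in '/'."""
    for entry in files_to_examine:
        if entry.endswith("/"):
            if rel_path.startswith(entry):
                return True
        elif rel_path == entry:
            return True
    return False
```

Resolving both the file path and the workspace avoids the macOS case where one side reads `/tmp/...` and the other `/private/tmp/...`, which would otherwise produce a relative path full of `..` segments that never matches an in-scope entry.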

## 1.3.0 (2026-05-24)

Storefront and trust release: fixes credibility leaks in the docs, adds real
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -30,5 +30,5 @@ keywords:
- workflow-evaluation
- shipping-discipline
license: MIT
version: "1.2.0"
date-released: "2026-04-27"
version: "1.4.0"
date-released: "2026-05-30"
35 changes: 19 additions & 16 deletions README.md
@@ -12,7 +12,7 @@
<br/>
<img src="demos/hero.gif" alt="awb run, awb leaderboard --readiness, awb gap output" width="820"/>
<br/>
<sub>v1.2.0: task-set hash, OpenTelemetry-aligned trace artifacts, <code>awb trace grade</code>, and the Production Readiness Score.</sub>
<sub>v1.4.0: trace grading that actually grades, baselines with trust columns, real <code>-j</code> parallelism, and a documented security boundary.</sub>
</div>

---
@@ -29,15 +29,16 @@ Related work measures complementary axes. [HAL](https://arxiv.org/abs/2510.11977

AWB's distinct contribution is twofold: (1) a paired **vanilla-vs-custom** adapter pair that isolates the workflow-configuration delta for the same model, surfaced as a single Workflow Lift score with a sign-test p-value; (2) **deterministic** trace-grading rubrics (read-tests-before-edit, ran-verification-after-change, no-out-of-scope-edits, no-repeated-failing-loop) computed from OpenTelemetry-aligned `.trace.jsonl` artifacts, not LLM judges. See [METHODOLOGY.md#related-work](METHODOLOGY.md#related-work) for citation details.

## What's New in v1.2.0
## What's New in v1.4.0

- **Task-set hash on every result** (`task_set_hash`, SHA-256 over the bundled task YAMLs) so you can prove which exact task set produced a given score.
- **OpenTelemetry-aligned trace artifact**: every benchmark run writes a `.trace.jsonl` file using OTel GenAI semantic conventions (`gen_ai.client.operation`, `gen_ai.tool.use`) plus AWB-specific spans for shell commands, file edits, and test runs.
- **`awb trace grade <run_dir>`** scores four shipping disciplines from the trace: read-tests-before-edit, ran-verification-after-change, no-out-of-scope-edits, no-repeated-failing-command-loop.
- **Production Readiness Score** (`awb leaderboard --readiness`): weighted composite over 7 dimensions, calibrated for shipping safety.
- **Strict result schema v2** (`additionalProperties: false`, required `schema_version`, `task_set_hash`). `awb migrate-results` extends to v1→v2.
- **Task provenance fields**: optional `provenance.{source_pr_url, created_at, last_verified_at}`, `contamination_risk`, `label` (`real_pr` / `synthetic_overlay` / `mutated` / `fresh`) lay the groundwork for fresh-task harvesting in v1.3.
- **Seven P0 trust blockers fixed**: token-budget fields now parse correctly, workflow schema enum matches the 9-adapter registry, setup cache key is order-sensitive, missing security scanners surface a warning instead of silently passing as clean, adapter `on_event` is properly typed as `Callable`, gap analysis classifies `regression_introduced` and `no_edits_made` separately.
- **Trace grading actually grades now.** The runner used to emit only token spans, so all four trace rubrics scored a vacuous 100 on every run. It now translates Claude Code's nested `tool_use` blocks into `FILE_EDIT` / read / `SHELL_COMMAND` spans and correlates Bash exit codes — producing real, discriminating scores (the published baseline's `no_out_of_scope_edits` ranges 17-100 across the 8 tasks).
- **Baselines carry the trust columns**: per-run `trace_grade` and a submission-level `readiness` + `trace_summary` block, so the published baseline showcases both flagship trust features. A span-less trace reports `null`, never a fake 100.
- **`-j N` works on its own.** `-j>1` enables parallel mode (it used to be a silent no-op without `--parallel`); a crashed parallel task is now recorded as a FAIL with a traceback instead of vanishing from results.
- **Aider is a real adapter** (`is_stub = False`), gated on the binary being installed.
- **Exact-pinned runtime dependencies** for reproducible installs, plus invariant guard tests so the README lead, install pin, and baseline reference can't silently drift.
- **Security posture documented** in [docs/SECURITY.md](docs/SECURITY.md): the shell-execution trust boundary and the per-task Docker isolation planned for safe community submissions.

Carried over from v1.2.0-v1.3.0: public GitHub Pages leaderboard, task-set hash on every result, OpenTelemetry-aligned `.trace.jsonl` artifacts, `awb trace grade`, the Production Readiness Score, strict result schema v2, and reliability + provenance hardening.
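
The crash-recording behaviour for parallel tasks can be sketched as below. This is a minimal illustration under stated assumptions, not the runner's code: `run_all`, `UsageError`, and the result dictionary shape are invented for the example, and `asyncio` with a semaphore is only one plausible way to bound concurrency.

```python
import asyncio
import traceback


class UsageError(Exception):
    """Hypothetical stand-in for stub/usage errors that should abort the run."""


async def run_all(tasks: dict, concurrency: int) -> list[dict]:
    """Run task coroutines with bounded concurrency, recording crashes as FAIL."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(task_id, fn) -> dict:
        async with semaphore:
            try:
                return {"task": task_id, "outcome": await fn()}
            except UsageError:
                raise  # configuration problems abort the whole run
            except Exception:
                # A crashed task still produces a result row with its traceback.
                return {
                    "task": task_id,
                    "outcome": "FAIL",
                    "traceback": traceback.format_exc(),
                }

    return await asyncio.gather(*(guarded(t, f) for t, f in tasks.items()))
```

The point of the pattern is that every task yields exactly one record, so the number of rows in the results always matches the number of tasks scheduled.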

## Quick Start

@@ -54,18 +55,18 @@ awb leaderboard --readiness --explain # Production Readiness Sco

### Five-minute reproducible demo

Run this end-to-end against the published v1.2.0 baseline. Should finish in ~15 minutes for ~$4 of API spend and produce a tweetable Workflow Lift number plus a capability profile.
Run this end-to-end against the published v1.4.0 fast-check baseline. Should finish in ~15 minutes for ~$4 of API spend and produce a tweetable Workflow Lift number plus a capability profile.

```bash
pip install awb==1.2.0
pip install awb==1.4.0
awb quickstart # 1. verify environment
awb warmup --use-uv # 2. pre-build templates
awb run --fast-check claude-code-custom # 3. ~15 min, ~$4, real run
awb leaderboard --readiness --explain # 4. composite readiness score
awb trace grade results/runs/<run_id>/ # 5. behavior rubric scores
```

Compare against the published baseline at `results/baselines/claude-code-custom-1.2.0.json`. Same `task_set_hash` means your numbers are directly comparable.
Compare against the published baseline at `results/baselines/claude-code-custom-1.4.0-fast-check.json`. Same `task_set_hash` means your numbers are directly comparable.
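
As an illustration of the comparability check, the sketch below hashes a directory of task YAML files and compares two result records by their `task_set_hash`. The canonicalization used here (sorted relative paths, NUL separators, raw file bytes) is an assumption; AWB's actual hashing scheme may differ, so hashes produced by this sketch should not be expected to match published ones.

```python
import hashlib
from pathlib import Path


def task_set_hash(task_dir: str) -> str:
    """SHA-256 over every *.yaml file under task_dir, in sorted path order."""
    digest = hashlib.sha256()
    for path in sorted(Path(task_dir).rglob("*.yaml")):
        rel = path.relative_to(task_dir).as_posix()
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")  # separator so name/content boundaries are unambiguous
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def comparable(mine: dict, baseline: dict) -> bool:
    """Two results are comparable only if both carry the same task_set_hash."""
    mine_hash = mine.get("task_set_hash")
    return mine_hash is not None and mine_hash == baseline.get("task_set_hash")
```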

**New in v1.1.0:** `awb warmup` caches workspaces for 10-30x faster setup. `--fast-check` gives a quick signal in 15 min for ~$4. `--progressive` stops early on weak tools. `--use-uv` swaps pip for uv. See [Execution Modes](#execution-modes) below.

@@ -83,6 +84,8 @@ Clone repo at pinned SHA

Each task starts from a fresh `git clone` at a pinned commit. Every tool gets the same prompt, the same timeout, and the same verification suite. Results are scored with sigmoid normalization so scores are never negative and never collapse at the boundary.

> **Security:** AWB clones third-party repos and runs their setup/test code plus the AI tool with no sandbox. Treat task sets and their repos as trusted input and run in a disposable environment. See [docs/SECURITY.md](docs/SECURITY.md) for the trust boundary and the planned per-task Docker isolation.

## Scoring System

Seven dimensions, sigmoid-normalized with per-task baselines derived from difficulty:
@@ -518,18 +521,18 @@ See [CHANGELOG.md](CHANGELOG.md) for the full history (v1.0.0, v0.5.x, v0.4.x, v

## Citing AWB

If you use AWB in research, cite the version-specific Zenodo DOI. The concept DOI [`10.5281/zenodo.20361437`](https://doi.org/10.5281/zenodo.20361437) always resolves to the latest release; the version DOI below pins to v1.3.0 specifically. Machine-readable metadata lives in [CITATION.cff](CITATION.cff) and [codemeta.json](codemeta.json); release process is in [docs/zenodo-doi.md](docs/zenodo-doi.md).
If you use AWB in research, cite it via Zenodo. The concept DOI [`10.5281/zenodo.20361437`](https://doi.org/10.5281/zenodo.20361437) always resolves to the latest release; each release also mints a version-specific DOI listed on the Zenodo record. Machine-readable metadata lives in [CITATION.cff](CITATION.cff) and [codemeta.json](codemeta.json); release process is in [docs/zenodo-doi.md](docs/zenodo-doi.md).

```bibtex
@software{puspus_awb_2026,
author = {Puspus, Xavier},
title = {{AWB: AI Workflow Benchmark}},
version = {1.3.0},
version = {1.4.0},
year = {2026},
month = may,
publisher = {Zenodo},
doi = {10.5281/zenodo.20361438},
url = {https://doi.org/10.5281/zenodo.20361438}
doi = {10.5281/zenodo.20361437},
url = {https://doi.org/10.5281/zenodo.20361437}
}
```

2 changes: 1 addition & 1 deletion awb/__init__.py
@@ -1,3 +1,3 @@
"""AI Workflow Benchmark - measure tool+workflow performance."""

__version__ = "1.3.0"
__version__ = "1.4.0"
14 changes: 8 additions & 6 deletions awb/adapters/aider.py
@@ -1,4 +1,4 @@
"""Aider CLI adapter (placeholder)."""
"""Aider CLI adapter."""

from __future__ import annotations

@@ -10,15 +10,17 @@
class AiderAdapter(ToolAdapter):
"""Aider CLI adapter.

Aider has a documented CLI: `aider --message <prompt> --yes --no-stream`.
A full implementation lives in this file (see execute() below); set
is_stub = False once Aider is installed and you have validated the
end-to-end run.
Real implementation over Aider's documented CLI:
`aider --message <prompt> --yes --no-stream`. `check_available()` gates on
the binary, so a missing Aider reports unavailable rather than crashing.
Note: `--no-stream` means Aider emits no incremental tool events, so its
`.trace.jsonl` has no gradeable spans and trace-grade columns show n/a for
Aider runs (see grade_trace_or_none).
"""

name = "aider"
display_name = "Aider"
is_stub = True # flip to False once validated end-to-end on real tasks
is_stub = False

async def execute(
self,
81 changes: 28 additions & 53 deletions awb/commands/leaderboard_cmd.py
@@ -11,12 +11,23 @@

from awb.commands._shared import INFO, MUTED, console, emit_json, score_style

# Named constants for the heuristic mapping from RunResult fields onto the
# 7 readiness dimensions. Extract so they're documented + tunable in one place.
REVIEW_BURDEN_FILES_TO_ZERO = 50.0 # ~50 modified files -> ~0 review-burden score
MAINTAINABILITY_LINT_TO_ZERO = 20.0 # 20+ new lint warnings -> ~0 maintainability
COST_USD_TO_ZERO = 5.0 # $5 per task -> ~0 cost score
SPEED_SECONDS_TO_ZERO = 1800.0 # 30 min -> ~0 speed score
# Heuristic thresholds for mapping RunResult fields onto the 7 readiness
# dimensions. Single source of truth lives in awb.scoring.readiness; re-exported
# here because this is where they're documented + tuned.
from awb.scoring.readiness import ( # noqa: E402
COST_USD_TO_ZERO,
MAINTAINABILITY_LINT_TO_ZERO,
REVIEW_BURDEN_FILES_TO_ZERO,
SPEED_SECONDS_TO_ZERO,
)

__all__ = [
"COST_USD_TO_ZERO",
"MAINTAINABILITY_LINT_TO_ZERO",
"REVIEW_BURDEN_FILES_TO_ZERO",
"SPEED_SECONDS_TO_ZERO",
"leaderboard",
]


@dataclass
@@ -77,7 +88,7 @@ def leaderboard(output_dir: str | None, readiness: bool, explain: bool, fmt: str

def _compute_readiness_scores() -> list[ToolReadiness]:
from awb.core.results import ResultRecorder
from awb.scoring.readiness import compute_readiness_score
from awb.scoring.readiness import readiness_from_results

recorder = ResultRecorder()
runs = recorder.load_all_runs()
@@ -87,55 +98,19 @@ def _compute_readiness_scores() -> list[ToolReadiness]:
by_tool.setdefault(r.tool, []).append(r)
out: list[ToolReadiness] = []
for tool in sorted(by_tool):
results = by_tool[tool]
n = len(results) or 1

def _mean(fn, _results=results, _n=n):
return sum(fn(r) for r in _results) / _n

correctness = 100.0 * sum(1 for r in results if r.outcome.success) / n
regression_safety = 100.0 * sum(1 for r in results if r.quality.test_regressions == 0) / n
security = 100.0 * sum(1 for r in results if r.quality.security_delta <= 0) / n
review_burden = max(
0.0,
100.0 - 100.0 * _mean(lambda r: r.metrics.files_modified) / REVIEW_BURDEN_FILES_TO_ZERO,
)
maintainability = max(
0.0,
100.0
- 100.0
* max(0.0, _mean(lambda r: r.quality.lint_delta))
/ MAINTAINABILITY_LINT_TO_ZERO,
)
cost_score = max(
0.0,
100.0 - 100.0 * _mean(lambda r: r.cost.estimated_cost_usd) / COST_USD_TO_ZERO,
)
speed = max(
0.0,
100.0 - 100.0 * _mean(lambda r: r.metrics.wall_clock_seconds) / SPEED_SECONDS_TO_ZERO,
)
composite = compute_readiness_score(
correctness=correctness,
regression_safety=regression_safety,
security=security,
review_burden=review_burden,
maintainability=maintainability,
cost=cost_score,
speed=speed,
)
d = readiness_from_results(by_tool[tool])
out.append(
ToolReadiness(
tool=tool,
n_results=n,
composite=composite,
correctness=correctness,
regression_safety=regression_safety,
security=security,
review_burden=review_burden,
maintainability=maintainability,
cost=cost_score,
speed=speed,
n_results=d["n_results"] or 1,
composite=d["composite"],
correctness=d["correctness"],
regression_safety=d["regression_safety"],
security=d["security"],
review_burden=d["review_burden"],
maintainability=d["maintainability"],
cost=d["cost"],
speed=d["speed"],
)
)
out.sort(key=lambda s: -s.composite)
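
The inline arithmetic removed from this hunk follows one pattern for four of the dimensions: a mean value is mapped linearly from 100 down to 0 at a named threshold. A compact sketch of that pattern is shown below. The threshold values are copied from the constants in this diff; the helper name `linear_to_zero` is hypothetical, and whether `readiness_from_results` is structured this way is an assumption.

```python
# Thresholds as they appear in the removed constants block.
REVIEW_BURDEN_FILES_TO_ZERO = 50.0   # ~50 modified files -> ~0 review-burden score
MAINTAINABILITY_LINT_TO_ZERO = 20.0  # 20+ new lint warnings -> ~0 maintainability
COST_USD_TO_ZERO = 5.0               # $5 per task -> ~0 cost score
SPEED_SECONDS_TO_ZERO = 1800.0       # 30 min -> ~0 speed score


def linear_to_zero(mean_value: float, to_zero: float) -> float:
    """Map a mean onto 0..100, reaching 0 at to_zero and clamping below it.

    Negative means (for example an improved lint delta) are clamped to 0 first,
    so the score never exceeds 100. The removed code applied that clamp only to
    the lint delta; applying it everywhere is a simplification in this sketch.
    """
    return max(0.0, 100.0 - 100.0 * max(0.0, mean_value) / to_zero)
```

For instance, a mean of 25 modified files gives a review-burden score of 50, and any mean wall-clock time of 30 minutes or more gives a speed score of 0.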
10 changes: 8 additions & 2 deletions awb/commands/run.py
@@ -185,11 +185,17 @@ def _run_both(
@click.option("--capability", "-cap", help="Filter tasks by capability (e.g., security_awareness)")
@click.option("--difficulty", "-d", help="Filter tasks by difficulty (easy, medium, hard)")
@click.option("--runs", "-n", default=3, help="Number of runs per task")
@click.option("--parallel", is_flag=True, help="Run tasks in parallel")
@click.option("--parallel", is_flag=True, help="Run tasks in parallel (fans out to 4 by default)")
@click.option("--dry-run", is_flag=True, help="Validate without executing")
@click.option("--timeout", type=int, help="Override timeout (seconds)")
@click.option("--resume", is_flag=True, help="Skip tasks that already have results")
@click.option("-j", "--concurrency", type=int, default=4, help="Max parallel tasks (default: 4)")
@click.option(
"-j",
"--concurrency",
type=int,
default=1,
help="Max parallel tasks; -j>1 enables parallel mode (default: 1, sequential)",
)
@click.option("--adaptive", is_flag=True, help="Only re-run near-miss tasks on runs 2+")
@click.option("--progressive", is_flag=True, help="Run easy first, stop early if failing")
@click.option("--fast-check", is_flag=True, help="Run 8 representative tasks for quick signal")
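
The flag semantics stated in the changelog (`-j>1` enables parallel mode, `--parallel` alone fans out to 4, default stays sequential) can be sketched as a small pure function. The name `resolve_concurrency` and the constant `DEFAULT_PARALLEL_FANOUT` are hypothetical; the actual resolution logic is not part of the visible hunk.

```python
DEFAULT_PARALLEL_FANOUT = 4  # what --parallel alone fans out to


def resolve_concurrency(parallel: bool, concurrency: int) -> tuple[bool, int]:
    """Return (run_in_parallel, worker_count) from the two CLI options."""
    if concurrency > 1:
        # -j N on its own is enough to enable parallel mode.
        return True, concurrency
    if parallel:
        # --parallel without a -j value above 1 uses the default fan-out.
        return True, DEFAULT_PARALLEL_FANOUT
    return False, 1
```

One consequence of a default of `-j 1` is that `--parallel -j 1` cannot be told apart from `--parallel` alone, so in this sketch both resolve to 4 workers.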