xmpuspus · May 30, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,53 @@
 # Changelog
 
+## 1.4.0 (2026-05-30)
+
+Trust-fix release from a fresh product audit. The headline differentiator —
+deterministic trace grading — was scoring 100 on every run because the runner
+never emitted the spans the rubrics needed. This release makes the grader
+actually grade (validated against a real fast-check run), fixes two silent
+data-loss bugs in the runner, and tightens the storefront.
+
+### Fixed
+
+- **Trace grader was vacuous in production.** The runner only emitted
+  `LLM_REQUEST` spans (plus a legacy top-level `tool_use` path real Claude Code
+  never produces), so all four trace-grade rubrics fell through to their
+  trivial-pass branches and scored 100 on every run. A new `TraceTranslator`
+  walks the nested `tool_use` blocks in `assistant` content, emits
+  `FILE_EDIT` / read `TOOL_USE` / `SHELL_COMMAND` spans, and correlates Bash
+  exit codes from `tool_result` events. File paths are relativized against the
+  workspace so `no_out_of_scope_edits` matches `files_to_examine`.
+- **`-j N` was a silent no-op** without `--parallel`. `-j>1` now enables
+  parallel mode on its own; `--parallel` alone fans out to 4; default stays
+  sequential (`-j 1`).
+- **Parallel-task crashes vanished from results.** A task that raised on the
+  parallel path was only logged; it is now recorded as a FAIL with a
+  traceback, and stub/usage errors abort the run like the sequential path.
+
+### Added
+
+- **Baseline export carries trust columns**: per-run `trace_grade` (null when a
+  trace has no gradeable spans, so non-streaming tools don't get a fake 100)
+  and a submission-level `readiness` block + `trace_summary`. The published
+  `claude-code-custom-1.4.0-fast-check.json` now ships real, discriminating
+  trace grades (`no_out_of_scope_edits` ranges 17-100 across the 8 tasks).
+- **`grade_trace_or_none`** distinguishes a span-less trace from a genuinely
+  perfect one. `readiness_from_results` is shared by the leaderboard and export.
+- **Shell-execution trust boundary documented** in `docs/SECURITY.md`, with
+  per-task Docker isolation scoped as the next step toward community submissions.
+
+### Changed
+
+- **Aider is a real adapter** (`is_stub = False`); it gates on the binary being
+  installed. Aider's `--no-stream` means no trace spans, so its trace columns
+  report `null`.
+- **Runtime dependencies are exact-pinned** for reproducible installs.
+- **Trace `file.path` is relativized through symlinked workspaces** (macOS
+  `/tmp` -> `/private/tmp`), and `no_out_of_scope_edits` honors directory
+  entries (`tests/`) in `files_to_examine` instead of exact-set membership.
+- README refreshed to v1.4.0 (lead, install pin, baseline reference).
+
 ## 1.3.0 (2026-05-24)
 
 Storefront and trust release: fixes credibility leaks in the docs, adds real

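Reviewer note: the changelog entry for `TraceTranslator` describes walking nested `tool_use` blocks, emitting spans, and correlating Bash exit codes from `tool_result` events. The translator itself is not part of this excerpt, so the following is only a minimal sketch of that idea. The event shape, the tool-name sets, the `exit_code` field on `tool_result`, and the function names `translate` and `relativize` are all assumptions made for illustration, not the project's actual API.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any

# Assumed tool names; the real translator may recognize a different set.
EDIT_TOOLS = {"Edit", "Write"}
READ_TOOLS = {"Read"}


def relativize(path: str, workspace: str) -> str:
    """Return path relative to workspace, resolving symlinks on both sides."""
    resolved = Path(path).resolve()
    root = Path(workspace).resolve()
    try:
        return resolved.relative_to(root).as_posix()
    except ValueError:
        # Outside the workspace: keep the absolute path so it stays visible.
        return resolved.as_posix()


def translate(events: list[dict[str, Any]], workspace: str) -> list[dict[str, Any]]:
    """Turn nested tool_use / tool_result blocks into flat span dicts."""
    spans: list[dict[str, Any]] = []
    pending_bash: dict[str, dict[str, Any]] = {}  # tool_use id -> span awaiting a result
    for event in events:
        blocks = event.get("message", {}).get("content", [])
        if not isinstance(blocks, list):
            continue  # plain-text content carries no tool blocks
        for block in blocks:
            if not isinstance(block, dict):
                continue
            kind = block.get("type")
            if event.get("type") == "assistant" and kind == "tool_use":
                name = block.get("name")
                args = block.get("input", {})
                file_path = args.get("file_path")
                if name in EDIT_TOOLS and file_path:
                    spans.append({"kind": "FILE_EDIT", "file.path": relativize(file_path, workspace)})
                elif name in READ_TOOLS and file_path:
                    spans.append({"kind": "TOOL_USE", "file.path": relativize(file_path, workspace)})
                elif name == "Bash":
                    span = {"kind": "SHELL_COMMAND", "command": args.get("command", ""), "exit_code": None}
                    spans.append(span)
                    pending_bash[block.get("id", "")] = span
            elif kind == "tool_result":
                span = pending_bash.pop(block.get("tool_use_id", ""), None)
                if span is not None:
                    # Hypothetical field: how the exit code is actually carried is not shown in this diff.
                    span["exit_code"] = block.get("exit_code")
    return spans
```

Resolving both the file path and the workspace root before comparing them is what lets a symlinked workspace (for example macOS `/tmp` pointing at `/private/tmp`) still produce workspace-relative paths.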
diff --git a/CITATION.cff b/CITATION.cff
@@ -30,5 +30,5 @@ keywords:
   - workflow-evaluation
   - shipping-discipline
 license: MIT
-version: "1.2.0"
-date-released: "2026-04-27"
+version: "1.4.0"
+date-released: "2026-05-30"
diff --git a/README.md b/README.md
@@ -12,7 +12,7 @@
   <br/>
   <img src="demos/hero.gif" alt="awb run, awb leaderboard --readiness, awb gap output" width="820"/>
   <br/>
-  <sub>v1.2.0: task-set hash, OpenTelemetry-aligned trace artifacts, <code>awb trace grade</code>, and the Production Readiness Score.</sub>
+  <sub>v1.4.0: trace grading that actually grades, baselines with trust columns, real <code>-j</code> parallelism, and a documented security boundary.</sub>
 </div>
 
 ---
@@ -29,15 +29,16 @@ Related work measures complementary axes. [HAL](https://arxiv.org/abs/2510.11977
 
 AWB's distinct contribution is twofold: (1) a paired **vanilla-vs-custom** adapter pair that isolates the workflow-configuration delta for the same model, surfaced as a single Workflow Lift score with a sign-test p-value; (2) **deterministic** trace-grading rubrics (read-tests-before-edit, ran-verification-after-change, no-out-of-scope-edits, no-repeated-failing-loop) computed from OpenTelemetry-aligned `.trace.jsonl` artifacts, not LLM judges. See [METHODOLOGY.md#related-work](METHODOLOGY.md#related-work) for citation details.
 
-## What's New in v1.2.0
+## What's New in v1.4.0
 
-- **Task-set hash on every result** (`task_set_hash`, SHA-256 over the bundled task YAMLs) so you can prove which exact task set produced a given score.
-- **OpenTelemetry-aligned trace artifact**: every benchmark run writes a `.trace.jsonl` file using OTel GenAI semantic conventions (`gen_ai.client.operation`, `gen_ai.tool.use`) plus AWB-specific spans for shell commands, file edits, and test runs.
-- **`awb trace grade <run_dir>`** scores four shipping disciplines from the trace: read-tests-before-edit, ran-verification-after-change, no-out-of-scope-edits, no-repeated-failing-command-loop.
-- **Production Readiness Score** (`awb leaderboard --readiness`): weighted composite over 7 dimensions, calibrated for shipping safety.
-- **Strict result schema v2** (`additionalProperties: false`, required `schema_version`, `task_set_hash`). `awb migrate-results` extends to v1→v2.
-- **Task provenance fields**: optional `provenance.{source_pr_url, created_at, last_verified_at}`, `contamination_risk`, `label` (`real_pr` / `synthetic_overlay` / `mutated` / `fresh`) lay the groundwork for fresh-task harvesting in v1.3.
-- **Seven P0 trust blockers fixed**: token-budget fields now parse correctly, workflow schema enum matches the 9-adapter registry, setup cache key is order-sensitive, missing security scanners surface a warning instead of silently passing as clean, adapter `on_event` is properly typed as `Callable`, gap analysis classifies `regression_introduced` and `no_edits_made` separately.
+- **Trace grading actually grades now.** The runner used to emit only token spans, so all four trace rubrics scored a vacuous 100 on every run. It now translates Claude Code's nested `tool_use` blocks into `FILE_EDIT` / read / `SHELL_COMMAND` spans and correlates Bash exit codes — producing real, discriminating scores (the published baseline's `no_out_of_scope_edits` ranges 17-100 across the 8 tasks).
+- **Baselines carry the trust columns**: per-run `trace_grade` and a submission-level `readiness` + `trace_summary` block, so the published baseline showcases both flagship trust features. A span-less trace reports `null`, never a fake 100.
+- **`-j N` works on its own.** `-j>1` enables parallel mode (it used to be a silent no-op without `--parallel`); a crashed parallel task is now recorded as a FAIL with a traceback instead of vanishing from results.
+- **Aider is a real adapter** (`is_stub = False`), gated on the binary being installed.
+- **Exact-pinned runtime dependencies** for reproducible installs, plus invariant guard tests so the README lead, install pin, and baseline reference can't silently drift.
+- **Security posture documented** in [docs/SECURITY.md](docs/SECURITY.md): the shell-execution trust boundary and the per-task Docker isolation planned for safe community submissions.
+
+Carried over from v1.2.0-v1.3.0: public GitHub Pages leaderboard, task-set hash on every result, OpenTelemetry-aligned `.trace.jsonl` artifacts, `awb trace grade`, the Production Readiness Score, strict result schema v2, and reliability + provenance hardening.
 
 ## Quick Start
 
@@ -54,18 +55,18 @@ awb leaderboard --readiness --explain                 # Production Readiness Sco
 
 ### Five-minute reproducible demo
 
-Run this end-to-end against the published v1.2.0 baseline. Should finish in ~15 minutes for ~$4 of API spend and produce a tweetable Workflow Lift number plus a capability profile.
+Run this end-to-end against the published v1.4.0 fast-check baseline. Should finish in ~15 minutes for ~$4 of API spend and produce a tweetable Workflow Lift number plus a capability profile.
 
 ```bash
-pip install awb==1.2.0
+pip install awb==1.4.0
 awb quickstart                                       # 1. verify environment
 awb warmup --use-uv                                  # 2. pre-build templates
 awb run --fast-check claude-code-custom              # 3. ~15 min, ~$4, real run
 awb leaderboard --readiness --explain                # 4. composite readiness score
 awb trace grade results/runs/<run_id>/               # 5. behavior rubric scores
 ```
 
-Compare against the published baseline at `results/baselines/claude-code-custom-1.2.0.json`. Same `task_set_hash` means your numbers are directly comparable.
+Compare against the published baseline at `results/baselines/claude-code-custom-1.4.0-fast-check.json`. Same `task_set_hash` means your numbers are directly comparable.
 
 **New in v1.1.0:** `awb warmup` caches workspaces for 10-30x faster setup. `--fast-check` gives a quick signal in 15 min for ~$4. `--progressive` stops early on weak tools. `--use-uv` swaps pip for uv. See [Execution Modes](#execution-modes) below.
 
@@ -83,6 +84,8 @@ Clone repo at pinned SHA
 
 Each task starts from a fresh `git clone` at a pinned commit. Every tool gets the same prompt, the same timeout, and the same verification suite. Results are scored with sigmoid normalization so scores are never negative and never collapse at the boundary.
 
+> **Security:** AWB clones third-party repos and runs their setup/test code plus the AI tool with no sandbox. Treat task sets and their repos as trusted input and run in a disposable environment. See [docs/SECURITY.md](docs/SECURITY.md) for the trust boundary and the planned per-task Docker isolation.
+
 ## Scoring System
 
 Seven dimensions, sigmoid-normalized with per-task baselines derived from difficulty:
@@ -518,18 +521,18 @@ See [CHANGELOG.md](CHANGELOG.md) for the full history (v1.0.0, v0.5.x, v0.4.x, v
 
 ## Citing AWB
 
-If you use AWB in research, cite the version-specific Zenodo DOI. The concept DOI [`10.5281/zenodo.20361437`](https://doi.org/10.5281/zenodo.20361437) always resolves to the latest release; the version DOI below pins to v1.3.0 specifically. Machine-readable metadata lives in [CITATION.cff](CITATION.cff) and [codemeta.json](codemeta.json); release process is in [docs/zenodo-doi.md](docs/zenodo-doi.md).
+If you use AWB in research, cite it via Zenodo. The concept DOI [`10.5281/zenodo.20361437`](https://doi.org/10.5281/zenodo.20361437) always resolves to the latest release; each release also mints a version-specific DOI listed on the Zenodo record. Machine-readable metadata lives in [CITATION.cff](CITATION.cff) and [codemeta.json](codemeta.json); release process is in [docs/zenodo-doi.md](docs/zenodo-doi.md).
 
 ```bibtex
 @software{puspus_awb_2026,
   author    = {Puspus, Xavier},
   title     = {{AWB: AI Workflow Benchmark}},
-  version   = {1.3.0},
+  version   = {1.4.0},
   year      = {2026},
   month     = may,
   publisher = {Zenodo},
-  doi       = {10.5281/zenodo.20361438},
-  url       = {https://doi.org/10.5281/zenodo.20361438}
+  doi       = {10.5281/zenodo.20361437},
+  url       = {https://doi.org/10.5281/zenodo.20361437}
 }
 ```
 

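Reviewer note: the changelog states that `no_out_of_scope_edits` now honors directory entries such as `tests/` in `files_to_examine` instead of exact-set membership. The rubric code is not in this excerpt; the sketch below shows one way such a check could work. The helper names `in_scope` and `out_of_scope_edits` are hypothetical, and the convention that a trailing slash marks a directory entry is an assumption drawn from the `tests/` example.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def in_scope(path: str, files_to_examine: list[str]) -> bool:
    """True if a workspace-relative path matches a file entry or sits under a directory entry."""
    candidate = PurePosixPath(path)
    for entry in files_to_examine:
        allowed = PurePosixPath(entry)
        if entry.endswith("/"):
            # Directory entry: accept any file nested below it.
            if allowed in candidate.parents:
                return True
        elif candidate == allowed:
            return True
    return False


def out_of_scope_edits(edited: list[str], files_to_examine: list[str]) -> list[str]:
    """Return the edited paths that fall outside the declared scope."""
    return [p for p in edited if not in_scope(p, files_to_examine)]
```

Comparing path components rather than string prefixes keeps `tests_extra/a.py` from matching a `tests/` entry.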
diff --git a/awb/__init__.py b/awb/__init__.py
@@ -1,3 +1,3 @@
 """AI Workflow Benchmark - measure tool+workflow performance."""
 
-__version__ = "1.3.0"
+__version__ = "1.4.0"
diff --git a/awb/adapters/aider.py b/awb/adapters/aider.py
@@ -1,4 +1,4 @@
-"""Aider CLI adapter (placeholder)."""
+"""Aider CLI adapter."""
 
 from __future__ import annotations
 
@@ -10,15 +10,17 @@
 class AiderAdapter(ToolAdapter):
     """Aider CLI adapter.
 
-    Aider has a documented CLI: `aider --message <prompt> --yes --no-stream`.
-    A full implementation lives in this file (see execute() below); set
-    is_stub = False once Aider is installed and you have validated the
-    end-to-end run.
+    Real implementation over Aider's documented CLI:
+    `aider --message <prompt> --yes --no-stream`. `check_available()` gates on
+    the binary, so a missing Aider reports unavailable rather than crashing.
+    Note: `--no-stream` means Aider emits no incremental tool events, so its
+    `.trace.jsonl` has no gradeable spans and trace-grade columns show n/a for
+    Aider runs (see grade_trace_or_none).
     """
 
     name = "aider"
     display_name = "Aider"
-    is_stub = True  # flip to False once validated end-to-end on real tasks
+    is_stub = False
 
     async def execute(
         self,

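Reviewer note: the docstring above points at `grade_trace_or_none`, which per the changelog distinguishes a span-less trace from a genuinely perfect one. Its implementation is not in this excerpt. A minimal sketch of the idea follows, assuming spans are dicts with a `kind` key and that the gradeable kinds are the three named in the changelog; the name `grade_or_none_sketch` and the `grade` callback are hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

# Span kinds the changelog names as newly emitted; assumed here to be the gradeable set.
GRADEABLE_KINDS = {"FILE_EDIT", "TOOL_USE", "SHELL_COMMAND"}


def grade_or_none_sketch(
    spans: list[dict[str, Any]],
    grade: Callable[[list[dict[str, Any]]], float],
) -> Optional[float]:
    """Return None when nothing is gradeable, otherwise the rubric score."""
    gradeable = [s for s in spans if s.get("kind") in GRADEABLE_KINDS]
    if not gradeable:
        # A trace with only LLM_REQUEST spans has nothing to grade, so it must not score 100.
        return None
    return grade(gradeable)
```

Returning `None` rather than a default score is what lets a non-streaming tool such as Aider appear as ungraded instead of perfect.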
diff --git a/awb/commands/leaderboard_cmd.py b/awb/commands/leaderboard_cmd.py
@@ -11,12 +11,23 @@
 
 from awb.commands._shared import INFO, MUTED, console, emit_json, score_style
 
-# Named constants for the heuristic mapping from RunResult fields onto the
-# 7 readiness dimensions. Extract so they're documented + tunable in one place.
-REVIEW_BURDEN_FILES_TO_ZERO = 50.0  # ~50 modified files -> ~0 review-burden score
-MAINTAINABILITY_LINT_TO_ZERO = 20.0  # 20+ new lint warnings -> ~0 maintainability
-COST_USD_TO_ZERO = 5.0  # $5 per task -> ~0 cost score
-SPEED_SECONDS_TO_ZERO = 1800.0  # 30 min -> ~0 speed score
+# Heuristic thresholds for mapping RunResult fields onto the 7 readiness
+# dimensions. Single source of truth lives in awb.scoring.readiness; re-exported
+# here because this is where they're documented + tuned.
+from awb.scoring.readiness import (  # noqa: E402
+    COST_USD_TO_ZERO,
+    MAINTAINABILITY_LINT_TO_ZERO,
+    REVIEW_BURDEN_FILES_TO_ZERO,
+    SPEED_SECONDS_TO_ZERO,
+)
+
+__all__ = [
+    "COST_USD_TO_ZERO",
+    "MAINTAINABILITY_LINT_TO_ZERO",
+    "REVIEW_BURDEN_FILES_TO_ZERO",
+    "SPEED_SECONDS_TO_ZERO",
+    "leaderboard",
+]
 
 
 @dataclass
@@ -77,7 +88,7 @@ def leaderboard(output_dir: str | None, readiness: bool, explain: bool, fmt: str
 
 def _compute_readiness_scores() -> list[ToolReadiness]:
     from awb.core.results import ResultRecorder
-    from awb.scoring.readiness import compute_readiness_score
+    from awb.scoring.readiness import readiness_from_results
 
     recorder = ResultRecorder()
     runs = recorder.load_all_runs()
@@ -87,55 +98,19 @@ def _compute_readiness_scores() -> list[ToolReadiness]:
             by_tool.setdefault(r.tool, []).append(r)
     out: list[ToolReadiness] = []
     for tool in sorted(by_tool):
-        results = by_tool[tool]
-        n = len(results) or 1
-
-        def _mean(fn, _results=results, _n=n):
-            return sum(fn(r) for r in _results) / _n
-
-        correctness = 100.0 * sum(1 for r in results if r.outcome.success) / n
-        regression_safety = 100.0 * sum(1 for r in results if r.quality.test_regressions == 0) / n
-        security = 100.0 * sum(1 for r in results if r.quality.security_delta <= 0) / n
-        review_burden = max(
-            0.0,
-            100.0 - 100.0 * _mean(lambda r: r.metrics.files_modified) / REVIEW_BURDEN_FILES_TO_ZERO,
-        )
-        maintainability = max(
-            0.0,
-            100.0
-            - 100.0
-            * max(0.0, _mean(lambda r: r.quality.lint_delta))
-            / MAINTAINABILITY_LINT_TO_ZERO,
-        )
-        cost_score = max(
-            0.0,
-            100.0 - 100.0 * _mean(lambda r: r.cost.estimated_cost_usd) / COST_USD_TO_ZERO,
-        )
-        speed = max(
-            0.0,
-            100.0 - 100.0 * _mean(lambda r: r.metrics.wall_clock_seconds) / SPEED_SECONDS_TO_ZERO,
-        )
-        composite = compute_readiness_score(
-            correctness=correctness,
-            regression_safety=regression_safety,
-            security=security,
-            review_burden=review_burden,
-            maintainability=maintainability,
-            cost=cost_score,
-            speed=speed,
-        )
+        d = readiness_from_results(by_tool[tool])
         out.append(
             ToolReadiness(
                 tool=tool,
-                n_results=n,
-                composite=composite,
-                correctness=correctness,
-                regression_safety=regression_safety,
-                security=security,
-                review_burden=review_burden,
-                maintainability=maintainability,
-                cost=cost_score,
-                speed=speed,
+                n_results=d["n_results"] or 1,
+                composite=d["composite"],
+                correctness=d["correctness"],
+                regression_safety=d["regression_safety"],
+                security=d["security"],
+                review_burden=d["review_burden"],
+                maintainability=d["maintainability"],
+                cost=d["cost"],
+                speed=d["speed"],
             )
         )
     out.sort(key=lambda s: -s.composite)

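Reviewer note: the block removed above maps several per-run means onto 0-100 dimension scores with the same linear-to-zero shape, which `readiness_from_results` now owns. The sketch below restates that arithmetic using the threshold values from the removed constants. The helper `linear_to_zero` is a hypothetical name, and this is not the body of `readiness_from_results`, which this diff does not show.

```python
from __future__ import annotations

# Threshold values copied from the constants removed in this diff.
REVIEW_BURDEN_FILES_TO_ZERO = 50.0
MAINTAINABILITY_LINT_TO_ZERO = 20.0
COST_USD_TO_ZERO = 5.0
SPEED_SECONDS_TO_ZERO = 1800.0


def linear_to_zero(mean_value: float, zero_at: float) -> float:
    """Score 100 at zero, falling linearly to 0 at zero_at, clamped at 0."""
    return max(0.0, 100.0 - 100.0 * mean_value / zero_at)


# Worked examples with made-up means.
review_burden = linear_to_zero(5.0, REVIEW_BURDEN_FILES_TO_ZERO)
cost = linear_to_zero(1.25, COST_USD_TO_ZERO)
speed = linear_to_zero(900.0, SPEED_SECONDS_TO_ZERO)
# The removed code floors the mean lint delta at 0 before scaling, so fewer warnings cannot exceed 100.
maintainability = linear_to_zero(max(0.0, -2.0), MAINTAINABILITY_LINT_TO_ZERO)
```

With these inputs a mean of 5 modified files gives a review-burden score of 90, a mean cost of $1.25 gives 75, and a 15-minute mean wall clock gives 50.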
diff --git a/awb/commands/run.py b/awb/commands/run.py
@@ -185,11 +185,17 @@ def _run_both(
 @click.option("--capability", "-cap", help="Filter tasks by capability (e.g., security_awareness)")
 @click.option("--difficulty", "-d", help="Filter tasks by difficulty (easy, medium, hard)")
 @click.option("--runs", "-n", default=3, help="Number of runs per task")
-@click.option("--parallel", is_flag=True, help="Run tasks in parallel")
+@click.option("--parallel", is_flag=True, help="Run tasks in parallel (fans out to 4 by default)")
 @click.option("--dry-run", is_flag=True, help="Validate without executing")
 @click.option("--timeout", type=int, help="Override timeout (seconds)")
 @click.option("--resume", is_flag=True, help="Skip tasks that already have results")
-@click.option("-j", "--concurrency", type=int, default=4, help="Max parallel tasks (default: 4)")
+@click.option(
+    "-j",
+    "--concurrency",
+    type=int,
+    default=1,
+    help="Max parallel tasks; -j>1 enables parallel mode (default: 1, sequential)",
+)
 @click.option("--adaptive", is_flag=True, help="Only re-run near-miss tasks on runs 2+")
 @click.option("--progressive", is_flag=True, help="Run easy first, stop early if failing")
 @click.option("--fast-check", is_flag=True, help="Run 8 representative tasks for quick signal")
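Reviewer note: the changelog describes the intended interaction between the two flags: `-j>1` enables parallel mode on its own, `--parallel` alone fans out to 4, and the default stays sequential. The code that consumes the options is outside this hunk, so the following is a sketch of that decision table only. The function name `resolve_concurrency` is hypothetical, and treating values below 1 as sequential is an assumption.

```python
from __future__ import annotations

DEFAULT_PARALLEL_FANOUT = 4  # "--parallel alone fans out to 4" per the changelog


def resolve_concurrency(parallel: bool, concurrency: int) -> tuple[bool, int]:
    """Return (run_in_parallel, worker_count) from the two CLI options."""
    if concurrency > 1:
        # An explicit -j wins, with or without --parallel.
        return True, concurrency
    if parallel:
        return True, DEFAULT_PARALLEL_FANOUT
    # Default, and assumed fallback for -j values below 1: sequential.
    return False, 1
```

Changing the `-j` default from 4 to 1 is what makes this unambiguous: a value above 1 can only have come from the user, so it can safely imply parallel mode.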