v1.4.0: make trace grading real, fix -j parallelism, document security boundary by xmpuspus · Pull Request #7 · xmpuspus/ai-workflow-benchmark

xmpuspus · 2026-05-30T12:48:28Z

A fresh product audit found AWB's headline differentiator — deterministic trace grading — was scoring a vacuous 100 on every run because the runner never emitted the spans the rubrics needed. This release makes the grader actually grade (validated against a real fast-check run), fixes two silent data-loss bugs, and tightens the storefront.

Fixed

Trace grader was vacuous. The runner emitted only LLM_REQUEST spans, so all 4 rubrics hit their trivial-pass branch and returned 100 every time. The new TraceTranslator walks nested tool_use blocks in assistant content, emits FILE_EDIT, read TOOL_USE, and SHELL_COMMAND spans, correlates Bash exit codes from tool_result, and relativizes paths through symlinked workspaces (/tmp → /private/tmp on macOS).
no_out_of_scope_edits honored only exact-set membership; now treats tests/ directory entries in files_to_examine as prefixes.
-j N was a silent no-op without --parallel; -j>1 now enables parallel mode, and a crashed parallel task is recorded as a FAIL with a traceback instead of vanishing.

Added

Baseline export carries per-run trace_grade + submission readiness/trace_summary. grade_trace_or_none reports null for span-less traces (no fake 100s). The published claude-code-custom-1.4.0-fast-check.json ships real, discriminating grades (no_out_of_scope_edits 17-100 across 8 tasks).
docs/SECURITY.md: shell-exec trust boundary + scoped per-task Docker isolation.

Changed

Aider is a real adapter (is_stub=False); runtime deps exact-pinned; invariant guard tests so README lead / install pin / baseline reference can't drift.

Verification

274 tests pass (was 246), ruff clean.
Real fast-check run (6/8 pass, via claude -p) confirms traces now carry tool spans and grades are discriminating.

Reviewed by Xavier Puspus

The runner only emitted LLM_REQUEST spans (and a legacy top-level tool_use path that real Claude Code never produces), so all four trace-grade rubrics fell through to their trivial-pass branches and scored 100 on every run. Add TraceTranslator: walks nested tool_use blocks in assistant content, emits FILE_EDIT / read TOOL_USE / SHELL_COMMAND spans, and correlates Bash exit codes from tool_result events. File paths are relativized against the workspace so no_out_of_scope_edits matches files_to_examine. Trace grading now reflects actual agent behavior.
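The translation step described above can be sketched roughly as follows. This is a minimal illustration, not the project's TraceTranslator: the event shape (a `message.content` list of blocks), the tool names in `EDIT_TOOLS` and `READ_TOOLS`, the attribute keys, and the mapping of `is_error` to an exit code are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    kind: str  # "FILE_EDIT", "TOOL_USE", or "SHELL_COMMAND"
    attrs: dict[str, Any] = field(default_factory=dict)


# Assumed tool names; the real adapter may recognise a different set.
EDIT_TOOLS = {"Edit", "Write"}
READ_TOOLS = {"Read"}


def translate(events: list[dict[str, Any]]) -> list[Span]:
    """Walk nested content blocks and emit one span per tool call."""
    spans: list[Span] = []
    pending_bash: dict[Any, Span] = {}  # tool_use id -> shell span awaiting its result

    for event in events:
        content = event.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue  # plain-text content carries no tool blocks
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use":
                name = block.get("name")
                args = block.get("input", {})
                if name in EDIT_TOOLS:
                    spans.append(Span("FILE_EDIT", {"file.path": args.get("file_path")}))
                elif name in READ_TOOLS:
                    spans.append(Span("TOOL_USE", {"tool": "read", "file.path": args.get("file_path")}))
                elif name == "Bash":
                    span = Span("SHELL_COMMAND", {"command": args.get("command"), "exit_code": None})
                    spans.append(span)
                    pending_bash[block.get("id")] = span
            elif block.get("type") == "tool_result":
                span = pending_bash.pop(block.get("tool_use_id"), None)
                if span is not None:
                    # Assumption: only an error flag is available, so map it to 0/1.
                    span.attrs["exit_code"] = 1 if block.get("is_error") else 0
    return spans
```

Keeping the pending shell spans keyed by tool_use id is what lets a later tool_result fill in the exit code without reordering the span list.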

-j was a silent no-op without --parallel. Now -j>1 enables parallel mode on its own, --parallel alone fans out to 4, and the default stays sequential (-j 1). The parallel path only wrote gather() exceptions to a log line, so a crashed task vanished from results; a crashed task is now recorded as a FAIL with its traceback, and stub/usage errors abort the run as they do on the sequential path.
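A minimal sketch of the two behaviors described above, assuming the parallel path is built on asyncio.gather. The names `resolve_jobs` and `run_all` and the result dictionary shape are hypothetical, and the abort-on-stub/usage-error rule is left out for brevity.

```python
from __future__ import annotations

import asyncio
import traceback
from typing import Any, Awaitable, Callable


def resolve_jobs(jobs: int = 1, parallel: bool = False) -> int:
    """-j N (N > 1) enables parallelism by itself; --parallel alone means 4."""
    if jobs > 1:
        return jobs
    return 4 if parallel else 1


async def run_all(
    task_ids: list[str],
    run_one: Callable[[str], Awaitable[dict[str, Any]]],
    jobs: int,
) -> list[dict[str, Any]]:
    """Run tasks with bounded concurrency; a crash becomes a FAIL record."""
    sem = asyncio.Semaphore(jobs)

    async def guarded(task_id: str) -> dict[str, Any]:
        async with sem:
            return await run_one(task_id)

    outcomes = await asyncio.gather(
        *(guarded(t) for t in task_ids), return_exceptions=True
    )

    results: list[dict[str, Any]] = []
    for task_id, outcome in zip(task_ids, outcomes):
        if isinstance(outcome, BaseException):
            # Keep the crashed task in the results instead of only logging it.
            tb = "".join(
                traceback.format_exception(type(outcome), outcome, outcome.__traceback__)
            )
            results.append({"task": task_id, "status": "FAIL", "error": tb})
        else:
            results.append(outcome)
    return results
```

Because gather() preserves argument order, zipping outcomes back onto task_ids is enough to attribute each exception to the task that raised it.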

README led with v1.2.0, pinned pip install awb==1.2.0, and pointed at a baseline file that no longer exists. Update the lead, What's New, hero subtitle, and demo block to v1.3.0 and the real committed fast-check baseline. Exact-pin the 6 runtime deps so a reproducibility benchmark resolves identically on every install. Add invariant guard tests so the lead version, install pin, baseline reference, and exact-pinning can't silently drift again.
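An exact-pin guard of the kind mentioned above might look like the sketch below. The regular expression, the `unpinned` helper, and the sample dependency strings are illustrative assumptions; the project's actual guard tests and dependency list are not shown in this PR text.

```python
from __future__ import annotations

import re

# name, optional [extras], then "==" and a concrete version (no wildcards or ranges).
EXACT_PIN = re.compile(
    r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,.\-]+\])?==[A-Za-z0-9_.!+\-]+$"
)


def unpinned(deps: list[str]) -> list[str]:
    """Return every requirement string that is not an exact '==' pin."""
    return [d for d in deps if not EXACT_PIN.match(d.strip())]
```

A guard test would then assert that `unpinned(runtime_deps)` is empty, so a loosened specifier fails CI rather than drifting unnoticed.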

Add grade_trace_or_none so a span-less trace (a non-streaming tool, or a run predating the span fix) reports null instead of a misleading perfect 100. build_submission now embeds per-run trace_grade and a submission-level readiness block + trace_summary, so a regenerated baseline showcases both trust features. readiness_from_results is shared by the leaderboard and export. Flip aider is_stub=False (it ships a real CLI execute; gates on the binary). Regenerate the committed fast-check baseline through the new builder: real readiness composite, null trace grades (those traces predate the fix).
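The null-instead-of-100 rule can be illustrated with a small wrapper. This is a sketch under assumptions: `grade_or_none` is a hypothetical stand-in rather than the project's grade_trace_or_none, spans are modeled as plain dictionaries, and "span-less" is taken to mean that no tool span of the kinds in `GRADED_KINDS` is present.

```python
from __future__ import annotations

import json
from typing import Any, Callable, Optional

# Assumed set of span kinds the rubrics need in order to say anything meaningful.
GRADED_KINDS = {"FILE_EDIT", "TOOL_USE", "SHELL_COMMAND"}


def grade_or_none(
    spans: list[dict[str, Any]],
    grade: Callable[[list[dict[str, Any]]], float],
) -> Optional[float]:
    """Grade only when there is evidence to grade; otherwise report None."""
    if not any(s.get("kind") in GRADED_KINDS for s in spans):
        return None  # serialises to JSON null rather than a vacuous 100
    return grade(spans)


def export_run(spans: list[dict[str, Any]], grade: Callable[..., float]) -> str:
    """Embed the per-run grade in a JSON record (hypothetical export shape)."""
    return json.dumps({"trace_grade": grade_or_none(spans, grade)})
```

Returning None keeps "no evidence" distinguishable from "perfect score" in the exported baseline.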

AWB clones third-party repos and runs their setup/test code plus the AI tool with no sandbox. docs/SECURITY.md states the trust boundary explicitly, lists precautions, and scopes per-task Docker isolation as the v1.4 path to safe community submissions. README links it; CHANGELOG records the unreleased work.

…tching

Two bugs surfaced by the first real run with tool spans flowing:

- file.path kept absolute paths because the workspace lives under /tmp (a symlink to /private/tmp on macOS) while the agent reports the resolved path, so the literal prefix-strip missed. _rel now also tries realpath(root).
- no_out_of_scope_edits used exact-set membership, so an edit to tests/foo.py never matched a 'tests/' directory entry in files_to_examine. Add _path_in_scope with trailing-slash directory prefixes.
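Both fixes can be sketched in a few lines. The functions below are illustrative stand-ins for _rel and _path_in_scope, written from the description above rather than from the project's code, and their exact signatures are assumptions.

```python
from __future__ import annotations

import os


def rel(path: str, root: str) -> str:
    """Strip the workspace prefix, trying both the literal and the resolved root."""
    for base in (root, os.path.realpath(root)):
        prefix = base.rstrip(os.sep) + os.sep
        if path.startswith(prefix):
            return path[len(prefix):]
    return path  # outside the workspace: leave the path untouched


def path_in_scope(path: str, scope: list[str]) -> bool:
    """Entries ending in '/' match as directory prefixes; others match exactly."""
    for entry in scope:
        if entry.endswith("/"):
            if path.startswith(entry):
                return True
        elif path == entry:
            return True
    return False
```

Requiring the trailing slash for prefix matching means a `tests/` entry covers `tests/foo.py` but not a sibling such as `tests_extra/foo.py`.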

Real claude-code-custom fast-check run (6/8 pass, via claude -p subscription) re-exported through the fixed builder. Trace grades are now populated and discriminating: no_out_of_scope_edits ranges 17-100 (MF-001=17 created httpx/_cache.py + edited __init__.py, both outside its declared scope), read_tests_before_edit 25%, readiness composite 86.0. Replaces the prior null trace grades that came from pre-fix traces.

Bump to 1.4.0. Cut the CHANGELOG Unreleased section to 1.4.0; refresh README What's New + hero + install pin + baseline reference to 1.4.0; bring CITATION.cff/codemeta current (were stale at 1.2.0); cite the always-latest concept DOI. Rename the published baseline to 1.4.0 (stamps awb_version 1.4.0) with real discriminating trace grades.

xmpuspus added 9 commits May 30, 2026 18:14

ruff format translate.py

a144258

xmpuspus merged commit 49fe01c into main May 30, 2026
4 checks passed

xmpuspus merged 9 commits into main from v140-audit-fixes
