Skip to content

v1.4.0: make trace grading real, fix -j parallelism, document security boundary#7

Merged
xmpuspus merged 9 commits into
mainfrom
v140-audit-fixes
May 30, 2026
Merged

v1.4.0: make trace grading real, fix -j parallelism, document security boundary#7
xmpuspus merged 9 commits into
mainfrom
v140-audit-fixes

Conversation

@xmpuspus

Copy link
Copy Markdown
Owner

A fresh product audit found AWB's headline differentiator — deterministic trace grading — was scoring a vacuous 100 on every run because the runner never emitted the spans the rubrics needed. This release makes the grader actually grade (validated against a real fast-check run), fixes two silent data-loss bugs, and tightens the storefront.

Fixed

  • Trace grader was vacuous. Runner emitted only LLM_REQUEST spans, so all 4 rubrics hit their trivial-pass branch and returned 100 every time. New TraceTranslator walks nested tool_use blocks in assistant content, emits FILE_EDIT/read/SHELL_COMMAND spans, correlates Bash exit codes from tool_result, and relativizes paths through symlinked workspaces (/tmp/private/tmp on macOS).
  • no_out_of_scope_edits honored only exact-set membership; now treats tests/ directory entries in files_to_examine as prefixes.
  • -j N was a silent no-op without --parallel; -j>1 now enables parallel mode, and a crashed parallel task is recorded as a FAIL with a traceback instead of vanishing.

Added

  • Baseline export carries per-run trace_grade + submission readiness/trace_summary. grade_trace_or_none reports null for span-less traces (no fake 100s). The published claude-code-custom-1.4.0-fast-check.json ships real, discriminating grades (no_out_of_scope_edits 17-100 across 8 tasks).
  • docs/SECURITY.md: shell-exec trust boundary + scoped per-task Docker isolation.

Changed

  • Aider is a real adapter (is_stub=False); runtime deps exact-pinned; invariant guard tests so README lead / install pin / baseline reference can't drift.

Verification

  • 274 tests pass (was 246), ruff clean.
  • Real fast-check run (6/8 pass, via claude -p) confirms traces now carry tool spans and grades are discriminating.

Reviewed by Xavier Puspus

xmpuspus added 9 commits May 30, 2026 18:14
The runner only emitted LLM_REQUEST spans (and a legacy top-level tool_use
path that real Claude Code never produces), so all four trace-grade rubrics
fell through to their trivial-pass branches and scored 100 on every run.

Add TraceTranslator: walks nested tool_use blocks in assistant content,
emits FILE_EDIT / read TOOL_USE / SHELL_COMMAND spans, and correlates Bash
exit codes from tool_result events. File paths are relativized against the
workspace so no_out_of_scope_edits matches files_to_examine. Trace grading
now reflects actual agent behavior.
-j was a silent no-op without --parallel. Now -j>1 enables parallel mode on
its own, --parallel alone fans out to 4, and the default stays sequential
(-j 1). The parallel path paired gather() exceptions only into a log line, so
a crashed task vanished from results; it now becomes a recorded FAIL with a
traceback, and stub/usage errors abort the run like the sequential path.
README led with v1.2.0, pinned pip install awb==1.2.0, and pointed at a
baseline file that no longer exists. Update the lead, What's New, hero
subtitle, and demo block to v1.3.0 and the real committed fast-check
baseline. Exact-pin the 6 runtime deps so a reproducibility benchmark
resolves identically on every install. Add invariant guard tests so the
lead version, install pin, baseline reference, and exact-pinning can't
silently drift again.
Add grade_trace_or_none so a span-less trace (a non-streaming tool, or a run
predating the span fix) reports null instead of a misleading perfect 100.
build_submission now embeds per-run trace_grade and a submission-level
readiness block + trace_summary, so a regenerated baseline showcases both
trust features. readiness_from_results is shared by the leaderboard and
export. Flip aider is_stub=False (it ships a real CLI execute; gates on the
binary). Regenerate the committed fast-check baseline through the new builder:
real readiness composite, null trace grades (those traces predate the fix).
AWB clones third-party repos and runs their setup/test code plus the AI tool
with no sandbox. docs/SECURITY.md states the trust boundary explicitly, lists
precautions, and scopes per-task Docker isolation as the v1.4 path to safe
community submissions. README links it; CHANGELOG records the unreleased work.
…tching

Two bugs surfaced by the first real run with tool spans flowing:
- file.path kept absolute paths because the workspace lives under /tmp (a
  symlink to /private/tmp on macOS) while the agent reports the resolved path,
  so the literal prefix-strip missed. _rel now also tries realpath(root).
- no_out_of_scope_edits used exact-set membership, so an edit to tests/foo.py
  never matched a 'tests/' directory entry in files_to_examine. Add
  _path_in_scope with trailing-slash directory prefixes.
Real claude-code-custom fast-check run (6/8 pass, via claude -p subscription)
re-exported through the fixed builder. Trace grades are now populated and
discriminating: no_out_of_scope_edits ranges 17-100 (MF-001=17 created
httpx/_cache.py + edited __init__.py, both outside its declared scope),
read_tests_before_edit 25%, readiness composite 86.0. Replaces the prior
null trace grades that came from pre-fix traces.
Bump to 1.4.0. Cut the CHANGELOG Unreleased section to 1.4.0; refresh README
What's New + hero + install pin + baseline reference to 1.4.0; bring
CITATION.cff/codemeta current (were stale at 1.2.0); cite the always-latest
concept DOI. Rename the published baseline to 1.4.0 (stamps awb_version 1.4.0)
with real discriminating trace grades.
@xmpuspus xmpuspus merged commit 49fe01c into main May 30, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant