v1.4.0: make trace grading real, fix -j parallelism, document security boundary #7
Merged
The runner only emitted LLM_REQUEST spans (and a legacy top-level tool_use path that real Claude Code never produces), so all four trace-grade rubrics fell through to their trivial-pass branches and scored 100 on every run. Add TraceTranslator: walks nested tool_use blocks in assistant content, emits FILE_EDIT / read TOOL_USE / SHELL_COMMAND spans, and correlates Bash exit codes from tool_result events. File paths are relativized against the workspace so no_out_of_scope_edits matches files_to_examine. Trace grading now reflects actual agent behavior.
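A minimal sketch of the translation step described above, not the actual `TraceTranslator`. The event shape (an `assistant` message whose `content` list holds `tool_use` blocks, followed by a message carrying `tool_result` blocks with a matching `tool_use_id`), the tool-name groupings, the span dictionaries, and the rule deriving an exit code from `is_error` are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical tool-name groupings; the real mapping is not shown in this PR.
EDIT_TOOLS = {"Edit", "Write"}
READ_TOOLS = {"Read"}
SHELL_TOOLS = {"Bash"}


def translate(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn stream events into spans, pairing Bash calls with their results."""
    spans: List[Dict[str, Any]] = []
    pending_shell: Dict[Any, Dict[str, Any]] = {}  # tool_use id -> span awaiting a result

    for event in events:
        content = event.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue  # plain-string content carries no tool blocks
        for block in content:
            if not isinstance(block, dict):
                continue
            kind = block.get("type")
            if kind == "tool_use":
                name = block.get("name")
                args = block.get("input", {})
                if name in EDIT_TOOLS:
                    spans.append({"type": "FILE_EDIT", "file.path": args.get("file_path")})
                elif name in READ_TOOLS:
                    spans.append({"type": "TOOL_USE", "tool": "read",
                                  "file.path": args.get("file_path")})
                elif name in SHELL_TOOLS:
                    span = {"type": "SHELL_COMMAND", "command": args.get("command"),
                            "exit_code": None}
                    spans.append(span)
                    pending_shell[block.get("id")] = span
            elif kind == "tool_result":
                span = pending_shell.pop(block.get("tool_use_id"), None)
                if span is not None:
                    # Assumption: no numeric exit code in the event, so map is_error to 1/0.
                    span["exit_code"] = 1 if block.get("is_error") else 0
    return spans
```

Keeping the pending shell spans in a dictionary keyed by the `tool_use` id lets a result arrive any number of events later and still be attached to the right command.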
-j was a silent no-op without --parallel. Now -j>1 enables parallel mode on its own, --parallel alone fans out to 4, and the default stays sequential (-j 1). The parallel path only wrote gather() exceptions to a log line, so a crashed task vanished from results; a crash is now recorded as a FAIL with a traceback, and stub/usage errors abort the run as they do on the sequential path.
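The flag resolution and the crash handling can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: `resolve_jobs`, `run_all`, `UsageError`, and the result dictionary shape are hypothetical names, and letting an explicit `-j` win over `--parallel` is a guess at the precedence.

```python
import argparse
import asyncio
import traceback
from typing import Any, Awaitable, Callable, Dict, List, Optional


class UsageError(Exception):
    """Hypothetical stand-in for the stub/usage errors that should abort a run."""


def resolve_jobs(argv: Optional[List[str]] = None) -> int:
    """Return the worker count: -j N wins, --parallel alone means 4, default 1."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-j", dest="jobs", type=int, default=None)
    parser.add_argument("--parallel", action="store_true")
    args = parser.parse_args(argv)
    if args.jobs is not None:
        return max(1, args.jobs)
    return 4 if args.parallel else 1


async def run_all(
    task_ids: List[str],
    run_one: Callable[[str], Awaitable[Dict[str, Any]]],
    jobs: int,
) -> List[Dict[str, Any]]:
    """Run tasks with bounded concurrency; a crash becomes a FAIL record."""
    sem = asyncio.Semaphore(jobs)

    async def guarded(task_id: str) -> Dict[str, Any]:
        async with sem:
            return await run_one(task_id)

    outcomes = await asyncio.gather(
        *(guarded(t) for t in task_ids), return_exceptions=True
    )
    results: List[Dict[str, Any]] = []
    for task_id, outcome in zip(task_ids, outcomes):
        if isinstance(outcome, UsageError):
            raise outcome  # abort the whole run instead of recording a result
        if isinstance(outcome, BaseException):
            tb = "".join(
                traceback.format_exception(type(outcome), outcome, outcome.__traceback__)
            )
            results.append({"task": task_id, "status": "FAIL", "error": tb})
        else:
            results.append(outcome)
    return results
```

Because `gather()` returns results in the order its awaitables were passed, zipping the outcomes with `task_ids` keeps every crashed task attributable to its id.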
README led with v1.2.0, pinned pip install awb==1.2.0, and pointed at a baseline file that no longer exists. Update the lead, What's New, hero subtitle, and demo block to v1.3.0 and the real committed fast-check baseline. Exact-pin the 6 runtime deps so a reproducibility benchmark resolves identically on every install. Add invariant guard tests so the lead version, install pin, baseline reference, and exact-pinning can't silently drift again.
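One way an exact-pinning guard could look, as a rough sketch rather than the tests added in this PR. The helper name `unpinned`, the regular expression, and the package names in the usage note are illustrative assumptions; the six real runtime dependencies are not listed here.

```python
import re
from typing import List

# Matches "name==version" (optionally "name[extra]==version") and nothing looser.
# Wildcards such as "==1.*" and environment markers are deliberately rejected.
EXACT_PIN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=<>~!*,;\s]+$")


def unpinned(requirements: List[str]) -> List[str]:
    """Return every requirement string that is not an exact == pin."""
    return [req for req in requirements if not EXACT_PIN.match(req.strip())]
```

A guard test would then assert that `unpinned(deps)` is empty. For example, `unpinned(["example-a==1.2.3", "example-b>=2.0"])` returns `["example-b>=2.0"]`.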
Add grade_trace_or_none so a span-less trace (a non-streaming tool, or a run predating the span fix) reports null instead of a misleading perfect 100. build_submission now embeds per-run trace_grade and a submission-level readiness block + trace_summary, so a regenerated baseline showcases both trust features. readiness_from_results is shared by the leaderboard and export. Flip aider is_stub=False (it ships a real CLI execute; gates on the binary). Regenerate the committed fast-check baseline through the new builder: real readiness composite, null trace grades (those traces predate the fix).
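The null-instead-of-100 behavior can be sketched as below. Only the name `grade_trace_or_none` comes from this PR; the signature, the span dictionaries, and the reading of "span-less" as "no spans other than `LLM_REQUEST`" are assumptions for illustration.

```python
import json
from typing import Any, Callable, Dict, List, Optional

# Assumption: spans of these types carry no tool behavior to grade.
NON_TOOL_SPAN_TYPES = {"LLM_REQUEST"}


def grade_trace_or_none(
    spans: List[Dict[str, Any]],
    grade: Callable[[List[Dict[str, Any]]], Dict[str, float]],
) -> Optional[Dict[str, float]]:
    """Grade a trace only if it has tool spans; otherwise report None."""
    has_tool_spans = any(s.get("type") not in NON_TOOL_SPAN_TYPES for s in spans)
    if not has_tool_spans:
        return None  # serializes to JSON null rather than a vacuous 100
    return grade(spans)


def to_json(trace_grade: Optional[Dict[str, float]]) -> str:
    """Show how the result would appear when embedded in a submission."""
    return json.dumps({"trace_grade": trace_grade})
```

Returning `None` keeps "not gradable" distinct from "graded and perfect", so a consumer of the submission JSON cannot mistake a pre-fix trace for a flawless run.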
AWB clones third-party repos and runs their setup/test code plus the AI tool with no sandbox. docs/SECURITY.md states the trust boundary explicitly, lists precautions, and scopes per-task Docker isolation as the v1.4 path to safe community submissions. README links it; CHANGELOG records the unreleased work.
…tching
Two bugs surfaced by the first real run with tool spans flowing:
- file.path kept absolute paths because the workspace lives under /tmp (a symlink to /private/tmp on macOS) while the agent reports the resolved path, so the literal prefix-strip missed. _rel now also tries realpath(root).
- no_out_of_scope_edits used exact-set membership, so an edit to tests/foo.py never matched a 'tests/' directory entry in files_to_examine. Add _path_in_scope with trailing-slash directory prefixes.
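Both fixes can be sketched in a few lines. The function names mirror `_rel` and `_path_in_scope` from the commit message, but the bodies and signatures below are assumptions about how such helpers might work, not the committed code.

```python
import os
from typing import Iterable


def rel(path: str, root: str) -> str:
    """Strip the workspace prefix, trying the literal root and its resolved form."""
    for base in (root, os.path.realpath(root)):
        prefix = base.rstrip(os.sep) + os.sep
        if path.startswith(prefix):
            return path[len(prefix):]
    return path  # outside the workspace: leave the path untouched


def path_in_scope(path: str, scope: Iterable[str]) -> bool:
    """Exact match for file entries, prefix match for entries ending in '/'."""
    for entry in scope:
        if entry.endswith("/"):
            if path.startswith(entry):
                return True
        elif path == entry:
            return True
    return False
```

With these rules, `path_in_scope("tests/foo.py", ["tests/"])` is true, while an edit to a file under a directory that is not listed stays out of scope. Requiring the trailing slash avoids treating `tests_extra/x.py` as inside `tests/`.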
Real claude-code-custom fast-check run (6/8 pass, via claude -p subscription) re-exported through the fixed builder. Trace grades are now populated and discriminating: no_out_of_scope_edits ranges 17-100 (MF-001=17 created httpx/_cache.py + edited __init__.py, both outside its declared scope), read_tests_before_edit 25%, readiness composite 86.0. Replaces the prior null trace grades that came from pre-fix traces.
Bump to 1.4.0. Cut the CHANGELOG Unreleased section to 1.4.0; refresh README What's New + hero + install pin + baseline reference to 1.4.0; bring CITATION.cff/codemeta current (were stale at 1.2.0); cite the always-latest concept DOI. Rename the published baseline to 1.4.0 (stamps awb_version 1.4.0) with real discriminating trace grades.
A fresh product audit found AWB's headline differentiator — deterministic trace grading — was scoring a vacuous 100 on every run because the runner never emitted the spans the rubrics needed. This release makes the grader actually grade (validated against a real fast-check run), fixes two silent data-loss bugs, and tightens the storefront.
Fixed
- The runner emitted only `LLM_REQUEST` spans, so all 4 rubrics hit their trivial-pass branch and returned 100 every time. New `TraceTranslator` walks nested `tool_use` blocks in `assistant` content, emits `FILE_EDIT` / read / `SHELL_COMMAND` spans, correlates Bash exit codes from `tool_result`, and relativizes paths through symlinked workspaces (`/tmp` → `/private/tmp` on macOS).
- `no_out_of_scope_edits` honored only exact-set membership; it now treats `tests/` directory entries in `files_to_examine` as prefixes.
- `-j N` was a silent no-op without `--parallel`; `-j>1` now enables parallel mode, and a crashed parallel task is recorded as a FAIL with a traceback instead of vanishing.

Added
- Per-run `trace_grade` + submission `readiness` / `trace_summary`. `grade_trace_or_none` reports `null` for span-less traces (no fake 100s). The published `claude-code-custom-1.4.0-fast-check.json` ships real, discriminating grades (`no_out_of_scope_edits` 17-100 across 8 tasks).
- `docs/SECURITY.md`: shell-exec trust boundary + scoped per-task Docker isolation.

Changed
- aider is no longer a stub (`is_stub=False`); runtime deps exact-pinned; invariant guard tests so README lead / install pin / baseline reference can't drift.

Verification
- A real fast-check run (via `claude -p`) confirms traces now carry tool spans and grades are discriminating.

Reviewed by Xavier Puspus