v1.4.0 — real trace grading

xmpuspus released this 30 May 12:50
49fe01c

Audit-driven trust-fix release.

Fixed

  • The trace grader was vacuous: the runner emitted only token spans, so all 4 rubrics scored 100 on every run. The runner now translates Claude Code tool events into FILE_EDIT/read/SHELL_COMMAND spans (correlating Bash exit codes) and relativizes paths through symlinked workspaces. Validated against a real fast-check run.
  • -j N was a silent no-op without --parallel; it now enables parallel mode on its own. Crashed parallel tasks are recorded as FAILs instead of vanishing.
  • no_out_of_scope_edits now honors tests/ directory entries in files_to_examine.
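
As a rough illustration of the trace-grader fix above, the sketch below shows one way tool events could be turned into spans. It is not the project's actual code: the event shape (type, id, tool, path, command, exit_code), the tool-name mapping, the FILE_READ span name, and the helper names relativize and events_to_spans are all assumptions made for this example.

```python
import os
import tempfile

# Hypothetical mapping from tool names to span kinds; the real names may differ.
TOOL_TO_SPAN = {
    "Edit": "FILE_EDIT",
    "Write": "FILE_EDIT",
    "Read": "FILE_READ",
    "Bash": "SHELL_COMMAND",
}


def relativize(path, workspace):
    """Resolve symlinks on both sides, then express path relative to workspace."""
    real_ws = os.path.realpath(workspace)
    real_path = os.path.realpath(os.path.join(workspace, path))
    return os.path.relpath(real_path, real_ws)


def events_to_spans(events, workspace):
    """Translate assumed tool_use/tool_result events into a list of span dicts."""
    spans = []
    pending_shell = {}  # tool-use id -> shell span still waiting for its exit code
    for ev in events:
        if ev["type"] == "tool_use":
            kind = TOOL_TO_SPAN.get(ev["tool"])
            if kind is None:
                continue  # events with no span equivalent are skipped
            if kind == "SHELL_COMMAND":
                span = {"kind": kind, "command": ev["command"], "exit_code": None}
                pending_shell[ev["id"]] = span
            else:
                span = {"kind": kind, "path": relativize(ev["path"], workspace)}
            spans.append(span)
        elif ev["type"] == "tool_result":
            # Correlate the result with the shell span that started it.
            span = pending_shell.pop(ev.get("id"), None)
            if span is not None:
                span["exit_code"] = ev.get("exit_code")
    return spans
```

Resolving both the workspace and the edited path with os.path.realpath before calling os.path.relpath means an edit reached through a symlinked workspace still comes out as a workspace-relative path rather than one that climbs out with "..".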

Added

  • Baselines now carry a per-run trace_grade plus submission readiness/trace_summary (null for span-less traces, so no fake 100s). The published claude-code-custom-1.4.0-fast-check.json ships real, discriminating grades.
  • docs/SECURITY.md: shell-exec trust boundary + scoped per-task Docker isolation.
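
The "null for span-less traces" behavior can be pictured with a minimal sketch. It assumes a grade is a dict of rubric scores and that an empty span list yields None, which serializes to JSON null. The function trace_grade, the rubric shell_commands_succeed, and the 0/100 scoring are hypothetical stand-ins, not the project's real rubrics.

```python
import json


def shell_commands_succeed(spans):
    """Hypothetical rubric: 100 if every shell span exited with 0, else 0."""
    shell = [s for s in spans if s["kind"] == "SHELL_COMMAND"]
    return 100 if all(s["exit_code"] == 0 for s in shell) else 0


def trace_grade(spans, rubrics):
    """Return per-rubric scores, or None when there are no spans to grade."""
    if not spans:
        # With nothing to inspect, every rubric would pass trivially,
        # so report "no grade" instead of a perfect score.
        return None
    return {name: rubric(spans) for name, rubric in rubrics.items()}


rubrics = {"shell_commands_succeed": shell_commands_succeed}
baseline_entry = {"trace_grade": trace_grade([], rubrics)}
print(json.dumps(baseline_entry))
```

Returning None rather than a score keeps an ungradable run distinguishable from a run that genuinely passed every rubric.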

Changed

  • Aider is now a real adapter; runtime deps are exact-pinned; invariant guard tests cover storefront drift.

274 tests, ruff clean.