feat(session): tolerate throughput-only config diffs on --resume#4
Open
ethan-scitix wants to merge 10 commits into
Open
feat(session): tolerate throughput-only config diffs on --resume#4ethan-scitix wants to merge 10 commits into
ethan-scitix wants to merge 10 commits into
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Effective-config strict-match now strips throughput/orchestration knobs (concurrency, retries, buffering, profiling, progress) before comparison and rewrites the body with new values while preserving the original header. Result- and disk-layout-affecting fields stay strict; infer_plans stays byte-for-byte strict. Also extends _strip_throughput_fields to strip top-level result_dir (a nonmatch field — the resume target's own location, never affects what results are produced). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rable_fields Also strips top-level result_dir (non-comparable location), so the name now reflects scope; document the result_dir-strip dependency in two resume tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
max_retries, profile_*, detect_anomalies*, and the progress.json dump (dump_progress / progress_dump_interval) write or change on-disk content — the failure-signal FAILED record, the profiler summary, the anomaly report, the progress file — so they must match on resume. The adjustable set is now only pure scheduling (concurrency, shard-I/O parallelism, write-buffer timing) plus console-only progress (show_progress, log cadence), which never touch disk. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three gaps in the --resume throughput carve-out, found in review: - Strip throughput keys from the top-level `runner_config` defaults block, not just `tasks.*.runner_config`. Real leaderboard configs set `concurrency_limits` there, so bumping it on resume wrongly aborted — the carve-out only covered the per-task and top-level-scalar forms. - Abort cleanly when a tampered persisted file parses to a non-mapping: guard `isinstance(dict)` so the strip can't raise a bare AttributeError, keeping RuntimeError the only failure the caller observes. - On a tolerated rewrite, append a timestamped `Resumed by ...` record of the changed fields to the header instead of silently keeping the old one. Origin provenance is preserved and the lineage accumulates across resumes; result_dir (a reification-injected location field) is excluded as noise. Extract `_diff_lines` from `_diff_dicts` and add `_append_resume_note` (inserts inside the border pair so `_split_header` stays correct). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5aef482 to
7df5eff
Compare
Clarify the resume strict-match annotations and condense wordy/duplicated docstrings around the throughput carve-out. Comment-only — no behavior change. - _NONMATCH_RUNNER_KEYS: replace the overstated "Never compared" note with what the strip actually enforces (only top-level result_dir is dropped; the rest are never reached because they don't survive into a persisted runner_config block) and the strict-compare edge if one is hand-authored. - _strip_noncomparable_fields docstring: drop the two-runner_config-locations tail that duplicated the adjacent inline comment; keep the "merged into every task" nuance once, at the code site. - _append_resume_note docstring: tighten lineage/precondition wording. - Persist guard comments: explain the header-less / formatting-only no-note cases, and trim the non-mapping guard's restated RuntimeError contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Type
Summary
--resumestrict-match guard so a resume no longer aborts when only pure scheduling / console-progress knobs differ from the persistedeffective_config.yaml— concurrency (concurrency_limit/concurrency_limits), shard read/write concurrency, write-buffer sizing/flush, and console progress (show_progress, log intervals).concurrency_limit/concurrency_limitsand resume.--forceflag or env var. A field is resume-mutable iff changing it touches neither sample data nor any persisted artifact. Everything affecting on-disk content stays strict — sampling/seeds,max_iterations,shard_samples,record_*,max_retries(the failure signal in FAILED records),profile_*,detect_anomalies*,dump_progress/progress_dump_interval, anddeterministic;infer_plans.yamlstays byte-for-byte strict. A classification-completeness test guards that everyTaskRunnerConfigfield is bucketed (throughput / strict / non-match), so a future field added without classification fails CI.concurrency_limit(s), the top-levelrunner_configdefaults block, per-taskrunner_config, andmodels.*.args.concurrency_limit.effective_config.yamlbody is rewritten with the new values and the header gains an appended# Resumed by sieval <ver> at <T>:record of exactly which fields changed — original provenance is preserved and the lineage accumulates across resumes. A tampered persisted file that no longer parses to a YAML mapping aborts as the sameResume abortedRuntimeError rather than an opaque error.Test Plan
Automated
ruff check && ruff format --check)ty check)pdm run pytest) — 353 passing acrosstests/unit/cli/leaderboard/+tests/integration/resume/, including theTestStrictResumeMatchresume cases and 6 helper test classes (TestRunnerFieldClassification,TestStripNoncomparableFields,TestSplitHeader,TestDiffDicts,TestDiffLines,TestAppendResumeNote).Manual
TestStrictResumeMatchexercises resuming with changed top-level / top-level-runner_config/ per-task / per-model concurrency and console-progress (succeeds, body updated, resume note appended + accumulated across resumes), and aborts on a result-affecting (deterministic) or layout-affecting (shard_samples,max_retries,profile_*) change, and on a non-mapping persisted file. No manual run required.Checklist
Required (all PRs)
type(scope): description)AI-Generated Code - <model> (<provider>)in module docstringcore/(changes confined tocli/)_strip_header,_brief_diff, and_diff_dictsare preserved as thin delegating wrappers (over_split_header/_diff_lines); no call sites broken