feat(session): tolerate throughput-only config diffs on --resume by ethan-scitix · Pull Request #4 · scitix/sieval

ethan-scitix · 2026-06-17T13:50:06Z

Type

feature — new benchmark, task, or capability

Summary

Relaxes the --resume strict-match guard so a resume no longer aborts when only pure scheduling / console-progress knobs differ from the persisted effective_config.yaml — concurrency (concurrency_limit / concurrency_limits), shard read/write concurrency, write-buffer sizing/flush, and console progress (show_progress, log intervals).
Why: a run that partially fails because the inference service was over-subscribed (rate-limit / OOM) could previously only be recovered by starting completely fresh (discarding all completed samples) or reverting the concurrency change (re-triggering the same failure). Now you lower concurrency_limit / concurrency_limits and resume.
The match scope is narrowed, not bypassed: no --force flag or env var. A field is resume-mutable iff changing it touches neither sample data nor any persisted artifact. Everything affecting on-disk content stays strict — sampling/seeds, max_iterations, shard_samples, record_*, max_retries (the failure signal in FAILED records), profile_*, detect_anomalies*, dump_progress / progress_dump_interval, and deterministic; infer_plans.yaml stays byte-for-byte strict. A classification-completeness test guards that every TaskRunnerConfig field is bucketed (throughput / strict / non-match), so a future field added without classification fails CI.
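The classification-completeness guard described above can be sketched roughly as follows. This is a minimal illustration, not the project's test: `RunnerConfigSketch`, the three bucket constants, and both helper functions are hypothetical names. The sketch assumes the config class is a dataclass, and its bucket membership is a small illustrative subset of the fields named in this PR.

```python
from dataclasses import dataclass, fields


# Hypothetical stand-in for TaskRunnerConfig (assumed here to be a dataclass).
@dataclass
class RunnerConfigSketch:
    concurrency_limit: int = 8
    show_progress: bool = True
    max_retries: int = 3
    shard_samples: int = 1000
    deterministic: bool = False
    result_dir: str = ""


# Illustrative buckets; the real policy covers many more fields.
THROUGHPUT_KEYS = frozenset({"concurrency_limit", "show_progress"})
STRICT_KEYS = frozenset({"max_retries", "shard_samples", "deterministic"})
NONMATCH_KEYS = frozenset({"result_dir"})


def unclassified_fields(config_cls) -> set[str]:
    """Return declared fields that appear in no bucket."""
    declared = {f.name for f in fields(config_cls)}
    return declared - (THROUGHPUT_KEYS | STRICT_KEYS | NONMATCH_KEYS)


def overlapping_fields() -> set[str]:
    """Return fields that appear in more than one bucket."""
    return (
        (THROUGHPUT_KEYS & STRICT_KEYS)
        | (THROUGHPUT_KEYS & NONMATCH_KEYS)
        | (STRICT_KEYS & NONMATCH_KEYS)
    )
```

A CI test would assert that both helpers return an empty set, so a field added to the config class without a bucket shows up by name in the failure.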
Throughput knobs are recognized wherever they live — top-level concurrency_limit(s), the top-level runner_config defaults block, per-task runner_config, and models.*.args.concurrency_limit.
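As a rough sketch of how the four locations above might be normalized before comparison: the function below drops throughput keys from each place and leaves everything else for the strict comparison. The names `strip_throughput`, `TOP_LEVEL_KEYS`, and `RUNNER_KEYS` are hypothetical, the key sets are an illustrative subset, and `tasks` / `models` are assumed to be mappings keyed by name.

```python
import copy
from typing import Any

# Illustrative subsets of the throughput knobs named in this PR.
TOP_LEVEL_KEYS = frozenset({"concurrency_limit", "concurrency_limits"})
RUNNER_KEYS = frozenset({"concurrency_limit", "concurrency_limits", "show_progress"})


def _drop(block: Any, keys: frozenset) -> None:
    """Remove the given keys in place if the block is a mapping."""
    if isinstance(block, dict):
        for key in keys:
            block.pop(key, None)


def strip_throughput(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of config with throughput knobs removed everywhere."""
    stripped = copy.deepcopy(config)  # never mutate the caller's config
    _drop(stripped, TOP_LEVEL_KEYS)  # top-level concurrency_limit(s)
    _drop(stripped.get("runner_config"), RUNNER_KEYS)  # top-level defaults block
    tasks = stripped.get("tasks")
    if isinstance(tasks, dict):
        for task in tasks.values():  # per-task runner_config
            if isinstance(task, dict):
                _drop(task.get("runner_config"), RUNNER_KEYS)
    models = stripped.get("models")
    if isinstance(models, dict):
        for model in models.values():  # models.*.args.concurrency_limit
            if isinstance(model, dict):
                _drop(model.get("args"), frozenset({"concurrency_limit"}))
    return stripped
```

Two configs then count as a tolerated diff when `strip_throughput(persisted) == strip_throughput(current)`; any remaining difference is a strict mismatch.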
On a tolerated diff, the effective_config.yaml body is rewritten with the new values, and a # Resumed by sieval <ver> at <T>: record listing exactly which fields changed is appended to the header. Original provenance is preserved and the lineage accumulates across resumes. A tampered persisted file that no longer parses to a YAML mapping aborts with the same Resume aborted RuntimeError rather than a bare AttributeError.
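The header update can be illustrated with a small sketch. The real header layout is not shown in this PR, so the `BORDER` line, the `#   ` indentation of changed fields, and the function name `append_resume_note` are all assumptions; only the `# Resumed by sieval <ver> at <T>:` line comes from the description above.

```python
# Assumed border line that opens and closes the header block.
BORDER = "# " + "=" * 60


def append_resume_note(
    header: str, version: str, timestamp: str, changes: list[str]
) -> str:
    """Insert a resume record just before the closing border of the header.

    Returns the header unchanged when there is no border pair (header-less
    file) or nothing changed, so no note is written in those cases.
    """
    lines = header.splitlines()
    borders = [i for i, line in enumerate(lines) if line == BORDER]
    if len(borders) < 2 or not changes:
        return header
    note = [f"# Resumed by sieval {version} at {timestamp}:"]
    note += [f"#   {change}" for change in changes]
    closing = borders[-1]  # keep the note inside the border pair
    return "\n".join(lines[:closing] + note + lines[closing:]) + "\n"
```

Because each note lands immediately before the closing border, earlier lines (including previous resume notes) keep their order, and a header splitter that looks for the border pair still sees a single header block.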

Test Plan

Automated

Lint/format clean (ruff check && ruff format --check)
Type check clean (ty check)
Unit + integration tests pass (pdm run pytest) — 353 passing across tests/unit/cli/leaderboard/ + tests/integration/resume/, including the TestStrictResumeMatch resume cases and 6 helper test classes (TestRunnerFieldClassification, TestStripNoncomparableFields, TestSplitHeader, TestDiffDicts, TestDiffLines, TestAppendResumeNote).

Manual

Covered by automated tests — TestStrictResumeMatch exercises resuming with changed top-level / top-level-runner_config / per-task / per-model concurrency and console-progress (succeeds, body updated, resume note appended + accumulated across resumes), and aborts on a result-affecting (deterministic) or layout-affecting (shard_samples, max_retries, profile_*) change, and on a non-mapping persisted file. No manual run required.

Checklist

Required (all PRs)

PR title follows conventional format (type(scope): description)
No internal paths, credentials, or personal info in committed files
AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
No new upper-layer dependencies added to core/ (changes confined to cli/)
Deleted code verified — _strip_header, _brief_diff, and _diff_dicts are preserved as thin delegating wrappers (over _split_header / _diff_lines); no call sites broken

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Effective-config strict-match now strips throughput/orchestration knobs (concurrency, retries, buffering, profiling, progress) before comparison and rewrites the body with new values while preserving the original header. Result- and disk-layout-affecting fields stay strict; infer_plans stays byte-for-byte strict. Also extends _strip_throughput_fields to strip top-level result_dir (a nonmatch field: the resume target's own location, which never affects what results are produced).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rable_fields

Also strips top-level result_dir (non-comparable location), so the name now reflects scope; document the result_dir-strip dependency in two resume tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

max_retries, profile_*, detect_anomalies*, and the progress.json dump (dump_progress / progress_dump_interval) write or change on-disk content (the failure-signal FAILED record, the profiler summary, the anomaly report, the progress file), so they must match on resume. The adjustable set is now only pure scheduling (concurrency, shard-I/O parallelism, write-buffer timing) plus console-only progress (show_progress, log cadence), which never touch disk.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Three gaps in the --resume throughput carve-out, found in review:

- Strip throughput keys from the top-level `runner_config` defaults block, not just `tasks.*.runner_config`. Real leaderboard configs set `concurrency_limits` there, so bumping it on resume wrongly aborted; the carve-out only covered the per-task and top-level-scalar forms.
- Abort cleanly when a tampered persisted file parses to a non-mapping: guard `isinstance(dict)` so the strip can't raise a bare AttributeError, keeping RuntimeError the only failure the caller observes.
- On a tolerated rewrite, append a timestamped `Resumed by ...` record of the changed fields to the header instead of silently keeping the old one. Origin provenance is preserved and the lineage accumulates across resumes; result_dir (a reification-injected location field) is excluded as noise.

Extract `_diff_lines` from `_diff_dicts` and add `_append_resume_note` (inserts inside the border pair so `_split_header` stays correct).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
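The non-mapping guard and the changed-field listing from this commit could look roughly like the sketch below. It is written under assumptions: `require_mapping` and `diff_lines` are hypothetical names, the `old -> new` line format and the error message wording are invented for illustration, and the input is taken to be an already-parsed YAML object so the sketch stays free of third-party imports.

```python
from typing import Any


class _Missing:
    """Sentinel for a key present on only one side of the diff."""

    def __repr__(self) -> str:
        return "<missing>"


_MISSING = _Missing()


def require_mapping(parsed: Any, path: str) -> dict:
    """Fail with RuntimeError (never AttributeError) on a non-mapping file."""
    if not isinstance(parsed, dict):
        raise RuntimeError(
            f"Resume aborted: {path} is not a YAML mapping "
            f"(got {type(parsed).__name__})"
        )
    return parsed


def diff_lines(
    old: dict, new: dict, prefix: str = "", ignore: frozenset = frozenset({"result_dir"})
) -> list[str]:
    """List changed leaf fields as 'dotted.path: old -> new' lines."""
    lines: list[str] = []
    for key in sorted(set(old) | set(new), key=str):
        if not prefix and key in ignore:
            continue  # top-level location field is excluded as noise
        path = f"{prefix}.{key}" if prefix else str(key)
        a, b = old.get(key, _MISSING), new.get(key, _MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            lines += diff_lines(a, b, path, ignore)
        elif a != b:
            lines.append(f"{path}: {a!r} -> {b!r}")
    return lines
```

Running the guard before any stripping means the caller only ever has to handle RuntimeError, which matches the contract described in the second bullet.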

Clarify the resume strict-match annotations and condense wordy/duplicated docstrings around the throughput carve-out. Comment-only; no behavior change.

- _NONMATCH_RUNNER_KEYS: replace the overstated "Never compared" note with what the strip actually enforces (only top-level result_dir is dropped; the rest are never reached because they don't survive into a persisted runner_config block) and the strict-compare edge if one is hand-authored.
- _strip_noncomparable_fields docstring: drop the two-runner_config-locations tail that duplicated the adjacent inline comment; keep the "merged into every task" nuance once, at the code site.
- _append_resume_note docstring: tighten lineage/precondition wording.
- Persist guard comments: explain the header-less / formatting-only no-note cases, and trim the non-mapping guard's restated RuntimeError contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ethan-scitix and others added 9 commits June 25, 2026 14:37

refactor(session): add _split_header, reduce _strip_header to delegate

704832b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refactor(session): extract _diff_dicts core from _brief_diff

0fd4075

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(session): add throughput field policy + _strip_throughput_fields

6318719

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: document --resume throughput carve-out in reproducibility clause

d4114f2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(session): condense resume-policy comments and docstrings

e243200

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ethan-scitix force-pushed the feat/resume-throughput-relaxation branch from 5aef482 to 7df5eff Compare June 25, 2026 06:41

ethan-scitix wants to merge 10 commits into main from feat/resume-throughput-relaxation
