feat(session): tolerate throughput-only config diffs on --resume#4

Open
ethan-scitix wants to merge 10 commits into main from feat/resume-throughput-relaxation

ethan-scitix commented Jun 17, 2026

Collaborator

Type

  • feature — new benchmark, task, or capability

Summary

  • Relaxes the --resume strict-match guard so a resume no longer aborts when only pure scheduling / console-progress knobs differ from the persisted effective_config.yaml — concurrency (concurrency_limit / concurrency_limits), shard read/write concurrency, write-buffer sizing/flush, and console progress (show_progress, log intervals).
  • Why: a run that partially fails because the inference service was over-subscribed (rate-limit / OOM) could previously be recovered only by starting completely fresh (discarding all completed samples) or by reverting the concurrency change so the config matched again (re-triggering the same failure). Now you can lower concurrency_limit / concurrency_limits and resume, keeping the completed samples.
  • The match scope is narrowed, not bypassed: no --force flag or env var. A field is resume-mutable iff changing it touches neither sample data nor any persisted artifact. Everything affecting on-disk content stays strict — sampling/seeds, max_iterations, shard_samples, record_*, max_retries (the failure signal in FAILED records), profile_*, detect_anomalies*, dump_progress / progress_dump_interval, and deterministic; infer_plans.yaml stays byte-for-byte strict. A classification-completeness test guards that every TaskRunnerConfig field is bucketed (throughput / strict / non-match), so a future field added without classification fails CI.
  • Throughput knobs are recognized wherever they live — top-level concurrency_limit(s), the top-level runner_config defaults block, per-task runner_config, and models.*.args.concurrency_limit.
  • On a tolerated diff, the effective_config.yaml body is rewritten with the new values and the header gains an appended # Resumed by sieval <ver> at <T>: record of exactly which fields changed — original provenance is preserved and the lineage accumulates across resumes. A tampered persisted file that no longer parses to a YAML mapping aborts as the same Resume aborted RuntimeError rather than an opaque error.
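
The narrowed match described above can be pictured as "strip the throughput-only keys from both configs, then compare what is left strictly". The following is a minimal sketch of that idea, not the PR's actual code: `THROUGHPUT_KEYS`, `strip_throughput`, and `resume_allowed` are hypothetical names, and the key set is an assumed subset (the real classification lives in the PR and covers more fields, such as shard I/O and write-buffer knobs whose exact names are not given here).

```python
import copy

# Assumed subset of resume-mutable keys, for illustration only.
THROUGHPUT_KEYS = frozenset({"concurrency_limit", "concurrency_limits", "show_progress"})


def _drop_throughput(block):
    """Remove throughput-only keys from one mapping in place (no-op otherwise)."""
    if isinstance(block, dict):
        for key in THROUGHPUT_KEYS:
            block.pop(key, None)


def strip_throughput(config: dict) -> dict:
    """Return a deep copy of config without throughput-only or location fields."""
    out = copy.deepcopy(config)
    _drop_throughput(out)              # top-level concurrency_limit(s)
    out.pop("result_dir", None)        # location only, never affects results
    _drop_throughput(out.get("runner_config"))  # top-level defaults block
    tasks = out.get("tasks")
    if isinstance(tasks, dict):
        for task in tasks.values():    # per-task runner_config
            if isinstance(task, dict):
                _drop_throughput(task.get("runner_config"))
    models = out.get("models")
    if isinstance(models, dict):
        for model in models.values():  # models.*.args.concurrency_limit
            args = model.get("args") if isinstance(model, dict) else None
            if isinstance(args, dict):
                args.pop("concurrency_limit", None)
    return out


def resume_allowed(persisted, current) -> bool:
    """True when the configs differ only in throughput-only fields."""
    if not isinstance(persisted, dict):
        # A tampered file surfaces as the same RuntimeError, not AttributeError.
        raise RuntimeError("Resume aborted: persisted config is not a YAML mapping")
    return strip_throughput(persisted) == strip_throughput(current)
```

Under these assumptions, lowering `concurrency_limit` in any of the four locations leaves `resume_allowed` true, while changing `shard_samples` or a model's sampling arguments makes it false.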

Test Plan

Automated

  • Lint/format clean (ruff check && ruff format --check)
  • Type check clean (ty check)
  • Unit + integration tests pass (pdm run pytest) — 353 passing across tests/unit/cli/leaderboard/ + tests/integration/resume/, including the TestStrictResumeMatch resume cases and 6 helper test classes (TestRunnerFieldClassification, TestStripNoncomparableFields, TestSplitHeader, TestDiffDicts, TestDiffLines, TestAppendResumeNote).
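
As a rough illustration of what a classification-completeness check like TestRunnerFieldClassification might assert, here is a small sketch. It assumes the config is a dataclass, which the PR does not state; `RunnerConfigSketch` is a hypothetical stand-in for TaskRunnerConfig with only a handful of fields, and the three bucket sets and `classification_problems` are illustrative names.

```python
from dataclasses import dataclass, fields


@dataclass
class RunnerConfigSketch:
    """Hypothetical stand-in for TaskRunnerConfig (assumed field subset)."""
    concurrency_limit: int = 8
    show_progress: bool = True
    max_retries: int = 3
    shard_samples: int = 1000
    deterministic: bool = False
    result_dir: str = ""


# Assumed buckets; every field must appear in exactly one of them.
THROUGHPUT = frozenset({"concurrency_limit", "show_progress"})
STRICT = frozenset({"max_retries", "shard_samples", "deterministic"})
NONMATCH = frozenset({"result_dir"})


def classification_problems(config_cls) -> list[str]:
    """List unclassified, stale, or doubly classified field names."""
    names = {f.name for f in fields(config_cls)}
    buckets = {"throughput": THROUGHPUT, "strict": STRICT, "nonmatch": NONMATCH}
    classified = set().union(*buckets.values())
    problems = [f"unclassified field: {n}" for n in sorted(names - classified)]
    problems += [f"stale classification: {n}" for n in sorted(classified - names)]
    seen: dict[str, str] = {}
    for bucket, keys in buckets.items():
        for key in sorted(keys):
            if key in seen:
                problems.append(f"{key} is in both {seen[key]} and {bucket}")
            seen[key] = bucket
    return problems


def test_every_field_is_classified():
    # Adding a field to the config without bucketing it makes this fail.
    assert classification_problems(RunnerConfigSketch) == []
```

The point of the design is that the default for a new field is a CI failure rather than a silent choice between "strict" and "mutable".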

Manual

  • Covered by automated tests — TestStrictResumeMatch exercises resuming with changed top-level / top-level-runner_config / per-task / per-model concurrency and console-progress (succeeds, body updated, resume note appended + accumulated across resumes), and aborts on a result-affecting (deterministic) or layout-affecting (shard_samples, max_retries, profile_*) change, and on a non-mapping persisted file. No manual run required.

Checklist

Required (all PRs)

  • PR title follows conventional format (type(scope): description)
  • No internal paths, credentials, or personal info in committed files
  • AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
  • No new upper-layer dependencies added to core/ (changes confined to cli/)
  • Deleted code verified — _strip_header, _brief_diff, and _diff_dicts are preserved as thin delegating wrappers (over _split_header / _diff_lines); no call sites broken

ethan-scitix and others added 9 commits June 25, 2026 14:37
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Effective-config strict-match now strips throughput/orchestration knobs
(concurrency, retries, buffering, profiling, progress) before comparison
and rewrites the body with new values while preserving the original
header. Result- and disk-layout-affecting fields stay strict; infer_plans
stays byte-for-byte strict.

Also extends _strip_throughput_fields to strip top-level result_dir
(a nonmatch field — the resume target's own location, never affects
what results are produced).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rable_fields

Also strips top-level result_dir (non-comparable location), so the name now
reflects scope; document the result_dir-strip dependency in two resume tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
max_retries, profile_*, detect_anomalies*, and the progress.json dump
(dump_progress / progress_dump_interval) write or change on-disk content
— the failure-signal FAILED record, the profiler summary, the anomaly
report, the progress file — so they must match on resume. The adjustable
set is now only pure scheduling (concurrency, shard-I/O parallelism,
write-buffer timing) plus console-only progress (show_progress, log
cadence), which never touch disk.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three gaps in the --resume throughput carve-out, found in review:

- Strip throughput keys from the top-level `runner_config` defaults block,
  not just `tasks.*.runner_config`. Real leaderboard configs set
  `concurrency_limits` there, so bumping it on resume wrongly aborted —
  the carve-out only covered the per-task and top-level-scalar forms.
- Abort cleanly when a tampered persisted file parses to a non-mapping:
  guard `isinstance(dict)` so the strip can't raise a bare AttributeError,
  keeping RuntimeError the only failure the caller observes.
- On a tolerated rewrite, append a timestamped `Resumed by ...` record of
  the changed fields to the header instead of silently keeping the old one.
  Origin provenance is preserved and the lineage accumulates across resumes;
  result_dir (a reification-injected location field) is excluded as noise.

Extract `_diff_lines` from `_diff_dicts` and add `_append_resume_note`
(inserts inside the border pair so `_split_header` stays correct).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
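
To illustrate why the note is inserted inside the border pair, here is a minimal sketch under assumed conventions. The persisted header layout is not shown in this PR, so the `BORDER` line, the "Generated by" line in the usage note, and both function bodies are hypothetical; only the `# Resumed by sieval <ver> at <T>:` wording comes from the description above.

```python
from datetime import datetime, timezone

BORDER = "# " + "=" * 40  # assumed border line delimiting the header


def split_header(text: str) -> tuple[str, str]:
    """Split text into (header, body); header spans the first border pair."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip("\n") != BORDER:
        return "", text  # header-less file: everything is body
    for i in range(1, len(lines)):
        if lines[i].rstrip("\n") == BORDER:
            return "".join(lines[: i + 1]), "".join(lines[i + 1:])
    return "", text  # unterminated header: treat as header-less


def append_resume_note(header: str, version: str, changes: list[str], now=None) -> str:
    """Insert a resume record just before the closing border line."""
    if not header or not changes:
        return header  # header-less or formatting-only rewrite: no note
    stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    note = [f"# Resumed by sieval {version} at {stamp}:\n"]
    note += [f"#   {change}\n" for change in changes]
    lines = header.splitlines(keepends=True)
    # Keep every earlier line (origin provenance and older notes) untouched.
    return "".join(lines[:-1] + note + lines[-1:])
```

Because the note lands before the closing border, a later `split_header` call still returns the whole header, including earlier notes, so records accumulate across resumes. Appending after the closing border would instead push the note into the body.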
ethan-scitix force-pushed the feat/resume-throughput-relaxation branch from 5aef482 to 7df5eff on June 25, 2026 06:41
Clarify the resume strict-match annotations and condense wordy/duplicated
docstrings around the throughput carve-out. Comment-only — no behavior change.

- _NONMATCH_RUNNER_KEYS: replace the overstated "Never compared" note with
  what the strip actually enforces (only top-level result_dir is dropped; the
  rest are never reached because they don't survive into a persisted
  runner_config block) and the strict-compare edge if one is hand-authored.
- _strip_noncomparable_fields docstring: drop the two-runner_config-locations
  tail that duplicated the adjacent inline comment; keep the "merged into
  every task" nuance once, at the code site.
- _append_resume_note docstring: tighten lineage/precondition wording.
- Persist guard comments: explain the header-less / formatting-only no-note
  cases, and trim the non-mapping guard's restated RuntimeError contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>