r2e_gym: add hide_tests_from_agent flag + surface instance_id/repo in validate() output #1208
Open
…ases

Two small R2E-Gym quality-of-life improvements surfaced while running the full 4578-row gold-patch validation:

* `hide_tests_from_agent: bool = True` controls whether `setup()` stashes `/r2e_tests` on the host (current behavior, agent-safe) or leaves it in-sandbox via an in-place `mv /r2e_tests /testbed/r2e_tests` (new fast-path). The roundtrip cost ~2-3 min per row, dominated by LFS upload/download pinning the sandbox client's thread pool for the entire transfer. For validate/no-agent flows, the fast-path cuts per-row wall time roughly 4x (300s → 70s median for cached images) and unblocks the thread-pool bottleneck. Hiding the tests is required for fair agent rollouts (keep the default `True`) and strictly optional otherwise.
* `_process_example` now sets `info["instance_id"] = commit_hash` and `info["repo"] = repo_name` as aliases, so `TaskSet.validate()`'s JSONL output surfaces meaningful identifiers for R2E-Gym rows (previously both fields were `None` because `validate()` looks for generic names). No change to the R2E-Gym dataset schema or any consumer that reads the original field names.

Validated by a full 4578-row gold-patch run with `hide_tests_from_agent=False`: 4500/4578 pass (98.30%), no setup regressions, throughput ~40 rows/min (vs ~2.4 rows/min baseline).
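The two staging paths described above can be sketched roughly as follows. This is a minimal illustration, not the code in `r2e_gym.py`: `FakeSandbox`, `stage_tests`, `restore_tests`, the archive path, and the exact `tar` invocations are all assumptions made for the example. Only the `mv /r2e_tests /testbed/r2e_tests` command and the tar/download/delete ordering come from the description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeSandbox:
    """Hypothetical stand-in for a sandbox client; it only records calls."""

    calls: list = field(default_factory=list)

    def run(self, cmd: str) -> None:
        self.calls.append(("run", cmd))

    def download_file(self, path: str) -> bytes:
        self.calls.append(("download_file", path))
        return b"archive-bytes"

    def upload_file(self, path: str, data: bytes) -> None:
        self.calls.append(("upload_file", path))


ARCHIVE = "/tmp/r2e_tests.tar.gz"  # assumed path, not taken from the PR


def stage_tests(sandbox: FakeSandbox, hide_tests_from_agent: bool = True) -> Optional[bytes]:
    """Return the cached archive on the slow path, or None on the fast path."""
    if hide_tests_from_agent:
        # Slow path: tar, download to the host, then delete in-sandbox.
        sandbox.run(f"tar -czf {ARCHIVE} -C / r2e_tests")
        archive = sandbox.download_file(ARCHIVE)
        sandbox.run(f"rm -rf /r2e_tests {ARCHIVE}")
        return archive
    # Fast path: a single in-place move; the tests stay inside the sandbox.
    sandbox.run("mv /r2e_tests /testbed/r2e_tests")
    return None


def restore_tests(sandbox: FakeSandbox, archive: Optional[bytes]) -> bool:
    """Restore hidden tests before scoring; skip when nothing was cached."""
    if archive is None:
        return False
    sandbox.upload_file(ARCHIVE, archive)
    sandbox.run(f"tar -xzf {ARCHIVE} -C /testbed")
    return True
```

In this sketch the fast path issues one shell command and no file transfers, which is why it avoids holding a client worker for the duration of an upload or download.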
Summary
Two quality-of-life improvements to `R2EGymTaskSet` surfaced while running the full 4578-row gold-patch validation:

1. `hide_tests_from_agent: bool = True`

Controls whether `setup()` stashes `/r2e_tests` on the host (current behavior, agent-safe) or leaves it in-sandbox via an in-place `mv /r2e_tests /testbed/r2e_tests` (new fast-path).

Why it matters. The current roundtrip is ~2-3 min per row, dominated by the tar → `download_file` → `rm` in setup plus the `upload_file` → extract in `_run_tests`. Each of those blocks a `ThreadedAsyncSandboxClient` worker thread for the entire transfer, and `TaskSet.validate()` caps worker count at `min(concurrency // 8, 50)`, so 50 concurrent validate tasks serialize through 6 workers, each pinned on I/O. Under concurrency=50 we measured ~2.4 rows/min; after enabling the fast-path + patching the worker cap we hit ~40 rows/min.

Keep the default `True` for agent rollouts (the hiding is the whole point: R2E-Gym's gold-truth tests can't be visible to the agent). Set `False` when the caller knows no agent is involved (e.g. `TaskSet.validate()` for gold-patch validation / dataset cleanup).

2. Expose `instance_id` / `repo` aliases in `info`

R2E-Gym rows natively use `commit_hash` and `repo_name`. `TaskSet.validate()` looks for generic `info["instance_id"]` and `info["repo"]` to populate each JSONL row's identity fields. With R2E-Gym those were always `None`, so the streaming log was unreadable. `_process_example` now sets both aliases via `setdefault` so downstream code that reads the original names keeps working.

Changes

`verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py`:

* `__init__`: new `hide_tests_from_agent: bool = True` kwarg, stored on self
* `setup()`: branches on the flag; fast-path does a single `mv`, slow-path keeps the existing tar/download/delete roundtrip
* `_run_tests()`: restoration step now conditional on a cached archive; fast-path skips it (tests already in `/testbed/r2e_tests` from setup)
* `_process_example`: adds `instance_id` and `repo` aliases pointing at `commit_hash` / `repo_name`

Validation
Ran the full 4578-row gold-patch validation (`R2EGymTaskSet(hide_tests_from_agent=False)`) plus a retry pass on 15 pillow sandbox_errors:

Per-row median elapsed dropped from ~300s (slow-path) to ~60s for cached images (fast-path), and the control set of 30 random passing rows showed zero regressions.
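As a sanity check, the headline figures quoted in this description can be reproduced with a few lines of arithmetic. The helper names are hypothetical, and the wall-clock projections assume the quoted throughput holds steady across the whole run, which the description does not claim.

```python
def worker_cap(concurrency: int) -> int:
    """Worker cap as quoted in the description: min(concurrency // 8, 50)."""
    return min(concurrency // 8, 50)


def hours_for(rows: int, rows_per_min: float) -> float:
    """Projected wall time in hours at a constant throughput (an assumption)."""
    return rows / rows_per_min / 60


ROWS = 4578

print(worker_cap(50))                   # 6
print(f"{4500 / ROWS:.2%}")             # 98.30%
print(f"{hours_for(ROWS, 2.4):.1f} h")  # 31.8 h
print(f"{hours_for(ROWS, 40):.1f} h")   # 1.9 h
print(f"{40 / 2.4:.1f}x")               # 16.7x
```

Under that constant-throughput assumption, a full pass goes from more than a day to about two hours.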
The remaining 62 `test_failed` are a mix of real gold-patch regressions (~7-10) and dataset-drift artifacts where our env makes tests pass that `expected_output_json` had marked `FAILED` (~5-10). These are left for follow-up work and are not in scope here.

Test plan

`hide_tests_from_agent=False`

🤖 Generated with Claude Code
Note
Medium Risk
Changes how test artifacts are moved between host and sandbox during setup/scoring, which could impact correctness or leak-prevention if misconfigured, though the default preserves current agent-safe behavior.
Overview
Adds a `hide_tests_from_agent` flag to `R2EGymTaskSet` to choose between the existing agent-safe test staging flow (tar `/r2e_tests` to host, delete in-sandbox, restore on scoring) and a new fast-path for non-agent runs that keeps tests in-sandbox via `mv /r2e_tests /testbed/r2e_tests`.

Also updates `_run_tests()` to only restore tests when a cached archive exists, and extends `_process_example()` to include `info["instance_id"]` and `info["repo"]` aliases so `TaskSet.validate()` outputs have stable identifiers.

Reviewed by Cursor Bugbot for commit ab65f32. Bugbot is set up for automated code reviews on this repo.
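The aliasing in `_process_example` can be illustrated with a small sketch. The function name `add_validate_aliases` and the sample values are hypothetical; only the four key names and the use of `setdefault` come from the PR description.

```python
def add_validate_aliases(info: dict) -> dict:
    """Add generic identity keys without touching the R2E-Gym originals."""
    # setdefault only writes when the key is absent, so a value that is
    # already present is never overwritten.
    info.setdefault("instance_id", info.get("commit_hash"))
    info.setdefault("repo", info.get("repo_name"))
    return info


# Illustrative values, not taken from the dataset.
row = add_validate_aliases({"commit_hash": "abc123", "repo_name": "pillow"})
print(row["instance_id"], row["repo"])
```

Because the original `commit_hash` and `repo_name` keys are left in place, consumers that read them are unaffected.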