r2e_gym: add hide_tests_from_agent flag + surface instance_id/repo in validate() output by hallerite · Pull Request #1208 · PrimeIntellect-ai/verifiers

hallerite · 2026-04-20T15:37:05Z

Summary

Two quality-of-life improvements to R2EGymTaskSet surfaced while running the full 4578-row gold-patch validation:

1. `hide_tests_from_agent: bool = True`

Controls whether setup() stashes /r2e_tests on the host (current behavior, agent-safe) or leaves it in-sandbox via an in-place mv /r2e_tests /testbed/r2e_tests (new fast-path).

Why it matters. The current roundtrip costs ~2-3 min per row, dominated by the tar → download_file → rm in setup plus the upload_file → extract in _run_tests. Each of those blocks a ThreadedAsyncSandboxClient worker thread for the entire transfer, and TaskSet.validate() caps the worker count at min(concurrency // 8, 50), so 50 concurrent validate tasks serialize through 6 workers, each pinned on I/O. With concurrency=50 we measured ~2.4 rows/min; after enabling the fast-path and patching the worker cap we hit ~40 rows/min.
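The worker-cap arithmetic above can be checked directly. This is a minimal sketch that assumes the cap is exactly `min(concurrency // 8, 50)` as described; the helper name `worker_cap` is hypothetical and not part of verifiers.

```python
def worker_cap(concurrency: int) -> int:
    """Hypothetical helper mirroring the cap described above."""
    return min(concurrency // 8, 50)


# With concurrency=50, the 50 validate tasks share only 6 worker threads.
workers = worker_cap(50)
tasks_per_worker = 50 / workers

# Ratio between the two measured throughputs quoted above (rows/min).
throughput_ratio = 40 / 2.4

print(f"workers={workers}, tasks per worker={tasks_per_worker:.1f}")
print(f"throughput ratio={throughput_ratio:.1f}x")
```

The ratio combines both changes (fast-path plus the patched worker cap), so it should not be attributed to the flag alone.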

Keep the default True for agent rollouts: hiding the tests is the whole point, since R2E-Gym's ground-truth tests must not be visible to the agent. Set it to False only when the caller knows no agent is involved (e.g. TaskSet.validate() for gold-patch validation or dataset cleanup).

2. Expose `instance_id` / `repo` aliases in `info`

R2E-Gym rows natively use commit_hash and repo_name. TaskSet.validate() looks for the generic info["instance_id"] and info["repo"] keys to populate each JSONL row's identity fields. With R2E-Gym those were always None, so rows in the streaming log could not be told apart. _process_example now sets both aliases via setdefault and leaves the original keys in place, so downstream code that reads the original names keeps working.
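The aliasing step can be sketched as follows. The function name `add_identity_aliases` is a hypothetical stand-in for the relevant part of `_process_example`, and the sample row values are made up for illustration.

```python
def add_identity_aliases(info: dict) -> dict:
    """Hypothetical stand-in for the aliasing done in _process_example."""
    # setdefault only writes when the key is absent, so an existing
    # instance_id / repo value is never overwritten.
    info.setdefault("instance_id", info.get("commit_hash"))
    info.setdefault("repo", info.get("repo_name"))
    return info


# Illustrative row; the values are invented, only the key names come from the PR.
row = {"commit_hash": "abc123", "repo_name": "pillow"}
aliased = add_identity_aliases(dict(row))
print(aliased)
```

Using `setdefault` rather than plain assignment keeps the step safe to run on rows that already carry the generic keys.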

Changes

verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py:

- __init__: new hide_tests_from_agent: bool = True kwarg, stored on self
- setup(): branches on the flag; the fast-path does a single mv, the slow-path keeps the existing tar/download/delete roundtrip
- _run_tests(): the restoration step is now conditional on a cached archive; the fast-path skips it (tests are already in /testbed/r2e_tests from setup)
- _process_example: adds instance_id and repo aliases pointing at commit_hash / repo_name
- Docstrings updated to explain the tradeoff
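The control flow of the setup() and _run_tests() changes can be sketched with an in-memory stand-in. Everything here is a simplification under stated assumptions: `FakeSandbox` and `ToyTaskSet` are hypothetical, the methods are synchronous, and the archive is cached on the task set itself, none of which is claimed to match the real R2EGymTaskSet or sandbox client.

```python
from typing import Optional


class FakeSandbox:
    """In-memory stand-in for a sandbox filesystem (hypothetical)."""

    def __init__(self) -> None:
        self.paths = {"/r2e_tests": b"test files"}

    def mv(self, src: str, dst: str) -> None:
        self.paths[dst] = self.paths.pop(src)

    def download_and_delete(self, src: str) -> bytes:
        # Stands in for the tar -> download_file -> rm roundtrip.
        return self.paths.pop(src)

    def upload(self, dst: str, data: bytes) -> None:
        # Stands in for upload_file -> extract.
        self.paths[dst] = data


class ToyTaskSet:
    """Toy model of the flag's effect, not the real implementation."""

    def __init__(self, hide_tests_from_agent: bool = True) -> None:
        self.hide_tests_from_agent = hide_tests_from_agent
        self._archive: Optional[bytes] = None  # host-side copy of the tests

    def setup(self, sandbox: FakeSandbox) -> None:
        if self.hide_tests_from_agent:
            # Slow path: tests leave the sandbox until scoring.
            self._archive = sandbox.download_and_delete("/r2e_tests")
        else:
            # Fast path: a single in-sandbox move.
            sandbox.mv("/r2e_tests", "/testbed/r2e_tests")

    def run_tests(self, sandbox: FakeSandbox) -> bool:
        # Restore only when setup() cached an archive on the host.
        if self._archive is not None:
            sandbox.upload("/testbed/r2e_tests", self._archive)
        return "/testbed/r2e_tests" in sandbox.paths


slow_sb, fast_sb = FakeSandbox(), FakeSandbox()
slow, fast = ToyTaskSet(), ToyTaskSet(hide_tests_from_agent=False)
slow.setup(slow_sb)
fast.setup(fast_sb)
tests_hidden_during_rollout = "/testbed/r2e_tests" not in slow_sb.paths
tests_visible_on_fast_path = "/testbed/r2e_tests" in fast_sb.paths
```

In the toy model the slow path leaves nothing test-related in the sandbox between setup and scoring, while the fast path exposes the tests immediately, which is why the default stays True.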

Validation

Ran the full 4578-row gold-patch validation (R2EGymTaskSet(hide_tests_from_agent=False)) plus a retry pass on 15 pillow sandbox_errors:

| Outcome | Count | Rate |
| --- | --- | --- |
| Pass | 4515 | 98.62% |
| test_failed | 62 | 1.35% |
| setup_failed | 1 | 0.02% |
| sandbox_error (post-retry) | 0 | 0% |

Per-row median elapsed dropped from ~300s (slow-path) to ~60s for cached images (fast-path), and the control set of 30 random passing rows showed zero regressions.
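As a quick consistency check, the rates in the results table follow from the counts. The sketch below only recomputes the reported figures; the dictionary keys are shorthand labels chosen for this example.

```python
# Counts as reported in the validation table (post-retry).
counts = {"pass": 4515, "test_failed": 62, "setup_failed": 1, "sandbox_error": 0}

total = sum(counts.values())
rates = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)
print(rates)
```

The counts sum to the full 4578 rows, so every row is accounted for in exactly one outcome.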

The remaining 62 test_failed rows are a mix of real gold-patch regressions (~7-10) and dataset-drift artifacts where our env makes tests pass that expected_output_json had marked FAILED (~5-10). Triaging them is left for follow-up work and is not in scope here.

Test plan

- Gold-patch validation: 4515/4578 pass (98.62%) with hide_tests_from_agent=False
- 30-row passing control set: 30/30 still pass (no setup regressions)
- Pillow sandbox_error retry: 15/15 pass
- CI

🤖 Generated with Claude Code

Note

Medium Risk
Changes how test artifacts are moved between host and sandbox during setup/scoring, which could impact correctness or leak-prevention if misconfigured, though the default preserves current agent-safe behavior.

Overview
Adds a hide_tests_from_agent flag to R2EGymTaskSet to choose between the existing agent-safe test staging flow (tar /r2e_tests to host, delete in-sandbox, restore on scoring) and a new fast-path for non-agent runs that keeps tests in-sandbox via mv /r2e_tests /testbed/r2e_tests.

Also updates _run_tests() to only restore tests when a cached archive exists, and extends _process_example() to include info["instance_id"] and info["repo"] aliases so TaskSet.validate() outputs have stable identifiers.

Reviewed by Cursor Bugbot for commit ab65f32. Bugbot is set up for automated code reviews on this repo.

…ases

Two small R2E-Gym quality-of-life improvements surfaced while running the full 4578-row gold-patch validation:

* `hide_tests_from_agent: bool = True` controls whether `setup()` stashes `/r2e_tests` on the host (current behavior, agent-safe) or leaves it in-sandbox via an in-place `mv /r2e_tests /testbed/r2e_tests` (new fast-path). The roundtrip was a ~2-3 min per-row cost dominated by LFS upload/download pinning the sandbox client's thread pool for the entire transfer. For validate/no-agent flows, the fast-path cuts per-row wall time roughly 4x (300s → 70s median for cached images) and unblocks the thread-pool bottleneck. Hiding the tests is required for fair agent rollouts (keep the default True) and strictly optional otherwise.
* `_process_example` now sets `info["instance_id"] = commit_hash` and `info["repo"] = repo_name` as aliases, so `TaskSet.validate()`'s JSONL output surfaces meaningful identifiers for R2E-Gym rows (previously both fields were None because validate() looks for generic names). No change to the R2E-Gym dataset schema or any consumer that reads the original field names.

Validated by a full 4578-row gold-patch run with `hide_tests_from_agent=False`: 4500/4578 pass (98.30%), no setup regressions, throughput ~40 rows/min (vs ~2.4 rows/min baseline).

hallerite wants to merge 1 commit into main from r2e-gym-fast-validate
