r2e_gym: add hide_tests_from_agent flag + surface instance_id/repo in validate() output #1208

Open
hallerite wants to merge 1 commit into main from r2e-gym-fast-validate
hallerite commented Apr 20, 2026

Summary

Two quality-of-life improvements to R2EGymTaskSet surfaced while running the full 4578-row gold-patch validation:

1. hide_tests_from_agent: bool = True

Controls whether setup() stashes /r2e_tests on the host (current behavior, agent-safe) or leaves it in-sandbox via an in-place mv /r2e_tests /testbed/r2e_tests (new fast-path).

Why it matters. The current roundtrip takes ~2-3 min per row, dominated by the tar → download_file → rm sequence in setup plus the upload_file → extract sequence in _run_tests. Each transfer blocks a ThreadedAsyncSandboxClient worker thread for its entire duration, and TaskSet.validate() caps the worker count at min(concurrency // 8, 50), so 50 concurrent validate tasks serialize through 6 workers, each pinned on I/O. Under concurrency=50 we measured ~2.4 rows/min; after enabling the fast-path and patching the worker cap we hit ~40 rows/min.
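The worker cap and throughput figures above can be checked with a few lines of arithmetic. This is only a sketch: `validate_worker_cap` is a hypothetical helper that restates the formula quoted above rather than the actual code in TaskSet.validate(), and the run-time estimates assume throughput stays constant across the whole run.

```python
def validate_worker_cap(concurrency: int) -> int:
    # Hypothetical helper restating the cap quoted above:
    # min(concurrency // 8, 50).
    return min(concurrency // 8, 50)


TOTAL_ROWS = 4578             # rows in the gold-patch validation run
BASELINE_ROWS_PER_MIN = 2.4   # reported for the slow path at concurrency=50
FAST_ROWS_PER_MIN = 40.0      # reported for the fast path + patched worker cap

workers = validate_worker_cap(50)
speedup = FAST_ROWS_PER_MIN / BASELINE_ROWS_PER_MIN

# Estimated wall time for the full run, assuming constant throughput.
baseline_hours = TOTAL_ROWS / BASELINE_ROWS_PER_MIN / 60
fast_hours = TOTAL_ROWS / FAST_ROWS_PER_MIN / 60

print(f"workers at concurrency=50: {workers}")
print(f"throughput speedup: {speedup:.1f}x")
print(f"estimated full run: {baseline_hours:.1f} h -> {fast_hours:.1f} h")
```

At concurrency=50 the cap yields 6 workers, and the two reported rates correspond to a roughly 16.7x throughput gain.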

Keep the default True for agent rollouts (the hiding is the whole point — R2E-Gym's gold-truth tests can't be visible to the agent). Set False when the caller knows no agent is involved (e.g. TaskSet.validate() for gold-patch validation / dataset cleanup).

2. Expose instance_id / repo aliases in info

R2E-Gym rows natively use commit_hash and repo_name. TaskSet.validate() looks for generic info["instance_id"] and info["repo"] to populate each JSONL row's identity fields — with R2E-Gym those were always None, so the streaming log was unreadable. _process_example now sets both aliases via setdefault so downstream code that reads the original names keeps working.
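The aliasing described above could look roughly like the following. This is a minimal sketch, not the PR's code: `add_identity_aliases` is a hypothetical stand-alone helper (in the PR the logic lives inside _process_example), and the example row is made up.

```python
def add_identity_aliases(info: dict) -> dict:
    # Hypothetical helper; in the PR this logic lives in _process_example.
    # setdefault only writes a key that is missing, so an existing
    # instance_id / repo value is never overwritten, and the original
    # commit_hash / repo_name keys are left untouched.
    info.setdefault("instance_id", info.get("commit_hash"))
    info.setdefault("repo", info.get("repo_name"))
    return info


row = {"commit_hash": "abc123", "repo_name": "pillow"}  # made-up example row
info = add_identity_aliases(dict(row))
print(info["instance_id"], info["repo"])
```

Because the original keys remain in place, consumers that read commit_hash or repo_name see no difference.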

Changes

verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py:

  • __init__: new hide_tests_from_agent: bool = True kwarg, stored on self
  • setup(): branches on the flag — fast-path does a single mv, slow-path keeps the existing tar/download/delete roundtrip
  • _run_tests(): restoration step now conditional on a cached archive; fast-path skips it (tests already in /testbed/r2e_tests from setup)
  • _process_example: adds instance_id and repo aliases pointing at commit_hash / repo_name
  • Docstrings updated to explain the tradeoff
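The control flow listed above can be illustrated with a toy model. Everything here is an assumption made for illustration: `FakeSandbox` and `TaskSetSketch` are hypothetical stand-ins, the sandbox is modelled as an in-memory dict, and the archive is cached on the task set itself for simplicity. The real R2EGymTaskSet talks to a sandbox client and may track this state differently.

```python
class FakeSandbox:
    """Hypothetical in-memory stand-in for a sandbox filesystem."""

    def __init__(self) -> None:
        self.paths = {"/r2e_tests": "test files"}
        self.transfers = 0  # host <-> sandbox transfers performed


class TaskSetSketch:
    """Toy model of the flag's effect; not the real R2EGymTaskSet."""

    def __init__(self, hide_tests_from_agent: bool = True) -> None:
        self.hide_tests_from_agent = hide_tests_from_agent
        self._archive = None  # host-side copy, set only on the slow path

    def setup(self, sandbox: FakeSandbox) -> None:
        if self.hide_tests_from_agent:
            # Slow path: stash the tests on the host and delete them in-sandbox.
            self._archive = sandbox.paths.pop("/r2e_tests")
            sandbox.transfers += 1
        else:
            # Fast path: the equivalent of mv /r2e_tests /testbed/r2e_tests.
            sandbox.paths["/testbed/r2e_tests"] = sandbox.paths.pop("/r2e_tests")

    def run_tests(self, sandbox: FakeSandbox) -> bool:
        if self._archive is not None:
            # Restore only when a cached archive exists (slow path).
            sandbox.paths["/testbed/r2e_tests"] = self._archive
            sandbox.transfers += 1
        return "/testbed/r2e_tests" in sandbox.paths


hidden_sb, fast_sb = FakeSandbox(), FakeSandbox()
hidden, fast = TaskSetSketch(), TaskSetSketch(hide_tests_from_agent=False)
hidden.setup(hidden_sb)
fast.setup(fast_sb)
tests_visible_during_rollout = "/testbed/r2e_tests" in hidden_sb.paths
print(tests_visible_during_rollout, hidden.run_tests(hidden_sb), hidden_sb.transfers)
print(fast.run_tests(fast_sb), fast_sb.transfers)
```

In this model the default path costs two transfers per row and keeps the tests out of the sandbox between setup and scoring, while the fast path costs none but leaves the tests visible throughout.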

Validation

Ran the full 4578-row gold-patch validation (R2EGymTaskSet(hide_tests_from_agent=False)) plus a retry pass on 15 pillow sandbox_errors:

| Outcome | Count | Rate |
|---|---|---|
| Pass | 4515 | 98.62% |
| test_failed | 62 | 1.35% |
| setup_failed | 1 | 0.02% |
| sandbox_error (post-retry) | 0 | 0% |
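As a quick sanity check, the rates in the table can be recomputed from the counts. The snippet below is only an arithmetic sketch over the numbers reported above; the dictionary is not an artifact of the validation run itself.

```python
# Outcome counts as reported in the table above.
counts = {
    "Pass": 4515,
    "test_failed": 62,
    "setup_failed": 1,
    "sandbox_error (post-retry)": 0,
}

total = sum(counts.values())
# Percentage of all rows per outcome, rounded to two decimals.
rates = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```

The counts sum to the full 4578 rows, and the recomputed percentages match the reported rates.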

Per-row median elapsed dropped from ~300s (slow-path) to ~60s for cached images (fast-path), and the control set of 30 random passing rows showed zero regressions.

The remaining 62 test_failed rows include real gold-patch regressions (~7-10) and dataset-drift artifacts (~5-10), where our env makes tests pass that expected_output_json had marked FAILED. Triaging them is left to follow-up work and is out of scope here.

Test plan

  • Gold-patch validation: 4515/4578 pass (98.62%) with hide_tests_from_agent=False
  • 30-row passing control set: 30/30 still pass (no setup regressions)
  • Pillow sandbox_error retry: 15/15 pass
  • CI

🤖 Generated with Claude Code


Note

Medium Risk
Changes how test artifacts are moved between host and sandbox during setup/scoring, which could impact correctness or leak-prevention if misconfigured, though the default preserves current agent-safe behavior.

Overview
Adds a hide_tests_from_agent flag to R2EGymTaskSet to choose between the existing agent-safe test staging flow (tar /r2e_tests to host, delete in-sandbox, restore on scoring) and a new fast-path for non-agent runs that keeps tests in-sandbox via mv /r2e_tests /testbed/r2e_tests.

Also updates _run_tests() to only restore tests when a cached archive exists, and extends _process_example() to include info["instance_id"] and info["repo"] aliases so TaskSet.validate() outputs have stable identifiers.

Reviewed by Cursor Bugbot for commit ab65f32. Bugbot is set up for automated code reviews on this repo.

…ases

Two small R2E-Gym quality-of-life improvements surfaced while running
the full 4578-row gold-patch validation:

* `hide_tests_from_agent: bool = True` — controls whether `setup()`
  stashes `/r2e_tests` on the host (current behavior, agent-safe) or
  leaves it in-sandbox via an in-place `mv /r2e_tests /testbed/r2e_tests`
  (new fast-path). The roundtrip cost ~2-3 min per row, dominated
  by LFS upload/download pinning the sandbox client's thread pool for
  the entire transfer. For validate/no-agent flows, the fast-path cuts
  per-row wall time roughly 4x (300s → 70s median for cached images)
  and unblocks the thread-pool bottleneck. Hiding is required for fair
  agent rollouts (keep the default True) and strictly optional otherwise.

* `_process_example` now sets `info["instance_id"] = commit_hash` and
  `info["repo"] = repo_name` as aliases, so `TaskSet.validate()`'s
  JSONL output surfaces meaningful identifiers for R2E-Gym rows
  (previously both fields were None because validate() looks for
  generic names). No change to the R2E-Gym dataset schema or any
  consumer that reads the original field names.

Validated by a full 4578-row gold-patch run with `hide_tests_from_agent=False`:
4500/4578 pass (98.30%), no setup regressions, throughput ~40 rows/min
(vs ~2.4 rows/min baseline).