Add wall-clock timeout support to MultiTurnEnv #1166
7773606 to 12d9b08
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 777360656e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
86cb453 to 0a044a4
💡 Codex Review
Reviewed commit: 0a044a4bab
0a044a4 to 9a9340f
💡 Codex Review
Reviewed commit: 9a9340fdca
9a9340f to a46bbbc
💡 Codex Review
Reviewed commit: a46bbbcfd7
    state["_setup_task"] = setup_task
    try:
        state = await self.setup_state(state)
        state = await asyncio.shield(setup_task)
Prevent detached setup task on rollout cancellation
setup_state is now wrapped in asyncio.shield, so if rollout() is cancelled by the caller (for example when generate() cancels pending rollout tasks during shutdown), the setup coroutine keeps running after cancellation while this method immediately proceeds to cleanup. In environments that allocate external resources in setup_state (e.g., sandbox creation), those side effects can complete after cleanup/teardown has already run, leaving resources untracked and potentially leaked.
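The detached-task behavior described in this comment can be reproduced with a minimal sketch. Everything here is hypothetical scaffolding (`run_rollout`, the inner `setup_state` and `rollout`, the sleep durations); it is not the PR's code. It only shows that a shielded task keeps running after the caller is cancelled, and one possible way to cancel it and wait for it to settle before cleanup continues.

```python
import asyncio
from typing import List


async def run_rollout(cancel_inner: bool) -> List[str]:
    """Cancel a rollout mid-setup and record the order of events."""
    events: List[str] = []

    async def setup_state() -> None:
        # Hypothetical slow setup that would allocate an external resource.
        await asyncio.sleep(0.05)
        events.append("setup finished")

    async def rollout() -> None:
        setup_task = asyncio.ensure_future(setup_state())
        try:
            await asyncio.shield(setup_task)
        except asyncio.CancelledError:
            if cancel_inner:
                # Forward the cancellation and wait until setup has settled.
                setup_task.cancel()
                await asyncio.gather(setup_task, return_exceptions=True)
            raise
        finally:
            events.append("cleanup ran")

    task = asyncio.ensure_future(rollout())
    await asyncio.sleep(0.01)  # let setup start
    task.cancel()              # stands in for the caller cancelling the rollout
    await asyncio.gather(task, return_exceptions=True)
    await asyncio.sleep(0.1)   # give a detached setup time to finish
    return events
```

With `cancel_inner=False` the setup side effect lands after cleanup has run, which is the leak scenario above; with `cancel_inner=True` setup never completes. Cancelling the inner task is only one option; tracking the task so that cleanup can reap whatever it created would be another.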
a46bbbc to 01ed180
💡 Codex Review
Reviewed commit: 01ed18066b
    if await self.check_agent_completed(state):
        state["agent_completed"] = True
        return None
Enforce rollout timeout while polling for requests
This loop now exits only when check_agent_completed(state) is true, so wall-clock timeout is effectively tied to wait_for_completion, whose asyncio.wait_for(..., timeout=self.timeout_seconds) starts only after start_agent runs in setup_state. If sandbox/tunnel setup is slow and the agent never sends requests, rollouts can run for setup_time + timeout_seconds instead of the configured timeout. This regresses timeout enforcement for CliAgentEnv under slow setup or startup hangs; add a direct elapsed/deadline check in this polling loop using the rollout start timestamp.
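A rough sketch of the suggested deadline check follows. The function name, its parameters, and the returned status tuples are assumptions made for illustration, not the `CliAgentEnv` API. The point is that the deadline is derived from the rollout start timestamp, so time already spent in setup counts against the budget.

```python
import asyncio
import time
from typing import Any, Callable, Optional, Tuple


async def poll_next_request(
    queue: "asyncio.Queue[Any]",
    start_perf_counter: float,
    timeout_seconds: Optional[float],
    agent_completed: Callable[[], bool],
    poll_interval: float = 0.01,
) -> Tuple[str, Any]:
    """Poll for a request, stopping on completion or on the rollout deadline."""
    while True:
        if agent_completed():
            return ("agent_completed", None)
        wait = poll_interval
        if timeout_seconds is not None:
            # Deadline is relative to rollout start, not to when polling began.
            remaining = start_perf_counter + timeout_seconds - time.perf_counter()
            if remaining <= 0:
                return ("timed_out", None)
            wait = min(poll_interval, remaining)
        try:
            request = await asyncio.wait_for(queue.get(), timeout=wait)
        except asyncio.TimeoutError:
            continue  # nothing arrived yet; re-check completion and deadline
        return ("request", request)
```

If setup has already used up the budget, the first pass through the loop returns `("timed_out", None)` without waiting for a request.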
01ed180 to 5eecf37
💡 Codex Review
Reviewed commit: 5eecf37e82
    async def timeout_reached(self, state: State) -> bool:
        if self.timeout_seconds is None:
            return False
        if time.perf_counter() - state["timing"]["start_time"] <= self.timeout_seconds:
            return False
Enforce timeout during in-flight rollout operations
This timeout check only runs when stop conditions are polled, but rollout() can spend unbounded time awaiting setup_state, get_prompt_messages, or get_model_response between those polls. As a result, a hung/slow setup or model call can exceed timeout_seconds by a large margin (or until an external timeout fires), so the new wall-clock timeout does not reliably cap rollout runtime.
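One way to bound the awaits between polls is to wrap each in-flight step in `asyncio.wait_for` with whatever budget remains. The helper below is a hypothetical sketch (`run_within_budget` and its parameters are not from the PR); it takes a zero-argument factory so that no coroutine is created once the budget is already spent.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_within_budget(
    step: Callable[[], Awaitable[T]],
    start_perf_counter: float,
    timeout_seconds: float,
) -> T:
    """Await one rollout step, capped by the time left in the rollout budget."""
    remaining = timeout_seconds - (time.perf_counter() - start_perf_counter)
    if remaining <= 0:
        raise asyncio.TimeoutError("rollout budget exhausted before step started")
    # wait_for cancels the step and raises TimeoutError once the budget runs out.
    return await asyncio.wait_for(step(), timeout=remaining)
```

A caller would wrap the setup, prompt, and model-response awaits the same way and translate `asyncio.TimeoutError` into the timed-out state. Note that `wait_for` enforces the limit by cancelling the step, so each step has to tolerate cancellation.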
    return state.get("agent_completed", False) and not state.get(
        "agent_timed_out", False
Exclude timed-out states from agent_completed stop
This condition ignores state["timed_out"], but _poll_next_request now sets timed_out directly when the rollout deadline is hit. If agent_completed flips before the next stop-condition pass, agent_completed (alphabetically earlier) can terminate the rollout first, yielding stop_condition="agent_completed" for an already timed-out rollout and misclassifying timeout outcomes.
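The misclassification can be illustrated with a small sketch. It assumes, as the comment implies, that stop conditions are checked in alphabetical order and that the first match wins; the registry and helper names are hypothetical. Here `agent_completed` excludes both timeout flags, so a rollout that has already timed out is reported as `timeout_reached`.

```python
from typing import Callable, Dict, Optional


def agent_completed(state: dict) -> bool:
    # Treat either timeout flag as disqualifying a "completed" outcome.
    timed_out = state.get("timed_out", False) or state.get("agent_timed_out", False)
    return state.get("agent_completed", False) and not timed_out


def timeout_reached(state: dict) -> bool:
    return state.get("timed_out", False)


# Hypothetical registry; evaluation order is assumed to be alphabetical.
STOP_CONDITIONS: Dict[str, Callable[[dict], bool]] = {
    "agent_completed": agent_completed,
    "timeout_reached": timeout_reached,
}


def first_stop_condition(state: dict) -> Optional[str]:
    """Return the name of the first stop condition that fires, if any."""
    for name in sorted(STOP_CONDITIONS):
        if STOP_CONDITIONS[name](state):
            return name
    return None
```

With both `agent_completed` and `timed_out` set, `first_stop_condition` returns `"timeout_reached"` even though `agent_completed` is checked first.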
5eecf37 to c590844
💡 Codex Review
Reviewed commit: c590844138
    with suppress(Exception):
        sandbox = await create_task
Stop awaiting sandbox creation after cancellation
In the CliAgentEnv setup path, cancellation is what enforces timeout_seconds, but this CancelledError handler waits for create_task to finish before re-raising. If sandbox creation is slow (API latency/retries), a timed-out rollout will block here until creation completes, so wall-clock timeout and shutdown cancellation can overshoot by minutes instead of returning promptly. This should not synchronously await an unbounded create call on the cancellation path.
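A possible shape for a non-blocking cancellation path is sketched below. The `create` and `delete` callables and the function name are placeholders, not the sandbox client's API. On cancellation the handler attaches a done-callback that schedules deletion of any sandbox that finishes creating later, then re-raises immediately instead of awaiting the create call.

```python
import asyncio
from typing import Any, Awaitable, Callable, Set

# Keep references to background cleanup tasks so they are not garbage collected.
_background: Set["asyncio.Future[Any]"] = set()


async def create_sandbox_cancellable(
    create: Callable[[], Awaitable[str]],
    delete: Callable[[str], Awaitable[None]],
) -> str:
    """Create a sandbox; if cancelled, return promptly and clean up later."""
    create_task = asyncio.ensure_future(create())
    try:
        return await asyncio.shield(create_task)
    except asyncio.CancelledError:

        def _cleanup(task: "asyncio.Future[str]") -> None:
            # Nothing to delete if creation was cancelled or failed.
            if task.cancelled() or task.exception() is not None:
                return
            cleanup = asyncio.ensure_future(delete(task.result()))
            _background.add(cleanup)
            cleanup.add_done_callback(_background.discard)

        create_task.add_done_callback(_cleanup)
        raise  # do not await the slow create call on the cancellation path
```

The trade-off is that cleanup becomes best-effort: if the event loop shuts down before creation finishes, the callback never runs, so some external reaping would still be needed.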
c590844 to f4ea7c9
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f4ea7c9.
    - start_time = state["timing"]["start_time"]
    - end_time = time.time()
    + start_time = state.get("_start_perf_counter", state["timing"]["start_time"])
    + end_time = time.perf_counter()
Fallback mixes incompatible clock sources, silently breaking timeouts
Medium Severity
The fallback in state.get("_start_perf_counter", state["timing"]["start_time"]) mixes two incompatible clock sources. state["timing"]["start_time"] is set via time.time() (~1.7 billion epoch seconds), but it's subtracted from time.perf_counter() (a much smaller monotonic value). If the fallback ever triggers, the result is a large negative number: timeout_reached silently never fires, and _render_timing records wildly negative millisecond values. The same broken pattern appears in all three files.
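The arithmetic behind this finding can be checked with fixed numbers. The sketch below mirrors the comparison in `timeout_reached` as a pure function; the helper names and the sample values are illustrative only. It also shows one fallback-free alternative, where both clocks are recorded at rollout start and elapsed time is read only from the monotonic one.

```python
import time


def timeout_reached(now: float, start: float, timeout_seconds: float) -> bool:
    """True once more than timeout_seconds have elapsed between start and now."""
    return now - start > timeout_seconds


def start_clocks() -> dict:
    # Record both clocks once: wall-clock for reporting, monotonic for elapsed time.
    return {"start_time": time.time(), "_start_perf_counter": time.perf_counter()}


def elapsed_seconds(state: dict) -> float:
    # No fallback: a missing key raises KeyError instead of mixing clock sources.
    return time.perf_counter() - state["_start_perf_counter"]


# Illustrative values: an epoch-style start versus a perf_counter-style "now".
epoch_start = 1_700_000_000.0
perf_now = 12_345.0
mixed_elapsed = perf_now - epoch_start  # hugely negative, so the timeout never fires
```

Raising on a missing `_start_perf_counter` makes the failure visible rather than silently disabling the timeout.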


Summary
- timeout_seconds kwarg to MultiTurnEnv and stop timed-out rollouts with timeout_reached
- extra_env_kwargs) instead of adding a first-party eval flag
- CliAgentEnv typing fix needed for the repo's ty pre-push check
Testing
- uv run pytest tests/test_multiturn_env.py tests/test_eval_cli.py
- uv run pre-commit run --all-files
- ToolEnv with a 10s sleeping tool and timeout_seconds=5 stopped in ~5s with stop_condition=timeout_reached
Note
Medium Risk
Adds cancellation-based wall-clock timeouts to the core multi-turn rollout loop and timing measurements, which can affect rollout termination, cleanup, and state flags across all derived environments.
Overview
Adds a nullable timeout_seconds to MultiTurnEnv and a built-in timeout_reached stop condition that marks rollouts as timed out/truncated and can hard-stop the rollout loop via task cancellation when the wall-clock limit is exceeded (including during setup_state).
Updates rollout timing to use time.perf_counter() (tracked via _start_perf_counter) and propagates the new timeout behavior through CliAgentEnv (inherits the stop condition, allows disabling sandbox timeouts, and treats timeouts as non-completions), plus makes sandbox creation resilient to cancellation by scheduling cleanup of a sandbox created after the caller is cancelled.
Extends CLI/TOML config parsing tests to ensure extra_env_kwargs.timeout_seconds is preserved, adds focused timeout-related unit tests, and updates docs/examples to mention timeout_seconds as a default stop/rollout limit.
Reviewed by Cursor Bugbot for commit f4ea7c9. Bugbot is set up for automated code reviews on this repo.