Skip to content

Add wall-clock timeout support to MultiTurnEnv#1166

Open
xeophon wants to merge 2 commits intomainfrom
codex/multiturn-timeout-kwargs
Open

Add wall-clock timeout support to MultiTurnEnv#1166
xeophon wants to merge 2 commits intomainfrom
codex/multiturn-timeout-kwargs

Conversation

@xeophon
Copy link
Copy Markdown
Member

@xeophon xeophon commented Apr 17, 2026

Summary

  • add a nullable timeout_seconds kwarg to MultiTurnEnv and stop timed out rollouts with timeout_reached
  • keep timeout configuration on the existing kwargs path (extra_env_kwargs) instead of adding a first-party eval flag
  • add focused tests and docs updates, plus a tiny CliAgentEnv typing fix needed for the repo's ty pre-push check

Testing

  • uv run pytest tests/test_multiturn_env.py tests/test_eval_cli.py
  • uv run pre-commit run --all-files
  • ad hoc smoke test: ToolEnv with a 10s sleeping tool and timeout_seconds=5 stopped in ~5s with stop_condition=timeout_reached

Note

Medium Risk
Adds cancellation-based wall-clock timeouts to the core multi-turn rollout loop and timing measurements, which can affect rollout termination, cleanup, and state flags across all derived environments.

Overview
Adds a nullable timeout_seconds to MultiTurnEnv and a built-in timeout_reached stop condition that marks rollouts as timed out/truncated and can hard-stop the rollout loop via task cancellation when the wall-clock limit is exceeded (including during setup_state).

Updates rollout timing to use time.perf_counter() (tracked via _start_perf_counter) and propagates the new timeout behavior through CliAgentEnv (inherits the stop condition, allows disabling sandbox timeouts, and treats timeouts as non-completions) plus makes sandbox creation resilient to cancellation by scheduling cleanup of a sandbox created after the caller is cancelled.

Extends CLI/TOML config parsing tests to ensure extra_env_kwargs.timeout_seconds is preserved, adds focused timeout-related unit tests, and updates docs/examples to mention timeout_seconds as a default stop/rollout limit.

Reviewed by Cursor Bugbot for commit f4ea7c9. Bugbot is set up for automated code reviews on this repo. Configure here.

@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 7773606 to 12d9b08 Compare April 17, 2026 11:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 777360656e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/envs/multiturn_env.py Outdated
Comment thread verifiers/envs/multiturn_env.py Outdated
Comment thread verifiers/envs/multiturn_env.py
Comment thread docs/evaluation.md
Comment thread verifiers/envs/multiturn_env.py Outdated
@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch 3 times, most recently from 86cb453 to 0a044a4 Compare April 18, 2026 16:24
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a044a4bab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/envs/multiturn_env.py Outdated
@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 0a044a4 to 9a9340f Compare April 18, 2026 16:34
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a9340fdca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/envs/multiturn_env.py Outdated
Comment thread verifiers/envs/experimental/cli_agent_env.py Outdated
@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 9a9340f to a46bbbc Compare April 18, 2026 16:48
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a46bbbcfd7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/envs/multiturn_env.py Outdated
state["_setup_task"] = setup_task
try:
state = await self.setup_state(state)
state = await asyncio.shield(setup_task)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent detached setup task on rollout cancellation

setup_state is now wrapped in asyncio.shield, so if rollout() is cancelled by the caller (for example when generate() cancels pending rollout tasks during shutdown), the setup coroutine keeps running after cancellation while this method immediately proceeds to cleanup. In environments that allocate external resources in setup_state (e.g., sandbox creation), those side effects can complete after cleanup/teardown has already run, leaving resources untracked and potentially leaked.

Useful? React with 👍 / 👎.

Comment thread verifiers/envs/multiturn_env.py Outdated
Comment thread verifiers/envs/experimental/cli_agent_env.py Outdated
@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch from a46bbbc to 01ed180 Compare April 18, 2026 17:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 01ed18066b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 481 to 483
if await self.check_agent_completed(state):
state["agent_completed"] = True
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce rollout timeout while polling for requests

This loop now exits only when check_agent_completed(state) is true, so wall-clock timeout is effectively tied to wait_for_completion, whose asyncio.wait_for(..., timeout=self.timeout_seconds) starts only after start_agent runs in setup_state. If sandbox/tunnel setup is slow and the agent never sends requests, rollouts can run for setup_time + timeout_seconds instead of the configured timeout. This regresses timeout enforcement for CliAgentEnv under slow setup or startup hangs; add a direct elapsed/deadline check in this polling loop using the rollout start timestamp.

Useful? React with 👍 / 👎.

@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 01ed180 to 5eecf37 Compare April 18, 2026 17:20
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5eecf37e82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +78 to +82
async def timeout_reached(self, state: State) -> bool:
if self.timeout_seconds is None:
return False
if time.perf_counter() - state["timing"]["start_time"] <= self.timeout_seconds:
return False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce timeout during in-flight rollout operations

This timeout check only runs when stop conditions are polled, but rollout() can spend unbounded time awaiting setup_state, get_prompt_messages, or get_model_response between those polls. As a result, a hung/slow setup or model call can exceed timeout_seconds by a large margin (or until an external timeout fires), so the new wall-clock timeout does not reliably cap rollout runtime.

Useful? React with 👍 / 👎.

Comment on lines +650 to +651
return state.get("agent_completed", False) and not state.get(
"agent_timed_out", False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude timed-out states from agent_completed stop

This condition ignores state["timed_out"], but _poll_next_request now sets timed_out directly when the rollout deadline is hit. If agent_completed flips before the next stop-condition pass, agent_completed (alphabetically earlier) can terminate the rollout first, yielding stop_condition="agent_completed" for an already timed-out rollout and misclassifying timeout outcomes.

Useful? React with 👍 / 👎.

@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 5eecf37 to c590844 Compare April 18, 2026 17:30
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c590844138

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +179 to +180
with suppress(Exception):
sandbox = await create_task
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop awaiting sandbox creation after cancellation

In the CliAgentEnv setup path, cancellation is what enforces timeout_seconds, but this CancelledError handler waits for create_task to finish before re-raising. If sandbox creation is slow (API latency/retries), a timed-out rollout will block here until creation completes, so wall-clock timeout and shutdown cancellation can overshoot by minutes instead of returning promptly. This should not synchronously await an unbounded create call on the cancellation path.

Useful? React with 👍 / 👎.

Comment thread verifiers/envs/environment.py Outdated
@xeophon xeophon force-pushed the codex/multiturn-timeout-kwargs branch from c590844 to f4ea7c9 Compare April 18, 2026 17:40
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f4ea7c9. Configure here.

start_time = state["timing"]["start_time"]
end_time = time.time()
start_time = state.get("_start_perf_counter", state["timing"]["start_time"])
end_time = time.perf_counter()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback mixes incompatible clock sources, silently breaking timeouts

Medium Severity

The fallback in state.get("_start_perf_counter", state["timing"]["start_time"]) mixes two incompatible clock sources. state["timing"]["start_time"] is set via time.time() (~1.7 billion epoch seconds), but it's subtracted from time.perf_counter() (a much smaller monotonic value). If the fallback ever triggers, the result is a large negative number: timeout_reached silently never fires, and _render_timing records wildly negative millisecond values. The same broken pattern appears in all three files.

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f4ea7c9. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants