Add wall-clock timeout support to MultiTurnEnv by xeophon · Pull Request #1166 · PrimeIntellect-ai/verifiers

xeophon · 2026-04-17T11:12:17Z

Summary

add a nullable timeout_seconds kwarg to MultiTurnEnv and stop timed out rollouts with timeout_reached
keep timeout configuration on the existing kwargs path (extra_env_kwargs) instead of adding a first-party eval flag
add focused tests and docs updates, plus a tiny CliAgentEnv typing fix needed for the repo's ty pre-push check

Testing

uv run pytest tests/test_multiturn_env.py tests/test_eval_cli.py
uv run pre-commit run --all-files
ad hoc smoke test: ToolEnv with a 10s sleeping tool and timeout_seconds=5 stopped in ~5s with stop_condition=timeout_reached

Note

Medium Risk
Adds cancellation-based wall-clock timeouts to the core multi-turn rollout loop and timing measurements, which can affect rollout termination, cleanup, and state flags across all derived environments.

Overview
Adds a nullable timeout_seconds to MultiTurnEnv and a built-in timeout_reached stop condition that marks rollouts as timed out/truncated and can hard-stop the rollout loop via task cancellation when the wall-clock limit is exceeded (including during setup_state).

Updates rollout timing to use time.perf_counter() (tracked via _start_perf_counter) and propagates the new timeout behavior through CliAgentEnv (inherits the stop condition, allows disabling sandbox timeouts, and treats timeouts as non-completions) plus makes sandbox creation resilient to cancellation by scheduling cleanup of a sandbox created after the caller is cancelled.

Extends CLI/TOML config parsing tests to ensure extra_env_kwargs.timeout_seconds is preserved, adds focused timeout-related unit tests, and updates docs/examples to mention timeout_seconds as a default stop/rollout limit.

^{Reviewed by Cursor Bugbot for commit f4ea7c9. Bugbot is set up for automated code reviews on this repo. Configure here.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 777360656e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a044a4bab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a9340fdca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a46bbbcfd7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-18T16:54:24Z

+            state["_setup_task"] = setup_task
            try:
-                state = await self.setup_state(state)
+                state = await asyncio.shield(setup_task)


Prevent detached setup task on rollout cancellation

setup_state is now wrapped in asyncio.shield, so if rollout() is cancelled by the caller (for example when generate() cancels pending rollout tasks during shutdown), the setup coroutine keeps running after cancellation while this method immediately proceeds to cleanup. In environments that allocate external resources in setup_state (e.g., sandbox creation), those side effects can complete after cleanup/teardown has already run, leaving resources untracked and potentially leaked.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 01ed18066b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-18T17:19:12Z

                if await self.check_agent_completed(state):
                    state["agent_completed"] = True
                    return None


Enforce rollout timeout while polling for requests

This loop now exits only when check_agent_completed(state) is true, so wall-clock timeout is effectively tied to wait_for_completion, whose asyncio.wait_for(..., timeout=self.timeout_seconds) starts only after start_agent runs in setup_state. If sandbox/tunnel setup is slow and the agent never sends requests, rollouts can run for setup_time + timeout_seconds instead of the configured timeout. This regresses timeout enforcement for CliAgentEnv under slow setup or startup hangs; add a direct elapsed/deadline check in this polling loop using the rollout start timestamp.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5eecf37e82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-18T17:27:48Z

+    async def timeout_reached(self, state: State) -> bool:
+        if self.timeout_seconds is None:
+            return False
+        if time.perf_counter() - state["timing"]["start_time"] <= self.timeout_seconds:
+            return False


Enforce timeout during in-flight rollout operations

This timeout check only runs when stop conditions are polled, but rollout() can spend unbounded time awaiting setup_state, get_prompt_messages, or get_model_response between those polls. As a result, a hung/slow setup or model call can exceed timeout_seconds by a large margin (or until an external timeout fires), so the new wall-clock timeout does not reliably cap rollout runtime.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-18T17:27:48Z

+        return state.get("agent_completed", False) and not state.get(
+            "agent_timed_out", False


Exclude timed-out states from agent_completed stop

This condition ignores state["timed_out"], but _poll_next_request now sets timed_out directly when the rollout deadline is hit. If agent_completed flips before the next stop-condition pass, agent_completed (alphabetically earlier) can terminate the rollout first, yielding stop_condition="agent_completed" for an already timed-out rollout and misclassifying timeout outcomes.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c590844138

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-18T17:34:46Z

+            with suppress(Exception):
+                sandbox = await create_task


Stop awaiting sandbox creation after cancellation

In the CliAgentEnv setup path, cancellation is what enforces timeout_seconds, but this CancelledError handler waits for create_task to finish before re-raising. If sandbox creation is slow (API latency/retries), a timed-out rollout will block here until creation completes, so wall-clock timeout and shutdown cancellation can overshoot by minutes instead of returning promptly. This should not synchronously await an unbounded create call on the cancellation path.

Useful? React with 👍 / 👎.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit f4ea7c9. Configure here.}

cursor · 2026-04-18T17:49:26Z

-        start_time = state["timing"]["start_time"]
-        end_time = time.time()
+        start_time = state.get("_start_perf_counter", state["timing"]["start_time"])
+        end_time = time.perf_counter()


Fallback mixes incompatible clock sources, silently breaking timeouts

Medium Severity

The fallback in state.get("_start_perf_counter", state["timing"]["start_time"]) mixes two incompatible clock sources. state["timing"]["start_time"] is set via time.time() (~1.7 billion epoch seconds), but it's subtracted from time.perf_counter() (a much smaller monotonic value). If the fallback ever triggers, the result is a large negative number: timeout_reached silently never fires, and _render_timing records wildly negative millisecond values. The same broken pattern appears in all three files.

Additional Locations (2)

verifiers/envs/multiturn_env.py#L81-L83

verifiers/envs/experimental/cli_agent_env.py#L482-L488

^{Reviewed by Cursor Bugbot for commit f4ea7c9. Configure here.}

Add MultiTurnEnv timeout kwarg

12d9b08

xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 7773606 to 12d9b08 Compare April 17, 2026 11:14

chatgpt-codex-connector bot reviewed Apr 17, 2026

View reviewed changes

Comment thread verifiers/envs/multiturn_env.py Outdated

Comment thread verifiers/envs/multiturn_env.py Outdated

cursor bot reviewed Apr 17, 2026

View reviewed changes

Comment thread verifiers/envs/multiturn_env.py

Comment thread docs/evaluation.md

snimu reviewed Apr 17, 2026

View reviewed changes

Comment thread verifiers/envs/multiturn_env.py Outdated

xeophon force-pushed the codex/multiturn-timeout-kwargs branch 3 times, most recently from 86cb453 to 0a044a4 Compare April 18, 2026 16:24

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread verifiers/envs/multiturn_env.py Outdated

xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 0a044a4 to 9a9340f Compare April 18, 2026 16:34

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread verifiers/envs/multiturn_env.py Outdated

cursor bot reviewed Apr 18, 2026

View reviewed changes

Comment thread verifiers/envs/experimental/cli_agent_env.py Outdated

xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 9a9340f to a46bbbc Compare April 18, 2026 16:48

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

cursor bot reviewed Apr 18, 2026

View reviewed changes

Comment thread verifiers/envs/multiturn_env.py Outdated

Comment thread verifiers/envs/experimental/cli_agent_env.py Outdated

xeophon force-pushed the codex/multiturn-timeout-kwargs branch from a46bbbc to 01ed180 Compare April 18, 2026 17:14

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 01ed180 to 5eecf37 Compare April 18, 2026 17:20

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

xeophon force-pushed the codex/multiturn-timeout-kwargs branch from 5eecf37 to c590844 Compare April 18, 2026 17:30

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

cursor bot reviewed Apr 18, 2026

View reviewed changes

Comment thread verifiers/envs/environment.py Outdated

Clean up multiturn timeout handling

f4ea7c9

xeophon force-pushed the codex/multiturn-timeout-kwargs branch from c590844 to f4ea7c9 Compare April 18, 2026 17:40

cursor bot reviewed Apr 18, 2026

View reviewed changes

		return state.get("agent_completed", False) and not state.get(
		"agent_timed_out", False

Conversation

xeophon commented Apr 17, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Apr 18, 2026

Choose a reason for hiding this comment

Fallback mixes incompatible clock sources, silently breaking timeouts

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xeophon commented Apr 17, 2026 •

edited by cursor bot

Loading