4 changes: 4 additions & 0 deletions assets/lab/environments/AGENTS.md
@@ -169,6 +169,10 @@ The builder pattern is useful when:
- Multiple environment replicas don't all need to own the dataset
- You want to parameterize dataset creation without loading it immediately

Dataset builders are guarded by a timeout so hosted evaluations fail clearly instead of hanging forever before the first rollout. By default, Verifiers allows a builder up to 5 minutes to return. You can override this per environment with `dataset_build_timeout_seconds=...` on `vf.Environment` (or any subclass), or globally with the `VF_DATASET_BUILD_TIMEOUT` environment variable. Set either value to `0` or a negative number to disable the timeout.

If a builder raises an exception or exceeds the timeout, Verifiers raises `vf.DatasetBuildError` with environment context so dataset-access failures are surfaced directly in eval logs.

When a raw `Dataset` is passed directly (the default pattern), it is loaded eagerly during environment initialization for backwards compatibility.
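The wrapping described above can be sketched in miniature. This is an illustrative stand-alone version, not the actual Verifiers implementation, and `DatasetBuildError` here is a locally defined stand-in for `vf.DatasetBuildError`:

```python
class DatasetBuildError(RuntimeError):
    """Local stand-in for vf.DatasetBuildError (illustration only)."""


def build_with_context(builder, env_id: str):
    """Run a lazy dataset builder, adding environment context to failures."""
    try:
        return builder()
    except Exception as exc:
        # Re-raise with the environment name so eval logs point at the
        # failing environment, mirroring the behavior described above.
        raise DatasetBuildError(
            f"Failed to build evaluation dataset for environment '{env_id}': {exc}"
        ) from exc
```

A gated-dataset error then surfaces with the environment name attached, e.g. `Failed to build evaluation dataset for environment 'hle': Dataset 'cais/hle' is gated`.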

## Rubrics
4 changes: 4 additions & 0 deletions docs/environments.md
@@ -163,6 +163,10 @@ The builder pattern is useful when:
- Multiple environment replicas don't all need to own the dataset
- You want to parameterize dataset creation without loading it immediately

Dataset builders are guarded by a timeout so hosted evaluations fail clearly instead of hanging forever before the first rollout. By default, Verifiers allows a builder up to 5 minutes to return. You can override this per environment with `dataset_build_timeout_seconds=...` on `vf.Environment` (or any subclass), or globally with the `VF_DATASET_BUILD_TIMEOUT` environment variable. Set either value to `0` or a negative number to disable the timeout.

If a builder raises an exception or exceeds the timeout, Verifiers raises `vf.DatasetBuildError` with environment context so dataset-access failures are surfaced directly in eval logs.

When a raw `Dataset` is passed directly (the default pattern), it is loaded eagerly during environment initialization for backwards compatibility.
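The guard can be sketched with a worker thread and a bounded join, roughly along the lines the description above suggests. This is a simplified self-contained illustration, not the actual Verifiers code:

```python
import queue
import threading


def run_builder_with_timeout(builder, timeout_seconds: float):
    """Run a zero-argument dataset builder in a worker thread.

    Raises TimeoutError if the builder overruns, and re-raises any
    exception the builder itself threw. Sketch only.
    """
    results: queue.SimpleQueue = queue.SimpleQueue()

    def worker() -> None:
        try:
            results.put(builder())
        except BaseException as exc:
            results.put(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)

    if thread.is_alive():
        # The daemon thread is abandoned; the eval fails fast instead
        # of hanging before the first rollout.
        raise TimeoutError(f"dataset builder timed out after {timeout_seconds}s")

    result = results.get()
    if isinstance(result, BaseException):
        raise result
    return result
```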

## Rubrics
10 changes: 10 additions & 0 deletions docs/faqs.md
@@ -35,6 +35,16 @@ Set the `VF_LOG_LEVEL` environment variable:
VF_LOG_LEVEL=DEBUG prime eval run my-environment -m gpt-4.1-mini -n 5
```

### What if my evaluation dataset hangs while loading?

If your environment uses a lazy `DatasetBuilder`, Verifiers applies a 5-minute timeout by default so evals fail clearly instead of stalling forever before the first rollout. You can tune that timeout globally with `VF_DATASET_BUILD_TIMEOUT`:

```bash
VF_DATASET_BUILD_TIMEOUT=120 prime eval run my-environment -m gpt-4.1-mini -n 5
```

Set the value to `0` or a negative number to disable the guard entirely. Dataset builder failures and timeouts are raised as `vf.DatasetBuildError` with the environment name in the message.
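The variable's interpretation can be sketched as follows; `resolve_timeout` is a hypothetical helper for illustration, not part of the Verifiers API:

```python
import os

DEFAULT_TIMEOUT_SECONDS = 300.0  # the documented 5-minute default


def resolve_timeout(env=os.environ):
    """Interpret VF_DATASET_BUILD_TIMEOUT per the documented semantics.

    Positive values set the timeout in seconds; 0 or a negative value
    disables the guard (returned here as None); unset or unparseable
    values fall back to the default. Illustration only.
    """
    raw = env.get("VF_DATASET_BUILD_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    return None if value <= 0 else value
```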

## Environments

### Which environment class should I use?
18 changes: 15 additions & 3 deletions docs/reference.md
@@ -261,26 +261,31 @@ class RolloutScores(TypedDict):
class Environment(ABC):
def __init__(
self,
dataset: Dataset | None = None,
eval_dataset: Dataset | None = None,
dataset: Dataset | DatasetBuilder | None = None,
eval_dataset: Dataset | DatasetBuilder | None = None,
system_prompt: str | None = None,
few_shot: list[ChatMessage] | None = None,
few_shot: Messages | None = None,
parser: Parser | None = None,
rubric: Rubric | None = None,
sampling_args: SamplingArgs | None = None,
message_type: MessageType = "chat",
tool_defs: list[Tool] | None = None,
max_workers: int = 512,
env_id: str | None = None,
env_args: dict | None = None,
map_kwargs: dict = {},
max_seq_len: int | None = None,
score_rollouts: bool = True,
pass_threshold: float = 0.5,
dataset_build_timeout_seconds: float | None = None,
**kwargs,
): ...
```

Abstract base class for all environments.

`dataset` and `eval_dataset` may be either eager Hugging Face `Dataset` objects or lazy `DatasetBuilder` callables. When using builders, Verifiers raises `DatasetBuildError` if the builder fails, and applies a 5-minute timeout by default to prevent pre-rollout hangs. Override the timeout per environment with `dataset_build_timeout_seconds` or globally with the `VF_DATASET_BUILD_TIMEOUT` environment variable; set either to `0` or a negative value to disable the guard.
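A `DatasetBuilder` is a zero-argument callable returning the dataset, so parameterization is typically done with a closure. In this sketch a stub list stands in for a real `load_dataset` call to keep it self-contained:

```python
def make_builder(split: str):
    """Return a lazy builder; nothing is loaded until the callable runs."""

    def build():
        # A real builder would call e.g. load_dataset(..., split=split);
        # a stub row stands in here so the sketch is self-contained.
        return [{"question": "q", "split": split}]

    return build


builder = make_builder("train")  # no dataset loaded yet
rows = builder()                 # loading happens here, under the timeout guard
```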

**Generation methods:**

| Method | Returns | Description |
@@ -876,6 +881,13 @@ vf.load_environment(env_id: str, **kwargs) -> Environment

Load an environment by ID (e.g., `"primeintellect/gsm8k"`).

```python
class DatasetBuildError(RuntimeError):
...
```

Raised when a lazy dataset builder fails or exceeds the configured dataset build timeout.
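Because `DatasetBuildError` subclasses `RuntimeError`, callers that already catch `RuntimeError` keep working unchanged; a minimal sketch with a locally defined stand-in:

```python
class DatasetBuildError(RuntimeError):
    """Local stand-in mirroring the class shown above."""


def load_eval_dataset():
    # Simulate a timed-out build; the real message includes the
    # environment name and the configured timeout.
    raise DatasetBuildError(
        "Building evaluation dataset for environment 'hle' timed out after 300s"
    )


try:
    load_eval_dataset()
except RuntimeError as exc:  # also matches DatasetBuildError
    message = str(exc)
```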

### Configuration Utilities

```python
4 changes: 4 additions & 0 deletions environments/AGENTS.md
@@ -169,6 +169,10 @@ The builder pattern is useful when:
- Multiple environment replicas don't all need to own the dataset
- You want to parameterize dataset creation without loading it immediately

Dataset builders are guarded by a timeout so hosted evaluations fail clearly instead of hanging forever before the first rollout. By default, Verifiers allows a builder up to 5 minutes to return. You can override this per environment with `dataset_build_timeout_seconds=...` on `vf.Environment` (or any subclass), or globally with the `VF_DATASET_BUILD_TIMEOUT` environment variable. Set either value to `0` or a negative number to disable the timeout.

If a builder raises an exception or exceeds the timeout, Verifiers raises `vf.DatasetBuildError` with environment context so dataset-access failures are surfaced directly in eval logs.

When a raw `Dataset` is passed directly (the default pattern), it is loaded eagerly during environment initialization for backwards compatibility.
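The per-environment override semantics can be sketched as follows; `normalize_build_timeout` is a hypothetical helper for illustration, not the Verifiers API:

```python
def normalize_build_timeout(dataset_build_timeout_seconds):
    """Normalize the constructor override described above.

    Positive values set the timeout in seconds; 0 or a negative value
    disables the guard (None here). When no override is given, the real
    implementation consults VF_DATASET_BUILD_TIMEOUT before falling back
    to 5 minutes; this sketch jumps straight to the default.
    """
    if dataset_build_timeout_seconds is None:
        return 300.0
    if dataset_build_timeout_seconds <= 0:
        return None  # guard disabled
    return float(dataset_build_timeout_seconds)
```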

## Rubrics
41 changes: 41 additions & 0 deletions tests/test_environment.py
@@ -1,5 +1,6 @@
"""Tests for the base Environment class."""

import time
from unittest.mock import AsyncMock, Mock, patch

import pytest
@@ -185,6 +186,46 @@ def test_get_dataset(self, sample_dataset):
subset = env.get_dataset(n=1)
assert len(subset) == 1

def test_get_eval_dataset_wraps_builder_errors(self):
"""Test eval dataset builder errors include environment context."""

def failing_builder():
raise RuntimeError("Dataset 'cais/hle' is gated")

env = SimpleEnvironment(
eval_dataset=failing_builder,
env_id="hle",
parser=Parser(),
rubric=Rubric(),
)

with pytest.raises(
RuntimeError,
match="Failed to build evaluation dataset for environment 'hle': Dataset 'cais/hle' is gated",
):
env.get_eval_dataset()

def test_get_eval_dataset_timeout_raises_clear_error(self):
"""Test slow eval dataset builders fail with a timeout instead of hanging forever."""

def slow_builder():
time.sleep(0.2)
return Dataset.from_dict({"question": ["q"], "answer": ["a"]})

env = SimpleEnvironment(
eval_dataset=slow_builder,
env_id="hle",
parser=Parser(),
rubric=Rubric(),
dataset_build_timeout_seconds=0.01,
)

with pytest.raises(
RuntimeError,
match="Building evaluation dataset for environment 'hle' timed out after 10ms",
):
env.get_eval_dataset()

@pytest.mark.asyncio
async def test_get_model_response_chat(self, mock_client, make_input):
"""Test get_model_response with chat format."""
43 changes: 43 additions & 0 deletions tests/test_run_evaluation.py
@@ -0,0 +1,43 @@
from unittest.mock import AsyncMock, patch

import pytest

from verifiers.types import ClientConfig, EvalConfig
from verifiers.utils.eval_utils import run_evaluation


@pytest.mark.asyncio
async def test_run_evaluation_builds_dataset_before_starting_env_server():
order: list[tuple[str, int]] = []

class FakeEnv:
def set_kwargs(self, **kwargs):
return None

def get_eval_dataset(self, n: int = -1, seed=None):
order.append(("get_eval_dataset", n))
raise RuntimeError("dataset unavailable")

start_server = AsyncMock()
stop_server = AsyncMock()

fake_env = FakeEnv()
config = EvalConfig(
env_id="hle",
env_args={},
env_dir_path="./environments",
model="openai/gpt-4.1-mini",
client_config=ClientConfig(),
sampling_args={},
num_examples=10,
rollouts_per_example=3,
max_concurrent=1,
)

with patch("verifiers.utils.eval_utils.vf.load_environment", return_value=fake_env):
with pytest.raises(RuntimeError, match="dataset unavailable"):
await run_evaluation(config)

assert order == [("get_eval_dataset", 1)]
fake_env.start_server.assert_not_awaited()
fake_env.stop_server.assert_not_awaited()
4 changes: 3 additions & 1 deletion verifiers/__init__.py
@@ -81,6 +81,7 @@
"teardown",
"ensure_keys",
"MissingKeyError",
"DatasetBuildError",
"get_model",
"get_model_and_tokenizer",
"RLConfig",
@@ -106,6 +107,7 @@
"verifiers.clients.openai_completions_client:OpenAICompletionsClient"
),
"Environment": "verifiers.envs.environment:Environment",
"DatasetBuildError": "verifiers.envs.environment:DatasetBuildError",
"MultiTurnEnv": "verifiers.envs.multiturn_env:MultiTurnEnv",
"SingleTurnEnv": "verifiers.envs.singleturn_env:SingleTurnEnv",
"StatefulToolEnv": "verifiers.envs.stateful_tool_env:StatefulToolEnv",
@@ -174,7 +176,7 @@ def __getattr__(name: str):
)
from .clients.openai_completions_client import OpenAICompletionsClient # noqa: F401
from .envs.env_group import EnvGroup # noqa: F401
from .envs.environment import Environment # noqa: F401
from .envs.environment import DatasetBuildError, Environment # noqa: F401
from .envs.experimental.cli_agent_env import CliAgentEnv # noqa: F401
from .envs.experimental.gym_env import GymEnv # noqa: F401
from .envs.experimental.harbor_env import HarborEnv # noqa: F401
94 changes: 92 additions & 2 deletions verifiers/envs/environment.py
@@ -5,7 +5,10 @@
import json
import logging
import multiprocessing as mp
import os
import queue
import signal
import threading
import time
import uuid
import warnings
@@ -71,6 +74,7 @@
with_sem,
)
from verifiers.utils.error_utils import ErrorChain
from verifiers.utils.logging_utils import print_time
from verifiers.utils.message_utils import normalize_messages
from verifiers.utils.save_utils import (
GenerateOutputsBuilder,
@@ -87,6 +91,12 @@
from verifiers.workers.client.env_client import EnvClient

_MESSAGE_TYPE_UNSET = object()
_DATASET_BUILD_TIMEOUT_ENV_VAR = "VF_DATASET_BUILD_TIMEOUT"
_DEFAULT_DATASET_BUILD_TIMEOUT_SECONDS = 300.0


class DatasetBuildError(RuntimeError):
"""Raised when building an environment dataset fails or times out."""


class Environment(ABC):
@@ -112,6 +122,7 @@ def __init__(
max_seq_len: int | None = None,
score_rollouts: bool = True,
pass_threshold: float = 0.5,
dataset_build_timeout_seconds: float | None = None,
**kwargs,
):
if message_type is _MESSAGE_TYPE_UNSET:
@@ -148,6 +159,9 @@ def __init__(

self.set_score_rollouts(score_rollouts)
self.pass_threshold = pass_threshold
self.dataset_build_timeout_seconds = self._resolve_dataset_build_timeout(
dataset_build_timeout_seconds
)

self.env_client: EnvClient | None = None
self.env_server_process: BaseProcess | None = None
@@ -393,13 +407,86 @@ def _format_dataset_source(self, dataset: Dataset) -> Dataset:
map_kwargs=self.map_kwargs,
)

def _resolve_dataset_build_timeout(
self, dataset_build_timeout_seconds: float | None
) -> float | None:
if dataset_build_timeout_seconds is not None:
return (
None
if dataset_build_timeout_seconds <= 0
else dataset_build_timeout_seconds
)

raw_timeout = os.getenv(_DATASET_BUILD_TIMEOUT_ENV_VAR)
if raw_timeout is not None:
try:
parsed_timeout = float(raw_timeout)
except ValueError:
self.logger.warning(
"Invalid %s=%r; using default %.0fs",
_DATASET_BUILD_TIMEOUT_ENV_VAR,
raw_timeout,
_DEFAULT_DATASET_BUILD_TIMEOUT_SECONDS,
)
else:
return None if parsed_timeout <= 0 else parsed_timeout

return _DEFAULT_DATASET_BUILD_TIMEOUT_SECONDS

def _build_dataset_from_source(
self,
source: DatasetBuilder,
*,
source_name: str,
) -> Dataset:
timeout_seconds = self.dataset_build_timeout_seconds
if timeout_seconds is None:
return source()

result_queue: queue.SimpleQueue[Dataset | BaseException] = queue.SimpleQueue()

def build_dataset() -> None:
try:
result_queue.put(source())
except BaseException as exc: # pragma: no cover - exercised via caller
result_queue.put(exc)

builder_thread = threading.Thread(
target=build_dataset,
name=f"dataset-builder-{source_name.replace(' ', '-')}",
daemon=True,
)
builder_thread.start()
builder_thread.join(timeout_seconds)

if builder_thread.is_alive():
raise DatasetBuildError(
f"Building {source_name} for environment '{self.env_id or self.__class__.__name__}' "
f"timed out after {print_time(timeout_seconds)}. "
f"Check dataset access and network reachability, or increase "
f"{_DATASET_BUILD_TIMEOUT_ENV_VAR}."
)

result = result_queue.get()
if isinstance(result, BaseException):
if isinstance(result, DatasetBuildError):
raise result
raise DatasetBuildError(
f"Failed to build {source_name} for environment '{self.env_id or self.__class__.__name__}': {result}"
) from result

return result

def build_dataset(self) -> Dataset | None:
"""Build and cache the training dataset from source if needed."""
if self.dataset is not None:
return self.dataset
if self.dataset_source is None:
return None
built = self.dataset_source()
built = self._build_dataset_from_source(
self.dataset_source,
source_name="training dataset",
)
self.dataset = self._format_dataset_source(built)
return self.dataset

@@ -409,7 +496,10 @@ def build_eval_dataset(self) -> Dataset | None:
return self.eval_dataset
if self.eval_dataset_source is None:
return None
built = self.eval_dataset_source()
built = self._build_dataset_from_source(
self.eval_dataset_source,
source_name="evaluation dataset",
)
self.eval_dataset = self._format_dataset_source(built)
return self.eval_dataset

4 changes: 4 additions & 0 deletions verifiers/utils/eval_utils.py
@@ -734,6 +734,10 @@ async def run_evaluation(

results_path = config.resume_path or get_eval_results_path(config)

logger.info(f"Preparing evaluation dataset for {config.env_id}")
vf_env.get_eval_dataset(n=1)
logger.info(f"Evaluation dataset ready for {config.env_id}")

try:
if not config.disable_env_server:
extra_env_kwargs = dict(config.extra_env_kwargs)