Fail fast when eval dataset builds stall #1124

Open
d42me wants to merge 4 commits into PrimeIntellect-ai:main from d42me:fix/hle-dataset-build-timeout

d42me (Collaborator) commented Apr 10, 2026

Summary

  • add a bounded dataset-build guard so lazy dataset builders raise instead of hanging forever
  • wrap dataset build failures with environment context for clearer hosted eval errors
  • prepare the evaluation dataset before starting the env server so pre-rollout stalls are attributed correctly

Testing

  • uv run pytest tests/test_environment.py tests/test_environment_extra.py tests/test_run_evaluation.py
  • uv run ruff check verifiers/envs/environment.py verifiers/utils/eval_utils.py tests/test_environment.py tests/test_run_evaluation.py

Note

Medium Risk
Changes the evaluation startup sequence and introduces a threaded timeout guard around dataset building, which could affect environments with unusual get_eval_dataset behavior or long first-load times.

Overview
run_evaluation() now prepares the eval dataset before starting the env server, calling get_eval_dataset(n=1) to surface dataset-access failures early.
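
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeEnv`, `start_server`, `run_evaluation_sketch`, and the `n=-1` default are hypothetical names and values chosen for the example; only `get_eval_dataset(n=1)` comes from the description.

```python
import asyncio


class FakeEnv:
    """Hypothetical stand-in for an environment; records the order of calls."""

    def __init__(self):
        self.calls = []

    def get_eval_dataset(self, n=-1):
        # Record the call, then return at most n rows (all rows when n < 0).
        self.calls.append(f"get_eval_dataset(n={n})")
        rows = [{"question": "2+2?", "answer": "4"}]
        return rows if n < 0 else rows[:n]

    async def start_server(self):
        # Hypothetical placeholder for starting the env server.
        self.calls.append("start_server")


async def run_evaluation_sketch(env):
    # Touch the dataset first so access failures surface before any server exists.
    env.get_eval_dataset(n=1)
    await env.start_server()
    return env.calls


if __name__ == "__main__":
    print(asyncio.run(run_evaluation_sketch(FakeEnv())))
```

With this ordering, a failure inside `get_eval_dataset` is raised before the server step is reached, so it cannot be mistaken for a server startup problem.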

This adds a timeout guard (default 5 minutes) controlled by VF_DATASET_BUILD_TIMEOUT; when enabled, dataset prep runs in a background thread and raises a RuntimeError on timeout or build failure with the env id included.
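
A thread-based guard of this kind might look like the sketch below. It is written under assumptions: `read_timeout` and `guarded_build` are hypothetical helper names, the environment-variable parsing is a guess at reasonable behavior, and the timeout message wording is invented. Only the variable name `VF_DATASET_BUILD_TIMEOUT`, the 5-minute default, the "0 disables" behavior, and the `RuntimeError` with the env id come from this page.

```python
import os
import threading
from typing import Any, Callable, Mapping, Optional

DEFAULT_TIMEOUT_S = 300.0  # five minutes, the default stated in the overview


def read_timeout(env: Optional[Mapping[str, str]] = None) -> Optional[float]:
    """Return the timeout in seconds, or None when the guard is disabled (<= 0)."""
    env = os.environ if env is None else env
    raw = env.get("VF_DATASET_BUILD_TIMEOUT", "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_S
    value = float(raw)
    return value if value > 0 else None


def guarded_build(build: Callable[[], Any], env_id: str, timeout: Optional[float]) -> Any:
    """Run build(); raise RuntimeError naming env_id on failure or timeout."""
    if timeout is None:
        # Guard disabled: run inline and only wrap ordinary errors.
        try:
            return build()
        except Exception as exc:
            raise RuntimeError(
                f"Failed to prepare evaluation dataset for {env_id}: {exc}"
            ) from exc

    outcome = {}

    def worker() -> None:
        try:
            outcome["value"] = build()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    # Daemon thread: a build that never returns cannot block interpreter exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise RuntimeError(
            f"Timed out after {timeout:g}s preparing evaluation dataset for {env_id}"
        )
    if "error" in outcome:
        raise RuntimeError(
            f"Failed to prepare evaluation dataset for {env_id}: {outcome['error']}"
        ) from outcome["error"]
    return outcome["value"]
```

One limitation of this approach is that Python cannot forcibly stop the worker thread, so a stalled build keeps running in the background after the timeout is reported; the guard bounds how long the caller waits, not how long the build runs.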

Adds focused async tests covering call ordering (dataset before server), error wrapping, and timeout behavior, and updates docs/FAQs/skill guidance to document the new guard and troubleshooting knob.

Reviewed by Cursor Bugbot for commit 23e6ed0. Bugbot is set up for automated code reviews on this repo.

Comment thread verifiers/envs/environment.py Outdated
Comment thread docs/environments.md Outdated
Comment thread verifiers/utils/eval_utils.py
Comment thread docs/evaluation.md

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

    except BaseException as exc:
        raise RuntimeError(
            f"Failed to prepare evaluation dataset for {env_id}: {exc}"
        ) from exc

BaseException catch swallows KeyboardInterrupt as RuntimeError

Medium Severity

In the non-threaded path (when the timeout is disabled via VF_DATASET_BUILD_TIMEOUT=0), except BaseException catches KeyboardInterrupt and SystemExit and wraps them in RuntimeError. This converts a KeyboardInterrupt into an Exception subclass, which changes downstream handling: the except KeyboardInterrupt handler in run_evaluations_tui won't match it, while except Exception handlers that are not meant to catch interrupts will. Using except Exception here would let KeyboardInterrupt and SystemExit propagate naturally. The except BaseException in the threaded guarded_build_eval_dataset (line 93) is fine, since KeyboardInterrupt is normally delivered to the main thread.
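
The difference can be reproduced in isolation. The sketch below is illustrative only: `wrap_base_exception`, `wrap_exception`, `interrupted_build`, and `observed_by_caller` are hypothetical names, and the interrupt is simulated by raising `KeyboardInterrupt` directly rather than pressing Ctrl-C.

```python
def wrap_base_exception(build, env_id):
    """Mirrors the flagged pattern: every BaseException becomes a RuntimeError."""
    try:
        return build()
    except BaseException as exc:
        raise RuntimeError(
            f"Failed to prepare evaluation dataset for {env_id}: {exc}"
        ) from exc


def wrap_exception(build, env_id):
    """Suggested pattern: only ordinary errors are wrapped."""
    try:
        return build()
    except Exception as exc:
        raise RuntimeError(
            f"Failed to prepare evaluation dataset for {env_id}: {exc}"
        ) from exc


def interrupted_build():
    # Simulates the user interrupting a dataset build.
    raise KeyboardInterrupt


def observed_by_caller(wrapper):
    """Report which exception type a caller actually sees."""
    try:
        wrapper(interrupted_build, "demo-env")
    except KeyboardInterrupt:
        return "KeyboardInterrupt"
    except Exception as exc:
        return type(exc).__name__
    return "no exception"


if __name__ == "__main__":
    print(observed_by_caller(wrap_base_exception))  # RuntimeError
    print(observed_by_caller(wrap_exception))       # KeyboardInterrupt
```

`KeyboardInterrupt` and `SystemExit` derive from `BaseException` but not from `Exception`, so narrowing the handler is enough to let them pass through unwrapped.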
