README.md (+54 -0)
@@ -175,6 +175,60 @@ prime eval run primeintellect/math-python
**[FAQs](docs/faqs.md)** - Other frequently asked questions.


## Supported Patterns

Verifiers supports a wide range of RL framework design patterns. Below is an overview of what's supported out of the box:

### Context Management
- **Context compaction** — Automatic message history management via `MultiTurnEnv` turn limits
- **Token-aware truncation** — Configurable max tokens per rollout
- **System prompt handling** — Persistent system prompts across turns
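As a rough sketch (not the library's actual implementation), token-aware truncation can be pictured as evicting the oldest non-system messages until the history fits a budget, while the system prompt always survives:

```python
def truncate_history(messages, max_tokens,
                     count_tokens=lambda m: len(m["content"]) // 4):
    """Drop the oldest non-system messages until the estimated total fits.

    `count_tokens` is a crude chars/4 heuristic standing in for a real
    tokenizer; swap in an actual tokenizer in practice.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(count_tokens, system + rest)) > max_tokens:
        rest.pop(0)  # evict the oldest turn first
    return system + rest
```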

### User Simulations
- **Multi-turn agents** — `MultiTurnEnv` for interactive agent tasks
- **Tool-augmented interactions** — `ToolEnv` and `StatefulToolEnv` for tool-using agents
- **Browser automation** — `BrowserEnv` for web-based agent tasks

### Native Tool Parsing
- **XML-based parsing** — `XMLParser` for structured output extraction
- **Tool call handling** — Native support for OpenAI-style tool calls
- **Custom parsers** — Extensible parser system for any output format
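For illustration, the core of XML-style field extraction can be sketched with the standard library (the field names and exact behavior here are assumptions, not the actual `XMLParser` API):

```python
import re

def parse_fields(text, fields=("think", "answer")):
    """Extract the first <field>...</field> span for each requested field.

    Returns None for fields that are absent or malformed, so downstream
    reward functions can penalize format violations.
    """
    out = {}
    for field in fields:
        m = re.search(rf"<{field}>(.*?)</{field}>", text, re.DOTALL)
        out[field] = m.group(1).strip() if m else None
    return out
```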

### Sandboxing
- **Harness-in-sandbox** — `SandboxEnv` for isolated execution environments
- **Harness-outside-of-sandbox** — Standard environments run locally
- **No sandbox** — Lightweight mode for simple tasks
- **Container management** — Automatic sandbox provisioning and cleanup

### Reward Systems
- **Groupwise rewards** — Batch-based reward computation for GRPO training
- **Intermediate rewards** — Per-turn reward signals in multi-turn tasks
- **Rubric composition** — Combine multiple reward functions with weighted scoring
- **Monitor rubrics** — Automatic metric collection during rollouts
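The composition idea can be sketched in plain Python (a minimal stand-in, not the library's `Rubric` class): each reward function scores a completion independently, and the rubric score is their weighted sum:

```python
def correct_answer(completion, answer):
    # 1.0 if the reference answer appears anywhere in the completion
    return 1.0 if answer in completion else 0.0

def concise(completion, answer, limit=200):
    # auxiliary shaping term: reward short completions
    return 1.0 if len(completion) <= limit else 0.0

def rubric_score(completion, answer, funcs, weights):
    return sum(w * f(completion, answer) for f, w in zip(funcs, weights))
```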

### Multi-Environment Support
- **Environment groups** — `EnvGroup` for running multiple environments in parallel
- **Environment mixing** — Composite datasets from multiple sources
- **A/B evaluation** — Compare models across different environments
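Environment mixing can be pictured as round-robin interleaving of per-environment datasets into one composite stream, with each row tagged by its source (an illustrative sketch; `EnvGroup`'s actual mechanics differ):

```python
def mix_datasets(named_datasets, n):
    """Draw n rows round-robin from several (name, rows) datasets,
    tagging each row with the environment it came from."""
    sources = [(name, list(rows)) for name, rows in named_datasets]
    mixed = []
    for i in range(n):
        name, rows = sources[i % len(sources)]
        row = rows[(i // len(sources)) % len(rows)]  # wrap short datasets
        mixed.append({"env": name, **row})
    return mixed
```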

### Resource Management
- **Async execution** — Non-blocking I/O for API calls and tool execution
- **Parallel rollouts** — Configurable concurrency for batch evaluation
- **Memory sharing** — Efficient memoization-based object sharing across rollouts
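Bounded-concurrency rollouts can be sketched with `asyncio` (illustrative only; the library's internals differ): a semaphore caps how many rollouts are in flight, while `gather` still returns results in prompt order:

```python
import asyncio

async def run_rollouts(prompts, rollout_fn, max_concurrent=8):
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(prompt):
        async with sem:  # at most max_concurrent rollouts at a time
            return await rollout_fn(prompt)

    # gather preserves the order of prompts, not completion order
    return await asyncio.gather(*(bounded(p) for p in prompts))
```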

### Custom Metrics & Error Handling
- **Custom reward functions** — Python callables for any scoring logic
- **Error tracking** — Structured error reporting in rollout data
- **Debug logging** — Detailed logging for development and troubleshooting
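Structured error handling during scoring can be sketched as a wrapper that records failures in the rollout state instead of crashing the batch (the names here are hypothetical, not the library's API):

```python
def safe_score(reward_fn, completion, answer, state):
    """Run a reward function, logging any exception into state["errors"]
    and falling back to a 0.0 reward."""
    try:
        return reward_fn(completion, answer)
    except Exception as exc:
        state.setdefault("errors", []).append(f"{reward_fn.__name__}: {exc}")
        return 0.0
```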

### Offline Evals
- **Local evaluation** — `prime eval run` for testing without training
- **Evaluation TUI** — Terminal UI for browsing eval results
- **Pass@k metrics** — Support for pass@k and ablation sweeps
- **Result persistence** — Automatic saving of evaluation results
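Pass@k is conventionally computed with the unbiased estimator 1 - C(n-c, k)/C(n, k) over n samples containing c successes; a reference implementation (independent of how the eval TUI reports it) is:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples containing c successes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```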

For detailed documentation on each pattern, see the [Documentation](#documentation) section above.

## Citation

Originally created by Will Brown ([@willccbb](https://github.com/willccbb)).
environments/README.md (+47 -0)
@@ -7,6 +7,53 @@ This folder contains installable example environments that showcase common usage
- **Install an environment from this GitHub repo**: `prime env install math-python --from-repo`
- **Evaluate**: `prime eval run math-python` (defaults to openai/gpt-4.1-mini, small sample)


## Installation Methods

Environments can be installed in two ways:

### 1. From Prime Intellect Hub (Recommended)

```bash
# Install from the Hub (requires prime CLI)
prime env install primeintellect/math-python

# Or install from local repo
prime env install math-python --from-repo
```

This is the primary method and works for **all 23 environments** in this repository. The Hub provides versioning, dependency resolution, and integration with `prime eval run`.

### 2. From Pip Index (Limited)

```bash
# Some environments are available on the pip index
pip install prime-env-math-python # Example format
```

**Note:** Only a subset of environments are published to the pip index (`hub.primeintellect.ai/ob1/simple/`). For the complete list of available environments, use the Hub directly.

### Which Method Should I Use?

| Use Case | Method |
|----------|--------|
| **Local development & evaluation** | `prime env install` from Hub |
| **CI/CD pipelines** | `prime env install` in workflow |
| **Dependency in pyproject.toml** | Use Hub with `prime env install` |
| **Standalone pip install** | Check pip index availability |

### For Framework Integration

When integrating environments into training frameworks (like `prime-rl`), use the Hub method:

```bash
# In your project setup
prime env install primeintellect/math-python
prime eval run math-python -m openai/gpt-4.1-mini
```

This ensures you get the latest version with proper dependency resolution.
> **Hub vs pip guidance missing from docs/faqs.md** (Low Severity)
>
> This PR addresses user confusion from issue #1100 about pip index availability, which is notable, FAQ-worthy information. The Hub vs pip installation distinction is documented only in environments/README.md, not in docs/faqs.md, where users commonly look for such clarifications. The project rule states that notable reference information that doesn't neatly map to a specific documentation section belongs in docs/faqs.md.
>
> Reviewed by Cursor Bugbot for commit fec3ead.


## Common usage patterns and examples

### SingleTurnEnv (prompt → single response)
verifiers/envs/integrations/textarena_env.py (+5 -0)
@@ -136,6 +136,11 @@ def ta_to_hf(self) -> tuple[Dataset, Dataset | None]:
eval_dataset_rows = []
_, user_prompt = self.ta_env.get_observation()
words = self.ta_env.word_list
# Handle both list and dict word_list formats
# Dict format (e.g. TwentyQuestions-v0): {category: [words]}
# List format (e.g. Wordle-v0): [words]
if isinstance(words, dict):
words = [word for category_words in words.values() for word in category_words]
> **Dict word_list fix enables silent correctness bug for non-Wordle games** (Medium Severity)
>
> The new dict `word_list` handling explicitly targets TwentyQuestions-v0, allowing `ta_to_hf` to generate a dataset for it. However, `setup_state` still hardcodes `game_state["secret_word"]`, which is Wordle-specific. For TwentyQuestions-v0 this sets an unrelated key instead of the game's actual target, so the dataset answer won't match the game's hidden word. Before this change, dict-based games crashed immediately in `ta_to_hf`; now they silently produce mismatched answers, causing incorrect reward computation.

# set seed
random.seed(self.seed)
for i in range(self.num_train_examples + self.num_eval_examples):