
feat: integrate CodeDebug (HumanEvalFix) environment and core server stabilization #421

Open
RUFFY-369 wants to merge 8 commits into NousResearch:main from RUFFY-369:feat/code-debug-env

Conversation


@RUFFY-369 RUFFY-369 commented Mar 27, 2026

Note

Research Context: This PR provides the underlying CodeDebugEnv backend required for the agentic reasoning and vLLM stabilization benchmarks in hermes-agent (PR #3448).

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

This pull request introduces the CodeDebug environment for standardized debugging benchmarks and includes critical stabilization for the core server handling layer.

  • New Environment: Integrated environments/community/code_debug_env/ based on HumanEvalFix.
  • Server Stabilization: Refactored openai_server.py to use structured logging and robust completion handling, removing legacy debug prints and emojis.
  • Safe Execution: Added code_executor.py for isolated, execution-based scoring of code fixes.
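The isolated, execution-based scoring described above might be sketched as follows. This is a hypothetical illustration, not the actual contents of code_executor.py: the function name `run_candidate` and the file-per-run strategy are assumptions. The core idea is to run the candidate fix plus its test suite in a separate interpreter process with a timeout, so buggy or malicious code cannot hang or corrupt the server process.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(candidate_code: str, test_code: str, timeout: float = 2.0) -> bool:
    """Execute a candidate fix against its test suite in an isolated
    subprocess; return True only if all tests pass within the timeout.
    (Hypothetical sketch -- the real code_executor.py may differ.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # HumanEvalFix-style check: concatenate the fix and its tests,
        # then see whether the script exits cleanly.
        f.write(candidate_code + "\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # guards against infinite loops in buggy code
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

A subprocess boundary (rather than `exec` in-process) keeps the verification under the ~2 s budget listed in the snapshot below while containing crashes and hangs.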

Related Issues

Closes: N/A (no specific issue linked; this PR adds a new feature/environment)

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Code refactor (no functional changes)
  • Documentation update

🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | CodeDebug-v0 |
| Short Description | Standardized debugging benchmark based on HumanEvalFix. |
| Category | Verifiable-Reasoning |
| Dataset Needed? | Yes (bigcode/humanevalpack) |
| External Deps | datasets, tqdm |
| Environment Variables | N/A |
| Compute Footprint Estimate | <1 GB RAM, <2 sec per verification |

🧪 Zero-Training Test Results

Details

W&B Link: N/A (Initial Integration)

Examples of the Environment scoring:

  • Good Fix (Score 1.0): Correctly identifies a syntax error or logic flaw and provides a passing implementation.
  • Partial Improvement (Score Variable): Provides code that passes more tests than the original buggy version but does not yet satisfy the full suite.
  • Failure (Score -1.0): Code fails to compile or makes the logic worse.
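The rubric above could be expressed as a small scoring function. This is a hedged sketch of one possible shaping, not the environment's actual implementation: the name `score_fix` and its parameters (`baseline_passed` for the tests the original buggy code already passed) are assumptions made for illustration.

```python
def score_fix(compiled: bool, passed: int, baseline_passed: int, total: int) -> float:
    """Map execution results to a reward following the rubric:
    all tests pass -> 1.0; improvement over the buggy baseline -> partial
    credit; failure to compile or no improvement -> -1.0.
    (Hypothetical sketch; the env's exact shaping may differ.)"""
    if not compiled or total == 0:
        return -1.0
    if passed == total:
        return 1.0  # good fix: full suite passes
    if passed <= baseline_passed:
        return -1.0  # no improvement over the original buggy version
    # Partial improvement: credit proportional to passing tests
    return passed / total
```

Anchoring partial credit to the buggy baseline prevents rewarding a model for merely reproducing the original failing behavior.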

✅ Developer & Reviewer Checklist

  • Code follows project style (black, ruff pass)
  • I have performed a self-review of my own code
  • I have commented my code appropriately
  • My changes generate no new warnings
  • New and existing unit tests pass locally (test_code_debug.py)
  • Docstrings added for all new public classes / functions

cc @dmahan93 @teknium1
