
feat: integrate CodeDebug (HumanEvalFix) environment and core server stabilization #421

Open
RUFFY-369 wants to merge 8 commits into NousResearch:main from RUFFY-369:feat/code-debug-env

Conversation


@RUFFY-369 RUFFY-369 commented Mar 27, 2026

Note

Research Context: This PR provides the underlying CodeDebugEnv backend required for the agentic reasoning and vLLM stabilization benchmarks in hermes-agent (PR #3448).

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

This pull request introduces the CodeDebug environment for standardized debugging benchmarks and includes critical stabilization for the core server handling layer.

  • New Environment: Integrated environments/community/code_debug_env/ based on HumanEvalFix.
  • Server Stabilization: Refactored openai_server.py to use structured logging and robust completion handling, removing legacy debug prints and emojis.
  • Safe Execution: Added code_executor.py for isolated, execution-based scoring of code fixes.
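The isolated, execution-based scoring described above might be sketched as follows. This is a hypothetical illustration, not the actual contents of code_executor.py: the function name `run_candidate` and the file-per-run strategy are assumptions. The core idea is to run the candidate fix plus its test suite in a separate interpreter process with a timeout, so buggy or malicious code cannot hang or corrupt the server process.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(candidate_code: str, test_code: str, timeout: float = 2.0) -> bool:
    """Execute a candidate fix against its test suite in an isolated
    subprocess; return True only if all tests pass within the timeout.
    (Hypothetical sketch -- the real code_executor.py may differ.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # HumanEvalFix-style check: concatenate the fix and its tests,
        # then see whether the script exits cleanly.
        f.write(candidate_code + "\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # guards against infinite loops in buggy code
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

A subprocess boundary (rather than `exec` in-process) keeps the verification under the ~2 s budget listed in the snapshot below while containing crashes and hangs.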

Related Issues

Closes: N/A (no specific issue linked; this PR adds a new feature/environment)

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Code refactor (no functional changes)
  • Documentation update

🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | CodeDebug-v0 |
| Short Description | Standardized debugging benchmark based on HumanEvalFix. |
| Category | Verifiable-Reasoning |
| Dataset Needed? | Yes (bigcode/humanevalpack) |
| External Deps | datasets, tqdm |
| Environment Variables | N/A |
| Compute Footprint Estimate | <1 GB RAM, <2 sec per verification |

🧪 Zero-Training Test Results

Details

W&B Link: N/A (Initial Integration)

Examples of the Environment scoring:

  • Good Fix (Score 1.0): Correctly identifies a syntax error or logic flaw and provides a passing implementation.
  • Partial Improvement (Score Variable): Provides code that passes more tests than the original buggy version but does not yet satisfy the full suite.
  • Failure (Score -1.0): Code fails to compile or makes the logic worse.
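The rubric above could be expressed as a small scoring function. This is a hedged sketch of one possible shaping, not the environment's actual implementation: the name `score_fix` and its parameters (`baseline_passed` for the tests the original buggy code already passed) are assumptions made for illustration.

```python
def score_fix(compiled: bool, passed: int, baseline_passed: int, total: int) -> float:
    """Map execution results to a reward following the rubric:
    all tests pass -> 1.0; improvement over the buggy baseline -> partial
    credit; failure to compile or no improvement -> -1.0.
    (Hypothetical sketch; the env's exact shaping may differ.)"""
    if not compiled or total == 0:
        return -1.0
    if passed == total:
        return 1.0  # good fix: full suite passes
    if passed <= baseline_passed:
        return -1.0  # no improvement over the original buggy version
    # Partial improvement: credit proportional to passing tests
    return passed / total
```

Anchoring partial credit to the buggy baseline prevents rewarding a model for merely reproducing the original failing behavior.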

✅ Developer & Reviewer Checklist

  • Code follows project style (black, ruff pass)
  • I have performed a self-review of my own code
  • I have commented my code appropriately
  • My changes generate no new warnings
  • New and existing unit tests pass locally (test_code_debug.py)
  • Docstrings added for all new public classes / functions

cc @dmahan93 @teknium1
