fix: inject num_turns in DAPO reward manager extra_info by koriyoshi2041 · Pull Request #15 · Gen-Verse/Open-AgentRL

koriyoshi2041 · 2026-05-29T17:39:38Z

Closes #13.

recipe/demystify/reward.py shapes the tool-call reward from extra_info.get("num_turns", 0):

num_turns = int(extra_info.get("num_turns", 0))
...
tool_call_reward = (num_turns - 2) / 2 * 0.1

The naive reward manager populates this key from the batch:

# verl/workers/reward_manager/naive.py
extra_info = data_item.non_tensor_batch.get("extra_info", {})
num_turns = data_item.non_tensor_batch.get("__num_turns__", None)
extra_info["num_turns"] = num_turns

…but verl/workers/reward_manager/dapo.py did not. The demystify recipe runs with reward_manager=dapo (see recipe/demystify/eval/eval_qwen3_4b_aime_gpqa.sh), so num_turns was always missing and defaulted to 0. That made tool_call_reward a constant (0-2)/2*0.1 = -0.1 no matter how many turns occurred, which is exactly the behaviour reported in #13.
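The arithmetic can be checked in isolation. The sketch below is only an illustration: it reproduces the two lines quoted from recipe/demystify/reward.py, and the wrapper function `tool_call_reward` is a hypothetical name. The code elided by `...` in the recipe is not reproduced, so any additional shaping the real recipe applies is not reflected here.

```python
def tool_call_reward(extra_info: dict) -> float:
    """Hypothetical wrapper around the two lines quoted from reward.py."""
    # Falls back to 0 when the reward manager never set the key.
    num_turns = int(extra_info.get("num_turns", 0))
    return (num_turns - 2) / 2 * 0.1


# Key missing (the DAPO path before this PR): (0 - 2) / 2 * 0.1 = -0.1
missing = tool_call_reward({})

# Key present (the naive path): the reward now depends on the turn count.
two_turns = tool_call_reward({"num_turns": 2})
six_turns = tool_call_reward({"num_turns": 6})

print(missing, two_turns, six_turns)
```

With the key missing, the result is the same for every sample, which matches the constant reported in #13.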

This PR mirrors the two lines from naive.py in dapo.py and switches the extra_info default from None to {}, matching naive.py, so the assignment is always safe. __num_turns__ is written into the batch by the agent loop, so the change only forwards it.
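A minimal sketch of the forwarding logic, run against a stand-in object rather than a real verl batch item. The helper names `build_extra_info` and `build_extra_info_old_default` and the `SimpleNamespace` stand-in are assumptions for illustration; in the actual change the lines sit inline in dapo.py.

```python
from types import SimpleNamespace


def build_extra_info(data_item):
    """Hypothetical helper mirroring the naive.py lines quoted above."""
    # Default to {} so the assignment below works when the key is absent.
    extra_info = data_item.non_tensor_batch.get("extra_info", {})
    num_turns = data_item.non_tensor_batch.get("__num_turns__", None)
    extra_info["num_turns"] = num_turns
    return extra_info


def build_extra_info_old_default(data_item):
    """Same logic with a None default, to show why the default was changed."""
    extra_info = data_item.non_tensor_batch.get("extra_info", None)
    extra_info["num_turns"] = data_item.non_tensor_batch.get("__num_turns__", None)
    return extra_info


# Stand-in for a batch item: only the attribute used above is modelled.
item = SimpleNamespace(non_tensor_batch={"__num_turns__": 4})

forwarded = build_extra_info(item)
print(forwarded)

try:
    build_extra_info_old_default(item)
    old_default_failed = False
except TypeError:
    # None does not support item assignment.
    old_default_failed = True
print(old_default_failed)
```

If the batch has no extra_info entry, the None default would make the assignment raise, which is why the default is switched to {} in the same change.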

I verified the change by inspection against the working naive.py path and with py_compile. I do not have the training environment to run an end-to-end reward check, so a maintainer review of the runtime path would be appreciated.

The naive reward manager copies __num_turns__ from the batch into extra_info["num_turns"], but the DAPO reward manager did not. Recipes that read extra_info["num_turns"] (e.g. demystify's tool-call reward) therefore always saw 0 under reward_manager=dapo, making tool_call_reward a constant -0.1 regardless of the actual turn count. Mirror naive.py so DAPO populates num_turns the same way.

koriyoshi2041 mentioned this pull request May 29, 2026

Is tool_call_reward expected to be a fixed -0.1 ? #13

Open

koriyoshi2041 wants to merge 1 commit into Gen-Verse:main from koriyoshi2041:fix/dapo-reward-manager-num-turns
