
fix: inject num_turns in DAPO reward manager extra_info#15

Open
koriyoshi2041 wants to merge 1 commit into Gen-Verse:main from koriyoshi2041:fix/dapo-reward-manager-num-turns


@koriyoshi2041


Closes #13.

recipe/demystify/reward.py shapes the tool-call reward from extra_info.get("num_turns", 0):

num_turns = int(extra_info.get("num_turns", 0))
...
tool_call_reward = (num_turns - 2) / 2 * 0.1

The naive reward manager populates this key from the batch:

# verl/workers/reward_manager/naive.py
extra_info = data_item.non_tensor_batch.get("extra_info", {})
num_turns = data_item.non_tensor_batch.get("__num_turns__", None)
extra_info["num_turns"] = num_turns

…but verl/workers/reward_manager/dapo.py did not. The demystify recipe runs with reward_manager=dapo (see recipe/demystify/eval/eval_qwen3_4b_aime_gpqa.sh), so num_turns was always missing and defaulted to 0, making tool_call_reward a constant (0-2)/2*0.1 = -0.1 no matter how many turns occurred — exactly the behaviour reported in #13.
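To make the arithmetic concrete, here is a minimal sketch of the shaping quoted above. The wrapper function `tool_call_reward` is an assumption for illustration; only the two quoted lines come from recipe/demystify/reward.py, and the elided code between them is not reproduced.

```python
def tool_call_reward(extra_info: dict) -> float:
    """Illustrative wrapper around the two lines quoted from reward.py."""
    # A missing key falls back to 0, which is the case described for reward_manager=dapo.
    num_turns = int(extra_info.get("num_turns", 0))
    return (num_turns - 2) / 2 * 0.1


# Without num_turns, the result does not depend on the rollout at all.
missing = tool_call_reward({})

# With num_turns present, the result varies with the turn count.
rewards = {n: tool_call_reward({"num_turns": n}) for n in (2, 4, 6)}

print(missing, rewards)
```

With the key absent the function returns -0.1 for every sample, which matches the constant reported in #13; with the key present it grows by 0.1 for every two extra turns.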

This mirrors the two lines from naive.py in dapo.py (and switches the extra_info default from None to {}, matching naive.py, so the assignment is always safe). __num_turns__ is written into the batch by the agent loop, so this is just forwarding it.
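The forwarding step can be sketched in isolation as follows. This is not the actual dapo.py diff: the helper name `build_extra_info` and the `SimpleNamespace` stand-in for a verl data item are assumptions made so the snippet runs without verl installed; the three lines in the function body are the ones quoted from naive.py.

```python
from types import SimpleNamespace


def build_extra_info(data_item) -> dict:
    """Forward __num_turns__ from the batch into extra_info (hypothetical helper)."""
    # Default to {} rather than None so the item assignment below cannot fail
    # when the batch carries no extra_info entry.
    extra_info = data_item.non_tensor_batch.get("extra_info", {})
    num_turns = data_item.non_tensor_batch.get("__num_turns__", None)
    extra_info["num_turns"] = num_turns
    return extra_info


# Hypothetical stand-in for a data item: only non_tensor_batch is modeled.
item = SimpleNamespace(non_tensor_batch={"__num_turns__": 4})
info = build_extra_info(item)
print(info)
```

Note that `dict.get` only applies its default when the key is absent, so a batch that stores an explicit `None` under `extra_info` would still fail at the assignment.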

I verified the change by inspecting it against the working naive.py path and by running py_compile on the modified file. I do not have the training environment needed for an end-to-end reward check, so a maintainer review of the runtime path would be appreciated.

The naive reward manager copies __num_turns__ from the batch into
extra_info["num_turns"], but the DAPO reward manager did not. Recipes that
read extra_info["num_turns"] (e.g. demystify's tool-call reward) therefore
always saw 0 under reward_manager=dapo, making tool_call_reward a constant
-0.1 regardless of the actual turn count. Mirror naive.py so DAPO populates
num_turns the same way.
Development

Successfully merging this pull request may close these issues.

Is tool_call_reward expected to be a fixed -0.1 ?
