fix: inject num_turns in DAPO reward manager extra_info #15
Open
koriyoshi2041 wants to merge 1 commit into
The naive reward manager copies __num_turns__ from the batch into extra_info["num_turns"], but the DAPO reward manager did not. Recipes that read extra_info["num_turns"] (e.g. demystify's tool-call reward) therefore always saw 0 under reward_manager=dapo, making tool_call_reward a constant -0.1 regardless of the actual turn count. Mirror naive.py so DAPO populates num_turns the same way.
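The constant -0.1 follows directly from the arithmetic. Below is a minimal sketch of the shaping, assuming the linear form `(num_turns - 2) / 2 * 0.1` inferred from the expression `(0-2)/2*0.1` quoted in the description; the function body is illustrative, and the actual code in `recipe/demystify/reward.py` may differ (for example, it may clip the value).

```python
from typing import Optional


def tool_call_reward(extra_info: Optional[dict]) -> float:
    """Illustrative stand-in for the recipe's tool-call reward (assumed form)."""
    # A missing key falls back to 0, which is what happened under reward_manager=dapo.
    num_turns = (extra_info or {}).get("num_turns", 0)
    # Assumed linear shaping, inferred from the quoted (0-2)/2*0.1.
    return (num_turns - 2) / 2 * 0.1


# Without num_turns in extra_info, the reward is pinned at -0.1.
print(tool_call_reward({}))
# With the key forwarded, the reward tracks the actual turn count.
print(tool_call_reward({"num_turns": 4}))
```

Under this assumed form, any rollout scored without `num_turns` in `extra_info` gets the same value, which matches the symptom described.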
Closes #13.
`recipe/demystify/reward.py` shapes the tool-call reward from `extra_info.get("num_turns", 0)`. The naive reward manager populates this key from the batch, but `verl/workers/reward_manager/dapo.py` did not. The demystify recipe runs with `reward_manager=dapo` (see `recipe/demystify/eval/eval_qwen3_4b_aime_gpqa.sh`), so `num_turns` was always missing and defaulted to `0`, making `tool_call_reward` a constant `(0-2)/2*0.1 = -0.1` no matter how many turns occurred, which is exactly the behaviour reported in #13.

This mirrors the two lines from `naive.py` in `dapo.py` (and switches the `extra_info` default from `None` to `{}`, matching `naive.py`, so the assignment is always safe). `__num_turns__` is written into the batch by the agent loop, so this is just forwarding it.

I verified the change by inspection against the working `naive.py` path and `py_compile`; I do not have the training environment to run an end-to-end reward check, so a maintainer review of the runtime path would be appreciated.
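For reviewers, the forwarding pattern described above can be sketched as follows. This is a simplified stand-in, not the actual diff: `forward_num_turns` is a hypothetical helper, and a plain `dict` replaces the per-item `non_tensor_batch` that the reward managers read from.

```python
def forward_num_turns(non_tensor_batch: dict) -> dict:
    """Hypothetical helper: copy __num_turns__ from the batch into extra_info."""
    # Default to {} rather than None so the assignment below is safe
    # when the batch carries no extra_info entry at all.
    extra_info = non_tensor_batch.get("extra_info", {})
    # __num_turns__ is written into the batch by the agent loop.
    num_turns = non_tensor_batch.get("__num_turns__", None)
    extra_info["num_turns"] = num_turns
    return extra_info


# A batch item with a turn count but no extra_info still yields a usable dict.
print(forward_num_turns({"__num_turns__": 3}))
# Existing extra_info keys are preserved alongside num_turns.
print(forward_num_turns({"extra_info": {"split": "test"}, "__num_turns__": 5}))
```

Note that `dict.get` only falls back to `{}` when the key is absent; if a batch stores `extra_info` explicitly as `None`, the assignment would still fail, so that case is worth checking on the runtime path.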