Checklist
Question Description
The official example examples/train/on_policy_distillation.sh pairs:
- student: Qwen/Qwen3-8B-Base, eos_token = <|endoftext|> (id 151643)
- teacher: Qwen/Qwen3-32B (instruct), eos_token = <|im_end|> (id 151645)
so the student and teacher use different EOS tokens. (I verified the same convention
locally on the same family: Qwen3-4B-Base → 151643, Qwen3-14B → 151645.)
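For anyone who wants to reproduce the EOS check, here is a minimal sketch. The helper names compare_eos and load_and_compare are mine and are not part of the example script. The only library calls assumed are transformers' AutoTokenizer.from_pretrained and the tokenizer attributes eos_token / eos_token_id.

```python
def compare_eos(student_tok, teacher_tok):
    """Collect both EOS settings and flag whether the ids differ.

    Works with any object exposing `eos_token` and `eos_token_id`,
    e.g. a Hugging Face tokenizer.
    """
    return {
        "student": (student_tok.eos_token, student_tok.eos_token_id),
        "teacher": (teacher_tok.eos_token, teacher_tok.eos_token_id),
        "mismatch": student_tok.eos_token_id != teacher_tok.eos_token_id,
    }


def load_and_compare(student_id, teacher_id):
    """Hypothetical convenience wrapper: load both tokenizers, then compare.

    Imported lazily so compare_eos stays usable without transformers installed.
    """
    from transformers import AutoTokenizer

    student_tok = AutoTokenizer.from_pretrained(student_id)
    teacher_tok = AutoTokenizer.from_pretrained(teacher_id)
    return compare_eos(student_tok, teacher_tok)


# Usage (downloads the tokenizer files on first call):
# print(load_and_compare("Qwen/Qwen3-8B-Base", "Qwen/Qwen3-32B"))
```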
What happens
Running the example as-is (--lmbda 1 --beta 1, i.e. pure on-policy + reverse KL), after only a few dozen steps:
- completions/mean_length blows up and keeps hitting max_completion_length,
- eval performance collapses.
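One way to check whether the length blow-up is related to the EOS mismatch is to look at the sampled completions directly. The sketch below is a hypothetical diagnostic, not something the framework provides. It assumes you can dump the student's rollouts as lists of token ids, and it uses the EOS ids quoted above. It reports how many completions ran to the length limit without the student's EOS, and how many contain the teacher's EOS somewhere.

```python
def completion_stats(completions, student_eos_id, teacher_eos_id, max_len):
    """Summarize truncation and teacher-EOS usage over sampled completions.

    completions: list of token-id lists (one per sampled response).
    Returns fractions in [0, 1]; both are 0.0 for an empty input.
    """
    n = len(completions)
    if n == 0:
        return {"truncated": 0.0, "teacher_eos_seen": 0.0}

    # Reached the length limit and never produced the student's EOS.
    truncated = sum(
        1 for ids in completions
        if len(ids) >= max_len and student_eos_id not in ids
    )
    # Contains the teacher's EOS anywhere in the completion.
    teacher_eos_seen = sum(1 for ids in completions if teacher_eos_id in ids)

    return {
        "truncated": truncated / n,
        "teacher_eos_seen": teacher_eos_seen / n,
    }


# Toy usage with made-up token ids and a limit of 5 tokens:
samples = [[1, 2, 151645, 3, 4], [1, 151643], [5, 6, 7, 8, 9]]
stats = completion_stats(samples, 151643, 151645, max_len=5)
```

If both fractions are high on real rollouts, that would be consistent with the student emitting <|im_end|> while generation only stops on <|endoftext|>. This is a hypothesis to test, not a confirmed cause.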
Questions
We are hitting infinite repetition and exploding response length with this example (Base student + Instruct teacher). How did you solve it on your side?