examples/train/on_policy_distillation.sh mixes a Base student with an Instruct teacher — mismatched EOS tokens cause length explosion / infinite repetition #9526

@zandfj

Checklist

  • I have searched existing issues, and this is a new question or discussion topic.

Question Description

The official example examples/train/on_policy_distillation.sh pairs:

  • student: Qwen/Qwen3-8B-Base, eos_token = <|endoftext|> (id 151643)
  • teacher: Qwen/Qwen3-32B (instruct), eos_token = <|im_end|> (id 151645)

So the student and teacher use different EOS tokens. (I verified the same convention locally on the same family: Qwen3-4B-Base → 151643, Qwen3-14B → 151645.)
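
A minimal sketch of how the EOS mismatch could be checked before launching training. The helper name `eos_mismatch` is hypothetical, and the two `SimpleNamespace` objects are stand-ins carrying the values reported above. In practice they would be replaced by tokenizers loaded with `transformers.AutoTokenizer.from_pretrained(...)`, which expose the same `eos_token` and `eos_token_id` attributes.

```python
from types import SimpleNamespace


def eos_mismatch(student_tok, teacher_tok):
    """Return a description of the EOS mismatch, or None if both agree.

    Works with any object exposing `eos_token` and `eos_token_id`
    (hypothetical helper, not part of any library).
    """
    s = (student_tok.eos_token, student_tok.eos_token_id)
    t = (teacher_tok.eos_token, teacher_tok.eos_token_id)
    if s == t:
        return None
    return (
        f"student EOS {s[0]!r} (id {s[1]}) != "
        f"teacher EOS {t[0]!r} (id {t[1]})"
    )


# Stand-ins with the values reported in this issue (assumption: these
# mirror what the real Qwen3 tokenizers return).
student = SimpleNamespace(eos_token="<|endoftext|>", eos_token_id=151643)
teacher = SimpleNamespace(eos_token="<|im_end|>", eos_token_id=151645)

print(eos_mismatch(student, teacher))
```

A non-`None` result here would flag the pairing used by the example script.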

What happens

Running the example as-is (--lmbda 1 --beta 1, i.e. pure on-policy + reverse KL),
after only a few dozen steps:

  • completions/mean_length blows up and keeps hitting max_completion_length, and
  • eval performance collapses.
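
To quantify the symptom, one could measure how many sampled completions reach the length limit without ever emitting an EOS token. This is only a sketch: `truncation_stats` is a hypothetical helper, the toy batch is made up for illustration, and treating both 151643 and 151645 as valid stop ids is an assumption rather than what the trainer does.

```python
def truncation_stats(completions, max_completion_length, eos_ids):
    """Summarize a batch of completions given as lists of token ids.

    A completion counts as truncated when it reaches the length limit
    and contains none of the ids in `eos_ids`.
    """
    if not completions:
        return {"mean_length": 0.0, "truncated_frac": 0.0}
    truncated = sum(
        1
        for ids in completions
        if len(ids) >= max_completion_length and not (set(ids) & eos_ids)
    )
    mean_length = sum(len(ids) for ids in completions) / len(completions)
    return {
        "mean_length": mean_length,
        "truncated_frac": truncated / len(completions),
    }


# Toy batch with a length limit of 4 tokens (illustrative values only).
batch = [
    [11, 12, 151645],  # stops with <|im_end|>
    [11, 12, 13, 14],  # reaches the limit, no EOS
    [7, 7, 7, 7],      # reaches the limit, no EOS (repetition)
    [5, 151643],       # stops with <|endoftext|>
]
stats = truncation_stats(batch, max_completion_length=4, eos_ids={151643, 151645})
print(stats)
```

A truncated fraction that climbs toward 1.0 over training steps would match the behaviour described above.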

Questions

We are hitting infinite repetition and exploding response length with this example (Base student + Instruct teacher). How did you solve it on your side?
