examples/train/on_policy_distillation.sh mixes a Base student with an Instruct teacher — mismatched EOS tokens cause length explosion / infinite repetition

### Checklist

- [ ] I have searched existing issues, and this is a new question or discussion topic.

### Question Description

The official example `examples/train/on_policy_distillation.sh` pairs:

- **student**: `Qwen/Qwen3-8B-Base`, with `eos_token = <|endoftext|>` (id **151643**)
- **teacher**: `Qwen/Qwen3-32B` (instruct), with `eos_token = <|im_end|>` (id **151645**)

So the student and teacher use **different EOS tokens**. (I verified the same convention
locally on other models in the same family: `Qwen3-4B-Base` → 151643, `Qwen3-14B` → 151645.)
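For anyone who wants to reproduce the check, here is a minimal sketch of the comparison. `compare_eos` is a hypothetical helper, not part of the training framework. It only reads the `eos_token` and `eos_token_id` attributes, so the stand-in objects below (filled with the ids reported above) can be swapped for real tokenizers, e.g. ones loaded with `transformers.AutoTokenizer.from_pretrained`.

```python
from types import SimpleNamespace


def compare_eos(student_tok, teacher_tok):
    """Report the EOS token of each tokenizer and whether the ids agree.

    Hypothetical helper: works with any object exposing `eos_token`
    and `eos_token_id` (Hugging Face tokenizers do).
    """
    return {
        "student_eos": (student_tok.eos_token, student_tok.eos_token_id),
        "teacher_eos": (teacher_tok.eos_token, teacher_tok.eos_token_id),
        "match": student_tok.eos_token_id == teacher_tok.eos_token_id,
    }


# Stand-ins carrying the values reported in this issue (assumption: these
# mirror what the real tokenizers return; no model is downloaded here).
student = SimpleNamespace(eos_token="<|endoftext|>", eos_token_id=151643)
teacher = SimpleNamespace(eos_token="<|im_end|>", eos_token_id=151645)

report = compare_eos(student, teacher)
print(report)
```

With the values from this issue, `report["match"]` is `False`.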
### What happens

Running the example as-is (`--lmbda 1 --beta 1`, i.e. pure on-policy + reverse KL),
after only a few dozen steps:

- `completions/mean_length` blows up and keeps hitting `max_completion_length`,
- eval performance collapses.
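To quantify the symptom outside the logged metric, a rough diagnostic like the one below may help. It is only a sketch: `termination_stats` is a hypothetical function, the completions are toy token-id lists rather than real rollouts, and treating both EOS ids as valid stop tokens is an assumption made for the diagnostic, not something the example script is known to do.

```python
def termination_stats(completions, stop_ids, max_completion_length):
    """Summarize how sampled completions terminate.

    completions: list of token-id lists (one per sampled completion).
    stop_ids: ids counted as a valid end-of-sequence marker.
    A completion is counted as truncated when it reaches
    max_completion_length without containing any stop id.
    """
    stop_ids = set(stop_ids)
    n = len(completions)
    if n == 0:
        return {"mean_length": 0.0, "truncated": 0, "truncated_fraction": 0.0}
    truncated = sum(
        1
        for c in completions
        if len(c) >= max_completion_length and not stop_ids.intersection(c)
    )
    return {
        "mean_length": sum(len(c) for c in completions) / n,
        "truncated": truncated,
        "truncated_fraction": truncated / n,
    }


# Toy data (hypothetical): one completion ends with <|im_end|> (151645),
# two run to the length limit without emitting either EOS id.
toy_completions = [[1, 2, 151645], [3, 4, 5, 6], [7, 8, 9, 10]]
stats = termination_stats(
    toy_completions, stop_ids={151643, 151645}, max_completion_length=4
)
print(stats)
```

A `truncated_fraction` that climbs toward 1 over training steps would match the behaviour described above, where completions keep hitting `max_completion_length`.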
### Questions

We are hitting infinite repetition and exploding response length with this example (Base student + Instruct teacher). How did you solve it on your side?

