Checklist
Background
The current AReaL implementation supports knowledge distillation from a single teacher model in both of the following modes:
- On-policy Reverse KL distillation (RKL)
- Combined GRPO + KD (KDRL)
However, many practical training setups benefit from using multiple teacher models simultaneously, where different teachers specialize in different capabilities (e.g., reasoning, instruction-following, domain-specific skills).
This motivates extending the current framework to support multi-teacher distillation via a weighted mixture distribution.
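For concreteness, the weighted mixture can be written as follows. The notation is an assumption made here for illustration and does not come from the AReaL codebase: $K$ teachers with next-token distributions $p_k$, non-negative weights $w_k$ that sum to 1, prompt $x$, and sampled token $y_t$.

$$
p_{\text{mix}}(y_t \mid x, y_{<t}) = \sum_{k=1}^{K} w_k \, p_k(y_t \mid x, y_{<t}), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
$$

$$
\log p_{\text{mix}}(y_t \mid x, y_{<t}) = \log \sum_{k=1}^{K} \exp\big(\log w_k + \log p_k(y_t \mid x, y_{<t})\big)
$$

With $K = 1$ and $w_1 = 1$, this reduces to the single teacher's log-probability.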
Potential Solution
A straightforward implementation is:
1. Each teacher computes token-level log-probabilities for the sampled trajectories.
2. Teacher outputs are stacked along a teacher dimension.
3. The mixture distribution is computed in log space using a numerically stable aggregation:
   3.1. Normalize the teacher weights and take their logarithm
   3.2. Combine the weighted log-probabilities via log-sum-exp
4. The resulting teacher_logp is treated as a single unified teacher signal.
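The steps above can be sketched in a few lines. This is a minimal illustration, not AReaL code: the function names (`log_normalize_weights`, `logsumexp`, `mixture_teacher_logp`) are hypothetical, plain Python lists stand in for tensors, and teacher weights are assumed to be strictly positive. A real implementation would more likely stack tensors and call a library log-sum-exp.

```python
import math
from typing import List, Sequence


def log_normalize_weights(weights: Sequence[float]) -> List[float]:
    """Step 3.1: return log(w_k / sum_j w_j) for strictly positive weights."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("teacher weights must be non-empty and strictly positive")
    log_total = math.log(sum(weights))
    return [math.log(w) - log_total for w in weights]


def logsumexp(values: Sequence[float]) -> float:
    """Stable log(sum(exp(v))): shift by the max before exponentiating."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def mixture_teacher_logp(
    teacher_logps: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Steps 2-4: combine per-teacher token log-probs into one teacher_logp.

    teacher_logps[k][t] is the log-probability teacher k assigns to the
    t-th sampled token (the "stacked" teacher dimension is the outer index).
    """
    log_w = log_normalize_weights(weights)
    if len(teacher_logps) != len(log_w):
        raise ValueError("need exactly one weight per teacher")
    num_tokens = len(teacher_logps[0])
    if any(len(row) != num_tokens for row in teacher_logps):
        raise ValueError("all teachers must score the same tokens")
    # Step 3.2: log p_mix(t) = logsumexp_k(log w_k + log p_k(t))
    return [
        logsumexp([log_w[k] + teacher_logps[k][t] for k in range(len(log_w))])
        for t in range(num_tokens)
    ]


# Two hypothetical teachers scoring the same three sampled tokens.
teacher_a = [-0.2, -1.5, -0.7]
teacher_b = [-0.9, -0.4, -0.6]
teacher_logp = mixture_teacher_logp([teacher_a, teacher_b], weights=[2.0, 1.0])
print(teacher_logp)
```

Passing a single teacher with any positive weight returns that teacher's log-probabilities unchanged, which is how this sketch keeps single-teacher behavior intact.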
This design has the following advantages:
- No changes are required to the existing KD or KDRL objectives
- Backward compatibility with single-teacher setups is preserved
- It works uniformly for both rollout and train engines
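To illustrate why the objectives need no changes, here is a small sketch under stated assumptions. `reverse_kl_per_token` is a hypothetical helper, not AReaL's actual loss: it uses the common single-sample estimate of reverse KL on student-sampled tokens, `student_logp - teacher_logp`. The point is only that the function consumes one `teacher_logp` sequence and does not care whether it came from one teacher or from an aggregated mixture.

```python
from typing import List, Sequence


def reverse_kl_per_token(
    student_logp: Sequence[float],
    teacher_logp: Sequence[float],
) -> List[float]:
    """Per-token estimate log pi_student(y_t) - log p_teacher(y_t).

    Assumes both sequences score the same tokens sampled from the student.
    """
    if len(student_logp) != len(teacher_logp):
        raise ValueError("student and teacher must score the same tokens")
    return [s - t for s, t in zip(student_logp, teacher_logp)]


# Illustrative values only; the mixture row stands for the output of a
# multi-teacher aggregation step.
student = [-0.5, -1.2, -0.1]
single_teacher_logp = [-0.7, -1.0, -0.3]
mixture_logp = [-0.6, -0.9, -0.2]

# The same call works for both signals, so the objective code is untouched.
kl_single = reverse_kl_per_token(student, single_teacher_logp)
kl_mixture = reverse_kl_per_token(student, mixture_logp)
print(kl_single)
print(kl_mixture)
```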
Additional Information
This feature is particularly useful for:
- Mixing large and small teacher models
- Combining domain-specialized teachers (e.g., math + coding)
- Ensembling checkpoints from different training stages
- Improving robustness of distillation signal under teacher disagreement
It is also a natural extension of the current on-policy distillation setup: the student still samples trajectories from its own policy while receiving supervision from a richer, aggregated teacher distribution.