Hi OLMoE team,
I am examining the implementation of the auxiliary MoE losses, specifically the Router Z-Loss ($L_Z$). I've noticed a significant difference in the normalization factor used in the OLMoE implementation compared to other common open-source implementations (e.g., OpenMoE/ColossalAI).
Observed Implementations
The Z-Loss is generally computed as the squared LogSumExp of the router logits, summed over tokens: $L_Z \propto \sum_{t} \left(\log \sum_{e} e^{W_{t,e}}\right)^2$, where $W_{t,e}$ is the router logit for token $t$ and expert $e$.
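To make the quantity concrete, here is a minimal pure-Python sketch of the unnormalized sum in the formula above. The helper names (`logsumexp`, `zloss_sum_squared`) and the toy logits are my own illustrations and are not taken from either codebase.

```python
import math

def logsumexp(row):
    """Numerically stable log(sum(exp(x))) over one token's router logits."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))

def zloss_sum_squared(router_logits):
    """Sum over tokens of the squared LogSumExp (no normalization applied)."""
    return sum(logsumexp(row) ** 2 for row in router_logits)

# Toy router logits: 3 tokens, 4 experts (illustrative values only).
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 4.0],
    [-1.0, 0.5, 0.5, 2.0],
]
unnormalized = zloss_sum_squared(toy_logits)
```

Note that nothing in this computation refers to the number of selected experts $k$; it only sees the full logit row for each token.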
| Implementation Style | Z-Loss Normalization | Code Reference |
| --- | --- | --- |
| MegaBlocks/OLMoE Style | Normalized by $N_{Layers}^{\text{total}} \cdot N_{Tokens} \cdot \mathbf{k}$ | Link to OLMoE Code |
| OpenMoE Style | Normalized by $N_{Layers} \cdot N_{Tokens}$ (excludes $\mathbf{k}$) | Link to OpenMoE Code |
Specific Question
In the MegaBlocks/OLMoE style implementation, the `scale_denominator` includes the $\text{Top-K}$ factor:

    scale_denominator = num_total_moe_layers * T_layer * top_k
    # ...
    zloss_normalized = zloss_sum_squared / scale_denominator

Why is the $\text{Top-K}$ factor ($k$) included in the Z-Loss denominator?
Since the Z-Loss aims to regularize the magnitude of the raw logits themselves, and the formula does not inherently depend on how many experts ($k$) are actually selected, including $k$ seems to serve mainly to keep the loss magnitude consistent with the Load Balancing Loss ($L_{aux}$).
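As a sanity check on that reading, the sketch below compares the two normalizations on the same unnormalized sum. The helper `normalize_zloss` and all numbers are hypothetical and only illustrate the arithmetic; they are not values from OLMoE or OpenMoE.

```python
import math

def normalize_zloss(zloss_sum_squared, num_layers, num_tokens, top_k=None):
    """Divide the summed squared LogSumExp by layers * tokens (* top_k if given)."""
    denominator = num_layers * num_tokens
    if top_k is not None:
        denominator *= top_k
    return zloss_sum_squared / denominator

# Hypothetical values chosen only for illustration.
zloss_sum = 1234.5
num_layers, num_tokens, top_k = 16, 4096, 8

with_k = normalize_zloss(zloss_sum, num_layers, num_tokens, top_k=top_k)
without_k = normalize_zloss(zloss_sum, num_layers, num_tokens)

# The two conventions differ by exactly a factor of top_k.
assert math.isclose(without_k, with_k * top_k)
```

Because the two results differ by a constant factor of $k$, dividing by $k$ has the same effect as scaling the Z-Loss coefficient by $1/k$. A coefficient tuned under one convention would therefore need rescaling to carry over to the other.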
Could you provide any insight or documentation on the following:
- Reasoning: What was the primary motivation for including the $\text{Top-K}$ factor in the Z-Loss normalization? Was it for stability, better hyperparameter transfer, or maintaining loss parity with $L_{aux}$?
- Ablation Studies: Were any ablation studies performed to compare the model performance (e.g., perplexity, convergence speed) between Z-Loss scaled by $\frac{1}{N_{L} \cdot N_{T}}$ versus scaling by $\frac{1}{N_{L} \cdot N_{T} \cdot \mathbf{k}}$?
Thank you for your time and insights!