Question on Router Z-Loss Normalization: Why Include Top-K in the Scaling Denominator? #39

Description

@james-yw

Hi OLMoE team,

I am examining the implementation of the auxiliary MoE losses, specifically the Router Z-Loss ($L_Z$). I've noticed a significant difference in the normalization factor used in the OLMoE implementation compared to other common open-source implementations (e.g., OpenMoE/ColossalAI).

Observed Implementations

The Z-Loss is generally defined as the squared LogSumExp of the router logits, summed over tokens: $L_Z \propto \sum_{t} \left(\log \sum_{e} e^{W_{t,e}}\right)^2$, where $W_{t,e}$ is the router logit of token $t$ for expert $e$.

| Implementation Style | Z-Loss Normalization | Code Reference |
| --- | --- | --- |
| MegaBlocks/OLMoE Style | Normalized by $N_{Layers}^{\text{total}} \cdot N_{Tokens} \cdot \mathbf{k}$ | Link to OLMoE Code |
| OpenMoE Style | Normalized by $N_{Layers} \cdot N_{Tokens}$ (excludes $\mathbf{k}$) | Link to OpenMoE Code |
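
Written out, and assuming I am reading both code paths correctly, the two normalizations in the table would look as follows. The layer index $l$ and the shorthand $z_{l,t}$ are notation I am introducing here for illustration; they do not appear in either codebase.

$$
z_{l,t} = \log \sum_{e} e^{W^{(l)}_{t,e}}, \qquad
L_Z^{\text{OLMoE}} = \frac{1}{N_{Layers}^{\text{total}} \cdot N_{Tokens} \cdot k} \sum_{l} \sum_{t} z_{l,t}^2, \qquad
L_Z^{\text{OpenMoE}} = \frac{1}{N_{Layers} \cdot N_{Tokens}} \sum_{l} \sum_{t} z_{l,t}^2
$$

If the layer and token counts coincide in both implementations (an assumption on my part), the two differ only by a constant factor, $L_Z^{\text{OLMoE}} = L_Z^{\text{OpenMoE}} / k$.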

Specific Question

In the MegaBlocks/OLMoE style implementation, the `scale_denominator` includes the $\text{Top-K}$ factor:

scale_denominator = num_total_moe_layers * T_layer * top_k
# ...
zloss_normalized = zloss_sum_squared / scale_denominator 

Why is the $\text{Top-K}$ factor ($k$) included in the Z-Loss denominator?

The Z-Loss regularizes the magnitude of the raw router logits, and its formula does not depend on how many experts ($k$) are selected per token. Including $k$ therefore seems to serve mainly to keep the loss magnitude consistent with the Load Balancing Loss ($L_{aux}$).
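
To make the point concrete, here is a minimal pure-Python sketch of the two scalings on toy logits. It is not the OLMoE or MegaBlocks code: the helper names and the `include_top_k` flag are hypothetical, and only `scale_denominator`, `num_total_moe_layers`, `T_layer`, `top_k`, and `zloss_sum_squared` mirror the snippet above. It assumes every layer sees the same number of tokens.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(x))) over one token's expert logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def zloss_sum_squared(router_logits):
    """Sum of squared LogSumExp over all layers and tokens.

    router_logits[layer][token] is a list of per-expert logits.
    """
    return sum(logsumexp(tok) ** 2 for layer in router_logits for tok in layer)


def zloss_normalized(router_logits, top_k, include_top_k):
    """Z-loss under either normalization (include_top_k is a hypothetical flag)."""
    num_total_moe_layers = len(router_logits)
    T_layer = len(router_logits[0])  # assumes equal token count per layer
    scale_denominator = num_total_moe_layers * T_layer
    if include_top_k:
        scale_denominator *= top_k
    return zloss_sum_squared(router_logits) / scale_denominator


# Toy input: 2 MoE layers, 3 tokens each, 4 experts.
router_logits = [
    [[0.1, -0.3, 1.2, 0.5], [2.0, 0.0, -1.0, 0.3], [0.0, 0.0, 0.0, 0.0]],
    [[1.5, 1.5, -0.5, 0.2], [-2.0, 0.4, 0.9, 0.1], [0.7, -0.7, 0.3, -0.3]],
]
top_k = 2

with_k = zloss_normalized(router_logits, top_k, include_top_k=True)
without_k = zloss_normalized(router_logits, top_k, include_top_k=False)

# The logits are identical in both cases; only the constant scale differs.
ratio = without_k / with_k
print(f"with k: {with_k:.6f}  without k: {without_k:.6f}  ratio: {ratio:.1f}")
```

In this sketch the ratio between the two values is exactly `top_k`, so for a fixed $k$ the extra factor behaves like a rescaling of the Z-Loss coefficient by $1/k$ rather than a change to what is being regularized.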

Could you provide any insight or documentation on the following:

  1. Reasoning: What was the primary motivation for including the $\text{Top-K}$ factor in the Z-Loss normalization? Was it for stability, better hyperparameter transfer, or maintaining loss parity with $L_{aux}$?
  2. Ablation Studies: Were any ablation studies performed to compare the model performance (e.g., perplexity, convergence speed) between Z-Loss scaled by $\frac{1}{N_{L} \cdot N_{T}}$ versus scaling by $\frac{1}{N_{L} \cdot N_{T} \cdot \mathbf{k}}$?

Thank you for your time and insights!
