Question on Router Z-Loss Normalization: Why Include Top-K in the Scaling Denominator?

Hi OLMoE team,

I am examining the implementation of the auxiliary MoE losses, specifically the **Router Z-Loss** ($L_Z$). I've noticed a significant difference in the normalization factor used in the OLMoE implementation compared to other common open-source implementations (e.g., OpenMoE/ColossalAI).

### Observed Implementations

The Z-Loss is generally defined as the squared LogSumExp of the router logits, summed over tokens: $L_Z \propto \sum_{t} \left(\log \sum_{e} e^{W_{t,e}}\right)^2$, where $W_{t,e}$ is the router logit of token $t$ for expert $e$.

| Implementation Style | Z-Loss Normalization | Code Reference |
| :--- | :--- | :--- |
| **MegaBlocks/OLMoE Style** | Normalized by $N_{Layers}^{\text{total}} \cdot N_{Tokens} \cdot \mathbf{k}$ | [Link to OLMoE Code](https://github.com/Muennighoff/megablocks/blob/olmoe/megablocks/layers/moe.py#L103) |
| **OpenMoE Style** | Normalized by $N_{Layers} \cdot N_{Tokens}$ (**excludes** $\mathbf{k}$) | [Link to OpenMoE Code](https://github.com/Orion-Zheng/ColossalAI/blob/my_openmoe/colossalai/moe/routers.py#L101) |
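To make the difference concrete, here is a minimal pure-Python sketch of the two normalizations as I understand them. It is not the code from either repository: the helper names (`logsumexp`, `zloss_sum_squared`, `zloss_normalized`) and the toy logits are hypothetical, and I am assuming the two styles differ only in the denominator.

```python
import math

def logsumexp(row):
    """Numerically stable log(sum(exp(x))) over one token's router logits."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))

def zloss_sum_squared(logits_per_layer):
    """Sum of squared LogSumExp over every token in every MoE layer."""
    return sum(logsumexp(row) ** 2 for layer in logits_per_layer for row in layer)

def zloss_normalized(logits_per_layer, top_k=None):
    """Normalize by layers * tokens, and additionally by top_k if it is given."""
    num_layers = len(logits_per_layer)
    num_tokens = len(logits_per_layer[0])
    denominator = num_layers * num_tokens
    if top_k is not None:
        denominator *= top_k  # the extra factor this issue asks about
    return zloss_sum_squared(logits_per_layer) / denominator

# Toy input (hypothetical values): 2 MoE layers, 3 tokens each, 4 experts.
logits = [
    [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0], [-0.5, 0.3, 0.1, 2.2]],
    [[0.0, 0.0, 0.0, 0.0], [3.0, -2.0, 0.5, 1.5], [0.2, 0.4, -0.6, 0.8]],
]
top_k = 2

without_k = zloss_normalized(logits)               # OpenMoE-style denominator
with_k = zloss_normalized(logits, top_k=top_k)     # MegaBlocks/OLMoE-style denominator
print(without_k, with_k, without_k / with_k)
```

Under that assumption, the two values differ by a constant factor of `top_k` for any input, since the numerator never depends on $k$.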

### Specific Question

In the MegaBlocks/OLMoE style implementation, the `scale_denominator` includes the $\text{Top-K}$ factor:

```python
scale_denominator = num_total_moe_layers * T_layer * top_k
# ...
zloss_normalized = zloss_sum_squared / scale_denominator 
```

**Why is the $\text{Top-K}$ factor ($k$) included in the Z-Loss denominator?**

Since the Z-Loss regularizes the magnitude of the raw router logits, and its formula does not depend on how many experts ($k$) are selected per token, my best guess is that $k$ is included for **loss balancing/magnitude consistency** with the Load Balancing Loss ($L_{aux}$).
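Written out, and assuming the two styles differ only in the denominator (I have not verified this beyond the linked lines), the relationship is a constant rescaling. Here $\ell$ indexes MoE layers and $\lambda_z$ is my own notation for the Z-Loss coefficient:

$$
L_Z^{\text{OLMoE}} = \frac{1}{N_{L} \cdot N_{T} \cdot k} \sum_{\ell, t} \left(\log \sum_{e} e^{W^{(\ell)}_{t,e}}\right)^2 = \frac{1}{k} \, L_Z^{\text{OpenMoE}}
\quad\Longrightarrow\quad
\lambda_z \, L_Z^{\text{OLMoE}} = \frac{\lambda_z}{k} \, L_Z^{\text{OpenMoE}}
$$

If that holds, including $k$ is equivalent to dividing the Z-Loss coefficient by $k$, so a coefficient tuned under one convention would not transfer directly to the other.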

Could you provide any insight or documentation on the following:

1.  **Reasoning:** What was the primary motivation for including the $\text{Top-K}$ factor in the Z-Loss normalization? Was it for stability, better hyperparameter transfer, or maintaining loss parity with $L_{aux}$?
2.  **Ablation Studies:** Were any ablation studies performed to compare the model performance (e.g., perplexity, convergence speed) between Z-Loss scaled by $\frac{1}{N_{L} \cdot N_{T}}$ versus scaling by $\frac{1}{N_{L} \cdot N_{T} \cdot \mathbf{k}}$?

Thank you for your time and insights!
