Clarification on the Purpose of Variance Compensation in Section 2.3.2 #33

@cavalier501

First of all, thank you for your excellent work and for open-sourcing the technical report and model checkpoints. I’ve been going through the paper and have a question regarding the Variance Alignment Design for Scalability section.

From my understanding, without variance compensation, the variances of the query (q), key (k), value (v), and outputs derived from the MLA module can grow with d_q, d_kv, and d_model. Even if the hidden states entering the first MLA layer have a fixed variance, this growth may accumulate across layers because of these dimension-dependent scaling effects.
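
To make the dimension dependence concrete, here is the standard variance calculation for a single linear projection. It is a generic illustration, not the formulation from the report: the symbols W, x, y, \sigma_w, \sigma_x, and d_in are assumptions introduced here, with d_in standing for whichever of d_q, d_kv, or d_model is the fan-in of the projection. The weights are assumed i.i.d. with zero mean and independent of the input.

```latex
y_i = \sum_{j=1}^{d_{\text{in}}} W_{ij}\, x_j,
\qquad
\mathbb{E}[W_{ij}] = 0,\quad
\operatorname{Var}(W_{ij}) = \sigma_w^2,\quad
\mathbb{E}[x_j^2] = \sigma_x^2
\\[4pt]
\operatorname{Var}(y_i)
  = \sum_{j=1}^{d_{\text{in}}} \mathbb{E}[W_{ij}^2]\,\mathbb{E}[x_j^2]
  = d_{\text{in}}\,\sigma_w^2\,\sigma_x^2
```

Under these assumptions, with a fixed \sigma_w the output variance grows linearly with d_in, which is the growth described in the paragraph above.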

As a result, when scaling up the model size, the output variance of the final layer becomes excessively large, which may destabilize training and degrade performance.

Therefore, the variance compensation mechanism is introduced to align and stabilize the output variance across layers, ensuring it remains bounded and independent of model scale. This promotes stable training and better scalability.
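
A minimal sketch of this reading, using only the Python standard library. It is not the method from Section 2.3.2: the function `output_variance`, the parameter `weight_std`, and the `1 / (weight_std * sqrt(d_in))` factor are hypothetical, chosen only to show an uncompensated projection whose variance grows with the input dimension and a rescaled one whose variance stays near 1.

```python
import math
import random
import statistics


def output_variance(d_in, weight_std=0.1, compensate=False, trials=2000, seed=0):
    """Empirical variance of one output unit y = sum_j w_j * x_j.

    x_j ~ N(0, 1) and w_j ~ N(0, weight_std^2), all drawn independently.
    """
    rng = random.Random(seed)
    # Hypothetical compensation factor (an assumption, not the paper's):
    # it cancels the d_in * weight_std^2 growth of the raw variance.
    scale = 1.0 / (weight_std * math.sqrt(d_in)) if compensate else 1.0
    outputs = []
    for _ in range(trials):
        y = sum(rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0) for _ in range(d_in))
        outputs.append(scale * y)
    return statistics.pvariance(outputs)


if __name__ == "__main__":
    for d in (16, 64, 256):
        raw = output_variance(d)
        fixed = output_variance(d, compensate=True)
        # The theoretical raw variance under these assumptions is d * weight_std^2.
        print(f"d_in={d:4d}  raw={raw:.3f}  theory={d * 0.1 ** 2:.3f}  compensated={fixed:.3f}")
```

If the interpretation is right, the raw column should track the theory column as `d_in` increases, while the compensated column should stay close to 1 regardless of width.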

Could you please confirm whether this interpretation is correct? Any clarification on the design rationale or the mathematical formulation behind the variance compensation would be greatly appreciated.

Thanks again for your insightful work!
Best regards
