First of all, thank you for your excellent work and for open-sourcing the technical report and model checkpoints. I’ve been going through the paper and have a question regarding the Variance Alignment Design for Scalability section.
From my understanding, without variance compensation, the variances of the query (q), key (k), value (v), and output produced by the MLA module can grow with d_q, d_kv, and d_model. Even if the input hidden states of the first MLA layer have a fixed variance, this variance may accumulate across layers due to the scaling effects of dimensionality.
As a result, when the model is scaled up, the output variance of the final layer can become excessively large, which may destabilize training and degrade performance.
Therefore, the variance compensation mechanism is introduced to align and stabilize the output variance across layers, ensuring it remains bounded and independent of model scale. This promotes stable training and better scalability.
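To make my reading concrete, here is a toy sketch of the effect I have in mind. It is not taken from the paper: the single linear projection, the fixed weight standard deviation of 0.02, and the `1 / (weight_std * sqrt(d_in))` rescaling are all my own assumptions, and `projection_variance` is a hypothetical helper name. With unit-variance inputs, the output variance of `y = sum_i w_i * x_i` should be roughly `d_in * weight_std**2`, so it grows linearly with width unless it is rescaled.

```python
import random
import statistics


def projection_variance(d_in, weight_std=0.02, compensate=False,
                        n_samples=1000, seed=0):
    """Empirical variance of y = sum_i w_i * x_i.

    x_i ~ N(0, 1) and w_i ~ N(0, weight_std^2), both resampled per draw.
    Assumption (mine, not the paper's): compensation multiplies y by
    1 / (weight_std * sqrt(d_in)) so that Var(y) is about 1 for any d_in.
    """
    rng = random.Random(seed)
    scale = 1.0 / (weight_std * d_in ** 0.5) if compensate else 1.0
    outputs = []
    for _ in range(n_samples):
        y = sum(rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0)
                for _ in range(d_in))
        outputs.append(scale * y)
    return statistics.pvariance(outputs)


if __name__ == "__main__":
    # Print: width, uncompensated variance, compensated variance.
    for d in (16, 64, 256):
        print(d,
              projection_variance(d),
              projection_variance(d, compensate=True))
```

In this toy setting the uncompensated variance scales with the width, while the rescaled output stays near 1 at every width. Is this the same kind of dimension-dependent growth that the compensation factors in the paper are meant to cancel?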
Could you please confirm whether this interpretation is correct? Any clarification on the design rationale or the mathematical formulation behind the variance compensation would be greatly appreciated.
Thanks again for your insightful work!
Best regards