Fix rotary transform deferred init and some other fixes by BlueCrescent · Pull Request #419 · Modalities/modalities

BlueCrescent · 2025-11-12T11:46:24Z

What does this PR do?

Mainly fixes a bug causing RotaryTransform to be initialized wrongly when using deferred initialization.
Previously, the inverse frequencies would be initialized to zero (or NaN when using deterministic algorithms).
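
A minimal sketch of how this failure mode can arise, assuming the inverse frequencies are computed only once in the constructor and stored as a buffer. The class `NaiveRotary` is hypothetical and is not the Modalities implementation. Under the meta device the computation only tracks shapes and dtypes, so materializing the module afterwards leaves the buffer with whatever the allocator returns.

```python
import torch
from torch import nn


class NaiveRotary(nn.Module):
    """Hypothetical module that computes its buffer only in __init__."""

    def __init__(self, dim_model: int, base_freq: int = 10_000):
        super().__init__()
        inv_freq = 1.0 / (base_freq ** (torch.arange(0, dim_model, 2).float() / dim_model))
        self.register_buffer("inv_freq", inv_freq)


eager = NaiveRotary(dim_model=8)

with torch.device("meta"):
    deferred = NaiveRotary(dim_model=8)

# The meta tensor has a shape and a dtype but no data.
print(deferred.inv_freq.is_meta, deferred.inv_freq.shape)

# to_empty() allocates real storage without initializing it, so the values of
# the formula are never computed on the target device.
deferred = deferred.to_empty(device="cpu")
print(eager.inv_freq)     # the intended frequencies
print(deferred.inv_freq)  # arbitrary contents
```

Nothing in this flow raises an error, which is why the wrong values can go unnoticed.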

General Changes

Fixed RotaryTransform bug.
Added a corresponding test for deferred initialization.
Removed some typos.
Added utility functions that were very helpful in debugging this.

Breaking Changes

None

Checklist before submitting final PR

My PR is minimal and addresses one issue in isolation
I have merged the latest version of the target branch into this feature branch
I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
I have run a sample config for model training
I have checked that all tests run through (python tests/tests.py)
I have updated the internal changelog (CHANGELOG_DEV.md)

Copilot

Pull Request Overview

This PR fixes a critical bug in the RotaryTransform class that caused incorrect initialization when using deferred initialization (meta device), where inverse frequencies would be initialized to zero or NaN. The PR also corrects several typos and adds debugging utilities.

Key changes:

Fixed RotaryTransform to properly handle deferred initialization by creating a reset_parameters method that respects the device context
Added a test to verify deferred initialization produces the same weights as eager initialization
Corrected attribute name from attention_config.attention_config to attention_config.qk_norm_config in CausalSelfAttention
Fixed multiple spelling errors: "stoppping" → "stopping" and "savig" → "saving"

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

File	Description
src/modalities/models/gpt2/gpt2_model.py	Fixed RotaryTransform deferred initialization bug and corrected CausalSelfAttention attribute name typo
tests/nn/model_initialization/test_deferred_initialization.py	Added comprehensive test to verify deferred vs eager initialization produces identical results
tests/utility.py	Added debug utilities: deterministic CUDA context manager and NaN detection hooks
src/modalities/gym.py	Fixed typo in parameter name: early_stoppping → early_stopping
src/modalities/checkpointing/checkpoint_saving_strategies.py	Fixed typo in parameter name across multiple methods
src/modalities/checkpointing/checkpoint_saving.py	Fixed typo in parameter name and docstring

BlueCrescent · 2025-11-26T12:15:34Z

Note: I checked all the other components going into the GPT2 model, and nothing else should be impacted by this bug. However, it would be good to keep this in mind when adding or changing models in the future, because neither our code nor PyTorch detects non-shape operations being performed on tensors on the meta device.
Testing with deterministic algorithms activated might be useful.
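
One way to run a test under deterministic algorithms is a small context manager that restores the previous setting afterwards. This is a sketch only: the name `deterministic_algorithms` is an assumption, and it is not the utility added in this PR.

```python
import contextlib
from typing import Iterator

import torch


@contextlib.contextmanager
def deterministic_algorithms(enabled: bool = True) -> Iterator[None]:
    """Temporarily toggle deterministic algorithms, restoring the previous state."""
    previous = torch.are_deterministic_algorithms_enabled()
    torch.use_deterministic_algorithms(enabled)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(previous)


with deterministic_algorithms():
    # With torch.utils.deterministic.fill_uninitialized_memory left at its
    # default, recent PyTorch versions fill uninitialized float memory with
    # NaN in this mode, which makes forgotten initializations visible.
    uninitialized = torch.empty(4)
    print(uninitialized)
```

Because uninitialized memory then has a known value instead of arbitrary contents, a comparison between eager and deferred initialization fails reliably rather than by chance.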

le1nux

Nice work and good catch of this bug!
I left a few small remarks and ideas.

le1nux · 2025-11-24T17:37:10Z

+
+        self.reset_parameters()
+
+    def reset_parameters(self):


Regarding reset_parameters(), I found this discussion interesting:
https://github.com/pytorch/torchtitan/blob/58fa181ed3543e19c1cff3014f1b61b919d38cd1/torchtitan/models/llama3/model/model.py#L414-L423

Note that init_weights() is not a PyTorch function; they call it in train.py, and it then recursively initialises all modules that contain weights.

For our architecture, it might be interesting to add something like this but with some "initializer" parameter. So that our initialization component just calls this on the top-level model and gives the chosen initialization method as parameter.
Forcing this method to exist would hopefully help prevent future modules or models from suffering from the same bug as the rotary transform did.
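
The idea could look roughly like the sketch below. All names here (`init_weights`, `Initializer`, `normal_init`) are hypothetical, and the code mirrors neither torchtitan's nor Modalities' actual API. It assumes that every module that directly owns parameters or buffers must provide `reset_parameters()`, and it fails loudly otherwise.

```python
from typing import Callable

import torch
from torch import nn

Initializer = Callable[[nn.Module], None]


def init_weights(model: nn.Module, initializer: Initializer) -> None:
    """Hypothetical top-level entry point that visits every submodule once."""
    for name, module in model.named_modules():
        owns_state = any(True for _ in module.parameters(recurse=False)) or any(
            True for _ in module.buffers(recurse=False)
        )
        if not owns_state:
            continue
        if not hasattr(module, "reset_parameters"):
            raise TypeError(
                f"Module '{name}' ({type(module).__name__}) owns tensors "
                "but does not define reset_parameters()"
            )
        module.reset_parameters()  # restore module-specific defaults first
        initializer(module)        # then apply the chosen scheme


def normal_init(module: nn.Module, std: float = 0.02) -> None:
    """Example initializer: N(0, std) for Linear weights, zeros for biases."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


model = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 2))
init_weights(model, normal_init)
```

The check on `owns_state` is the part that would have flagged a module whose buffer is only computed in its constructor.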

le1nux · 2025-11-24T19:44:26Z

+        self.reset_parameters()
+
+    def reset_parameters(self):
+        device = self.inv_freq.device if hasattr(self, "inv_freq") else None


A comment explaining the two cases would be great:

inv_freq exists and has a device attribute

inv_freq does not exist, in which case device is set to None

The impact of setting device to None below should also be documented.
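
The two cases can be illustrated with a sketch. `RotarySketch` is a hypothetical stand-in, not the actual RotaryTransform. It assumes that `device=None` makes the factory function fall back to the current default device, which inside a `torch.device("meta")` context is the meta device.

```python
import torch
from torch import nn


class RotarySketch(nn.Module):
    """Hypothetical stand-in for a rotary module with reset_parameters()."""

    def __init__(self, dim_model: int, base_freq: int = 10_000):
        super().__init__()
        self.dim_model = dim_model
        self.base_freq = base_freq
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Case 1 (call from __init__): the buffer does not exist yet, so
        # device=None lets torch.arange use the current default device.
        # Case 2 (call after materialization): reuse the buffer's device.
        device = self.inv_freq.device if hasattr(self, "inv_freq") else None
        exponents = torch.arange(0, self.dim_model, 2, device=device).float() / self.dim_model
        self.register_buffer("inv_freq", 1.0 / (self.base_freq ** exponents))


eager = RotarySketch(dim_model=8)

with torch.device("meta"):
    deferred = RotarySketch(dim_model=8)
deferred = deferred.to_empty(device="cpu")  # storage allocated, values undefined
deferred.reset_parameters()                 # values recomputed on the real device

print(torch.equal(eager.inv_freq, deferred.inv_freq))
```

Calling `reset_parameters()` from the constructor keeps eagerly built instances complete, while the second call after materialization repairs the deferred ones.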

le1nux · 2025-11-24T19:46:36Z

-        inv_freq = 1.0 / (base_freq ** (torch.arange(0, dim_model, 2).float() / dim_model))
+        self.base_freq = base_freq
+
+        self.reset_parameters()


do we have to call this explicitly in the constructor? Wouldn't our weight init routines call it?

We still want the constructor to build a valid and complete instance of the module. In particular for unit testing or checkpoint loading where this initialization component is not used.

le1nux · 2025-11-27T14:12:44Z

+    module_path: str | None,
+    target: torch.Tensor | list[torch.Tensor] | tuple[torch.Tensor, ...],
+    target_name: str,
+):


We are just logging here. Since we have so much output in the tests and also in runs, I would either rename it to has_nan() -> bool and return a flag indicating the presence of NaN, or raise an exception in addition to the logging.

I added a parameter to make this raise exceptions. Adding a return is not possible since it is a hook.

le1nux · 2025-11-27T14:15:45Z

+    _detect_nan(module, module_path, output, "output")
+
+
+def register_nan_hooks(model: torch.nn.Module):


Do we use this, debug_nan_hook and _detect_nan somewhere?

If not, I still think it can be helpful and we could keep it. I just would document it somewhere

I now turned these utility functions into components and also added them to one of the example configs.

le1nux · 2025-11-27T14:37:25Z

+        weight_init_type=WeightInitTypes.SCALED,
+        mean=0.0,
+        std=0.02,
+        num_layers=2,


should we read this from the model directly instead of hardcoding?

What do you mean by "from the model"?

le1nux · 2025-11-27T14:41:56Z

+    gpt2_model_eager = _apply_initialization(gpt2_model_eager)
+    with torch.device("meta"):
+        gpt2_model_deferred = _build_gpt2_model()
+    gpt2_model_deferred = _apply_initialization(gpt2_model_deferred)


should we add a check that all parameters and buffers are on meta device before init?

Good idea! I added a separate test checking that deferred init params are on meta first then on cuda device.
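
Such a check could be sketched as follows. The helper `all_tensors_on_meta` is hypothetical, the model is a toy stand-in for the GPT2 model, and the example materializes on CPU rather than CUDA so that it runs anywhere.

```python
import torch
from torch import nn


def all_tensors_on_meta(model: nn.Module) -> bool:
    """True if every parameter and buffer of the model is on the meta device."""
    tensors = list(model.parameters()) + list(model.buffers())
    return all(t.is_meta for t in tensors)


with torch.device("meta"):
    model = nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4))

was_meta_before_init = all_tensors_on_meta(model)

# Materialize on a real device and re-run each module's own initialization.
model = model.to_empty(device="cpu")
for module in model.modules():
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()

remaining = list(model.parameters()) + list(model.buffers())
any_meta_after_init = any(t.is_meta for t in remaining)
print(was_meta_before_init, any_meta_after_init)
```

Asserting the meta state first guards against a test that silently builds the model eagerly and therefore never exercises the deferred path.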

therealdavidos · 2025-11-27T16:04:25Z

where do we use these hooks? Could not find a reference in the code. Would be good to clarify how to use them.

I now turned these utility functions into components and also added them to one of the example configs.

le1nux

LGTM :) Also great to have these debugging components now! 👍

BlueCrescent added 4 commits November 12, 2025 12:33

test(initialization): Added test for deferred initialization.

86fa4b3

fix(initialization): RotaryTransform working correctly with our deferred initialization logic.

79ec823

feat: Added some utilities for debugging.

65b596d

refactor: typos

3aa07be

BlueCrescent requested review from Copilot and le1nux November 12, 2025 11:46

Copilot started reviewing on behalf of BlueCrescent November 12, 2025 11:46

Copilot finished reviewing on behalf of BlueCrescent November 12, 2025 11:47

Copilot AI reviewed Nov 12, 2025

Comment thread tests/utility.py Outdated

BlueCrescent added 4 commits November 12, 2025 12:58

refactor: Cleaned up debug_nan_hook code.

691f4d8

chore: Merge remote-tracking branch 'origin/main' into fix_rotary_transform_deferred_init

a041d20

chore: Merge remote-tracking branch 'origin/main' into fix_rotary_transform_deferred_init

b36f0c5

feat(debugging): Moved debugging utils from tests to src.

9131065

therealdavidos self-requested a review November 27, 2025 14:17

le1nux requested changes Nov 27, 2025

BlueCrescent added 2 commits November 27, 2025 16:51

feat(logging): Added a print_forward_hook for debugging module in- and outputs. Also added option to raise an exception in debug_nan_hook.

406ce8c

docs(model): Added additional comment to RotaryTransform.reset_parameters().

ae73bce

therealdavidos reviewed Nov 27, 2025

therealdavidos approved these changes Nov 27, 2025

BlueCrescent added 2 commits November 27, 2025 17:57

test(initialization): Added test that deferred init params are first on meta then on cuda device.

c8e6eea

feat: Turned debugging utilities into components.

af06349

Also used them in the PP example config.

le1nux approved these changes Dec 2, 2025

BlueCrescent merged commit 9cbb02b into main Dec 2, 2025
3 checks passed

BlueCrescent deleted the fix_rotary_transform_deferred_init branch December 2, 2025 14:03
