Conversation
fsdp_enabled (bool): Whether the model is an FSDP model.

Returns:
    Union[torch.Tensor, None]: The total gradient norm before clipping for the 'norm' clipping type,
Let's just always return the result, not just for 'norm'.
It's a weird contract for a separate downstream function to have to know about this behavior.
The downstream function always guards by checking whether the clipping_type is 'norm', so that should be good enough.
I thought about it, but the other two options don't have top-level scalar values that can be returned:
- value - returns None (see https://github.com/pytorch/pytorch/blob/v2.7.0/torch/nn/utils/clip_grad.py#L250)
- adaptive - is an elementwise op, so there isn't a clear value to return
That's why I stuck with returning None for those. I agree that the contract is awkward, but we need to propagate the norm to the logger.
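The return-contract difference discussed here can be seen directly in PyTorch's gradient clipping utilities (a minimal sketch; the tensor values are illustrative, but the return behavior matches torch 2.x):

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

# A tiny parameter with a known gradient.
p = torch.nn.Parameter(torch.zeros(3))
p.grad = torch.tensor([3.0, 4.0, 0.0])  # L2 norm = 5.0

# 'norm' clipping clips in place AND returns the pre-clip total norm,
# so there is a scalar available to propagate to a logger.
total_norm = clip_grad_norm_([p], max_norm=1.0)
print(float(total_norm))  # 5.0

# 'value' clipping also modifies gradients in place, but returns None,
# so there is no top-level scalar to return.
result = clip_grad_value_([p], clip_value=0.5)
print(result)  # None
```

This is why only the 'norm' clipping path can surface a value to log without an API change to the other clipping types.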
oh lol, why do they modify in place for value and not for norm 😆 😓
Approved; not ideal, but it makes sense that this is the best we can do.
Context
Currently `torch.nn.utils.clip_grad_norm_` and `FSDP.clip_grad_norm_` apply gradient clipping in place but also return the pre-clip gradient norm; however, this value is not captured or logged anywhere. We can't change the API for all gradient clipping methods, since some don't have a top-level scalar to return, but we can for gradient norm clipping, the most frequent one we use.
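Concretely, capturing the otherwise-discarded return value could look like this (a hypothetical helper; `clip_gradients` and its signature are illustrative, not the PR's actual code):

```python
import torch

def clip_gradients(model: torch.nn.Module, clipping_threshold: float):
    # clip_grad_norm_ clips in place AND returns the pre-clip total norm,
    # which previously was simply dropped instead of being logged.
    total_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=clipping_threshold
    )
    return total_norm

model = torch.nn.Linear(2, 2)
loss = model(torch.randn(4, 2)).sum()
loss.backward()
norm = clip_gradients(model, clipping_threshold=1.0)
print(float(norm))  # pre-clip gradient norm, ready to be logged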
This PR propagates the value out of the helper and into the algorithm, where it can be logged. Since clipping is fairly bursty, we also compute a rolling window over `clipping_frequency_window` samples to provide a more parseable metric.

Known caveats

- `_clipping_history` is not persisted, so the metric will change slightly upon resumption
- `clipping_threshold: 100`

Experiments
A couple of example experiments that showcase the functionality with SFT and GRPO are `2025-07-21-debug-gradient-clipping` and `2025-07-25-math-rlvr-grpo` respectively.
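The rolling clipping-frequency metric described above could be sketched roughly as follows (a minimal sketch; `ClippingFrequencyTracker` and its method names are hypothetical, and the in-memory deque mirrors the `_clipping_history` caveat of not being persisted across resumptions):

```python
from collections import deque

class ClippingFrequencyTracker:
    """Hypothetical sketch of a rolling clipping-frequency metric.

    Records 1 when the pre-clip norm exceeded the threshold (clipping fired)
    and 0 otherwise, then reports the mean over the last `window` samples
    to smooth out bursty clipping events.
    """

    def __init__(self, window: int = 100):
        # Not persisted in any checkpoint, so the metric resets on resumption.
        self._history = deque(maxlen=window)

    def update(self, total_norm: float, clipping_threshold: float) -> float:
        self._history.append(1.0 if total_norm > clipping_threshold else 0.0)
        return sum(self._history) / len(self._history)

tracker = ClippingFrequencyTracker(window=4)
for norm in [0.5, 2.0, 3.0, 0.1]:
    freq = tracker.update(norm, clipping_threshold=1.0)
print(freq)  # 0.5: two of the last four steps clipped
```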