
feat(pt): optimize HybridMuon by borrowing some ideas from deepseek v4 paper #5424

Open
OutisLi wants to merge 1 commit into deepmodeling:master from OutisLi:pr/muon

Conversation

@OutisLi
Collaborator

@OutisLi OutisLi commented Apr 27, 2026

Summary by CodeRabbit

Release Notes

  • Optimizer Configuration Updates
    • Default learning rate adjustment coefficient optimized (0.2 → 0.18) for improved convergence
    • Advanced optimization mode now enabled by default
    • Enhanced numerical stability in optimizer epsilon handling
    • Simplified Gram path configuration for distributed training environments
    • Integrated PyTorch Dynamo cache sizing optimizations for better execution performance

Copilot AI review requested due to automatic review settings April 27, 2026 05:53
@dosubot dosubot Bot added the enhancement label Apr 27, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the PyTorch HybridMuon optimizer configuration and implementation to align with reported DeepSeek‑V4 calibration choices (two-stage Newton–Schulz schedule and match-RMS coefficient), and adjusts related training/config plumbing.

Changes:

  • Change HybridMuon match-RMS scaling default (lr_adjust_coeff) from 0.2 to 0.18 and update documentation accordingly (a small sketch of this scaling rule follows this list).
  • Remove the training-time auto-disable of enable_gram in distributed mode.
  • Update HybridMuon’s orthogonalization and Adam epsilon behavior (two-stage Newton–Schulz; ADAM_EPS=1e-20) and introduce a Dynamo cache size tweak.
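
For orientation, here is a minimal sketch of how a match-RMS learning-rate adjustment of this kind is typically applied. The coeff * sqrt(max(fan_out, fan_in)) rule and the helper name are assumptions for illustration, not the exact HybridMuon code:

import math

import torch


def adjust_lr_match_rms(lr: float, param: torch.Tensor, coeff: float = 0.18) -> float:
    # Hypothetical helper: scale the Muon learning rate so the RMS of the
    # orthogonalized update roughly matches an Adam-style update. The
    # coeff * sqrt(max(fan_out, fan_in)) form follows the commonly used Muon
    # recipe; whether HybridMuon applies exactly this rule is an assumption.
    fan_out = param.shape[0]
    fan_in = param.numel() // fan_out
    return lr * coeff * math.sqrt(max(fan_out, fan_in))


# Example: a 128x64 weight with base lr 1e-3 gets an effective lr of ~2.04e-3.
print(adjust_lr_match_rms(1e-3, torch.empty(128, 64)))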

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
  • deepmd/utils/argcheck.py: Updates HybridMuon optimizer defaults/docs (notably lr_adjust_coeff) and removes the distributed-disable wording for enable_gram.
  • deepmd/pt/train/training.py: Changes HybridMuon optimizer construction to pass enable_gram directly (no longer forced off under distributed training); a small before/after sketch follows this list.
  • deepmd/pt/optimizer/hybrid_muon.py: Implements the DeepSeek‑style two-stage Newton–Schulz schedule, updates defaults/docs, changes Adam epsilon handling, and adjusts Dynamo cache behavior.
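
To make the training.py change concrete, a small illustrative before/after of the enable_gram plumbing; the opt_param dict and the old distributed-disable condition are reconstructed from the descriptions above, not copied from the source:

import torch.distributed as dist

opt_param = {"enable_gram": True}  # illustrative optimizer configuration

# Previous behavior (as described above): the compiled Gram path was forced
# off whenever distributed training was initialized.
enable_gram_old = opt_param.get("enable_gram", True) and not (
    dist.is_available() and dist.is_initialized()
)

# Behavior after this PR: the flag is taken directly from the user
# configuration and passed to the HybridMuon optimizer unchanged.
enable_gram_new = opt_param.get("enable_gram", True)

print(enable_gram_old, enable_gram_new)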


Comment threads (4): deepmd/pt/optimizer/hybrid_muon.py
@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

The pull request updates the HybridMuon optimizer with a two-stage Newton-Schulz orthogonalization schedule (fast and polish phases), introduces distinct epsilon values for Adam operations, adjusts default parameters, and integrates PyTorch Dynamo cache sizing. It also simplifies distributed training logic for the enable_gram flag.
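
As context for the "distinct epsilon values" point, a minimal sketch of where a dedicated ADAM_EPS=1e-20 would enter the Adam denominator; the state names and the second, larger NS epsilon are illustrative, not the exact HybridMuon internals:

import torch

ADAM_EPS = 1e-20  # value reported in this PR: tiny, so it only guards against 0/0
NS_EPS = 1e-7     # illustrative: a separate, larger eps for Newton-Schulz norm clamping

exp_avg = torch.randn(4, 4)    # first-moment estimate (illustrative state)
exp_avg_sq = torch.rand(4, 4)  # second-moment estimate (illustrative state)
lr = 1e-3

# Adam-style step: ADAM_EPS barely perturbs well-conditioned entries but still
# prevents division by zero when exp_avg_sq underflows to exactly zero.
update = lr * exp_avg / (exp_avg_sq.sqrt() + ADAM_EPS)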

Changes

  • HybridMuon Optimizer Implementation (deepmd/pt/optimizer/hybrid_muon.py): Replaces the single-stage Newton-Schulz quintic iteration with a two-stage hybrid schedule (fast + polish phases) and introduces the NS_STEPS_FAST, NS_COEFF_FAST, NS_STEPS_POLISH, and NS_COEFF_POLISH coefficients; a hedged sketch of the idea follows this list. Adds a distinct ADAM_EPS=1e-20 for Adam denominator computations. Updates the default lr_adjust_coeff from 0.2 to 0.18 and magma_muon from False to True. Adds a PyTorch Dynamo cache sizing adjustment.
  • Configuration & Documentation Updates (deepmd/utils/argcheck.py): Updates the lr_adjust_coeff default value and help text to reflect the new 0.18 default. Removes the statement that compiled Gram Newton-Schulz is disabled during distributed training.
  • Training Integration (deepmd/pt/train/training.py): Simplifies the enable_gram configuration logic to always derive directly from opt_param instead of applying a conditional disable for distributed training scenarios.
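
A hedged sketch of what a two-stage (fast + polish) Newton-Schulz orthogonalization can look like; the step counts and quintic/cubic coefficients below are placeholders for illustration, not the NS_STEPS_FAST/NS_COEFF_FAST/NS_STEPS_POLISH/NS_COEFF_POLISH values used in hybrid_muon.py:

import torch

# Placeholder schedule: aggressive quintic coefficients first, then a short
# conservative polish phase to tighten the singular values around 1.
NS_STEPS_FAST, NS_COEFF_FAST = 3, (3.4445, -4.7750, 2.0315)
NS_STEPS_POLISH, NS_COEFF_POLISH = 2, (1.5, -0.5, 0.0)
NS_EPS = 1e-7  # norm clamp, illustrative


def _ns_iterations(X: torch.Tensor, steps: int, coeffs: tuple) -> torch.Tensor:
    a, b, c = coeffs
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X


def orthogonalize_two_stage(G: torch.Tensor) -> torch.Tensor:
    # Approximate the polar factor U V^T of G: fast phase, then polish phase.
    X = G / G.norm().clamp_min(NS_EPS)
    transposed = X.shape[-2] > X.shape[-1]
    if transposed:  # iterate on the wide orientation so X @ X^T stays small
        X = X.mT
    X = _ns_iterations(X, NS_STEPS_FAST, NS_COEFF_FAST)
    X = _ns_iterations(X, NS_STEPS_POLISH, NS_COEFF_POLISH)
    return X.mT if transposed else X


# Example: after orthogonalization the rows are roughly orthonormal.
O = orthogonalize_two_stage(torch.randn(8, 32))
print((O @ O.mT).diagonal())  # entries roughly near 1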

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • PR #5412: Modifies the same HybridMuon optimizer internals, including enable_gram handling and magma_muon default behavior
  • PR #5130: Updates Muon-family optimizers' Newton-Schulz orthogonalization, iteration logic, and optimizer parameter defaults
  • PR #5149: Refactors hybrid_muon.py and introduces Newton-Schulz schedule and epsilon handling constants used in this PR

Suggested labels

enhancement, Python

Suggested reviewers

  • wanghan-iapcm
  • njzjz
  • iProzd
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 62.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title accurately describes the main change (optimizing HybridMuon with ideas from DeepSeek V4) and aligns with the detailed changes across three files: a two-stage hybrid schedule, Adam epsilon separation, and updated parameter defaults.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.



Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
deepmd/pt/optimizer/hybrid_muon.py (1)

120-129: Avoid mutating torch._dynamo.config.cache_size_limit at module import time.

Importing this module unconditionally bumps a process-global PyTorch Dynamo setting, even for callers that never instantiate HybridMuonOptimizer (e.g., serving / inference entry points that merely import the optimizer registry). The cache-size bump is only needed for _GramNewtonSchulzOrthogonalizer._compiled_call (the sole torch.compile site in this file), so the side effect should be scoped to where it's actually required.

♻️ Move the bump into `_GramNewtonSchulzOrthogonalizer.__init__`
 import torch
-import torch._dynamo.config as _dynamo_config
 from torch.optim.optimizer import (
     Optimizer,
 )
 
 DYNAMO_CACHE_SIZE_LIMIT = 64
-_dynamo_config.cache_size_limit = max(
-    int(_dynamo_config.cache_size_limit),
-    DYNAMO_CACHE_SIZE_LIMIT,
-)
     def __init__(self) -> None:
+        # Gram NS compiles per-shape; bump Dynamo's cache budget locally so the
+        # repeated recompilation across MLIP parameter shapes doesn't spill,
+        # without polluting global state for unrelated torch.compile users.
+        import torch._dynamo.config as _dynamo_config
+
+        _dynamo_config.cache_size_limit = max(
+            int(_dynamo_config.cache_size_limit),
+            DYNAMO_CACHE_SIZE_LIMIT,
+        )
         # Gram path uses NS_EPS (same numerical role as Standard NS norm clamp).
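
As a purely illustrative variant of the same idea (not part of this PR or the diff above), the bump could also live in a tiny idempotent helper that the Gram path calls right before building its compiled function:

def _ensure_dynamo_cache_budget(minimum: int = 64) -> None:
    # Hypothetical helper: lazily raise Dynamo's cache budget; a no-op when
    # torch._dynamo is unavailable or the limit is already large enough.
    try:
        import torch._dynamo.config as dynamo_config
    except ImportError:
        return
    dynamo_config.cache_size_limit = max(int(dynamo_config.cache_size_limit), minimum)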

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7a925eec-d673-4562-93b0-abf4226eb0cf

📥 Commits

Reviewing files that changed from the base of the PR and between 9d63816 and 06b9b59.

📒 Files selected for processing (3)
  • deepmd/pt/optimizer/hybrid_muon.py
  • deepmd/pt/train/training.py
  • deepmd/utils/argcheck.py

@codecov

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.42%. Comparing base (9d63816) to head (06b9b59).

Files with missing lines:
  • deepmd/pt/optimizer/hybrid_muon.py: 75.00% patch coverage, 11 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5424      +/-   ##
==========================================
+ Coverage   82.39%   82.42%   +0.03%     
==========================================
  Files         824      824              
  Lines       87395    87418      +23     
  Branches     4197     4197              
==========================================
+ Hits        72009    72055      +46     
+ Misses      14111    14088      -23     
  Partials     1275     1275              

☔ View full report in Codecov by Sentry.
