JiwenJ/Awesome-Optimizers

Awesome Optimizers List

Awesome GitHub License: MIT Papers Code Links Coverage PRs Welcome

Curated optimizer-design papers from 2022 onward, listed newest first by exact publication or submission date.

  • CSV: data/optimizers.csv
  • Weakly related Muon papers: data/muon_weakly_related.csv
  • Date source: arXiv published date for arXiv papers; source-page submission date for OpenReview/blog entries.
  • Note: SRON is a legacy unsourced row without a paper URL, so its date remains month-level until the source paper is verified.
Date Optimizer Name Summary
2026-06-16 MGUP Code Selects a fixed fraction of parameters for larger momentum-gradient-aligned steps while giving the rest smaller nonzero updates; works as a plug-in wrapper for AdamW, Lion, and Muon with convergence guarantees and LLM training experiments.
2026-06-15 Hyperball Optimization Constrains both weight-matrix and optimizer-update Frobenius norms to fixed constants around a base optimizer such as Adam or Muon; reports 20-30% token-equivalent speedups on Qwen3-style models up to 1.2B and better learning-rate transfer.
2026-06-15 CacheMuon Caches temporally correlated polar-factor information from previous steps to precondition Muon's Newton-Schulz computation, reducing repeated orthogonalization cost while exposing a quality-efficiency trade-off.
2026-06-12 Free Heavy-Tailed Lunch for Muon Gives heavy-tailed nonconvex theory showing matrix-valued non-Euclidean optimizers such as Muon and Scion can avoid dimension-dependent costs and achieve stronger stationarity guarantees than Euclidean methods.
2026-06-11 Muon^p Interpolates between gradient descent and Muon by using fractional spectral-power updates U S^p V^T; derives practical low-degree bivariate matrix recurrences because fixed univariate polynomial iterations cannot compute the required powers.
2026-06-11 LoRA-Muon Derives a Muon-style spectral steepest-descent rule on the low-rank LoRA manifold, paired with split weight decay, to reduce initialization and stepsize sensitivity and improve rank/width/depth learning-rate transfer.
2026-06-09 FOGO Frames forgetting as step-level gradient interference, then spectrally orthogonalizes momentum updates and keeps a compact codebook of past directions so dominant minibatch gradients do not erase rare useful directions.
2026-06-08 Muon Robust Transfer Evaluates pretrained models under corrupted image/text shifts and layer-wise probes, finding Muon-trained features more robust and transferable than Adam or SGD features across transformer and CNN settings.
2026-06-07 OptMuon Combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm coefficient schedule, turning orthogonalized momentum into a closed-loop method that calibrates update magnitudes from observed optimization history.
2026-06-07 Muon Spectral Dynamics Analyzes Muon's polar update as a flat-spectrum, entropy-maximizing bias under alignment assumptions, deriving singular-value dynamics and showing how the update geometry changes noise behavior rather than merely rescaling gradients.
2026-06-03 Why Muon Outperforms Adam: A Curvature Perspective Uses second-order Taylor decomposition to show Muon and Adam have similar first-order gain at matched validation loss, while Muon pays a smaller curvature penalty through lower normalized directional sharpness.
2026-06-02 Spectral Scaling Laws of Muon Measures singular-value quantiles of Muon momentum matrices across model sizes and layers, identifying which directions finite Newton-Schulz misses and turning the spectra into layer-aware orthogonalization guidance.
2026-06-02 Denoise First, Orthogonalize Later Models Muon momentum as a spectral filter that suppresses perturbation modes before orthogonalization, increasing the signal-perturbation gap and stabilizing the singular subspaces passed into Newton-Schulz/polar updates.
2026-06-01 A Note on Stability for Orthogonalized Matrix Momentum Proves finite-round generalization bounds for client-sampled distributed optimization with orthogonalized matrix momentum, explicitly tracking heterogeneous client sampling and finite-step Newton-Schulz effects.
2026-05-29 How Much Orthogonalization Does Muon Need? Studies relaxed low-precision Newton-Schulz schedules for Muon and shows cheaper partial orthogonalization can preserve training behavior, separating the need for useful spectral shaping from exact polar accuracy.
2026-05-29 Softsign / SoftMuon Introduces SoftSignum, replacing hard sign updates with a temperature-controlled soft-sign map so updates can move between sign-like and magnitude-sensitive behavior; extends the same relaxation to matrix optimizers as SoftMuon with quantile temperature scheduling.
2026-05-26 Entry-Wise Clipping for Muon Models language-model gradient noise as entry-wise heavy-tailed contamination, derives an entry-wise clipping surrogate that can control spectral noise, and positions it as a cheaper structural alternative or complement to Muon-style spectral normalization.
2026-05-26 Spectral Descent (SD/TSD) Studies simplified Muon-like Spectral Descent and Truncated Spectral Descent under non-smooth convex objectives, proving linear convergence under sharpness and connecting regularized variants to Frank-Wolfe-style sublinear guarantees.
2026-05-26 Muon Adversarial Training Tests whether Muon-style orthogonalized matrix updates improve adversarial training, deriving a spectral-norm stability ceiling and evaluating robustness across architectures and heterogeneous threat models.
2026-05-26 MONA Adds a Nesterov-like acceleration term from an EMA of gradient differences directly into Muon before orthogonalization, with convergence analysis arguing it helps escape sharp minima while preserving Muon spectral regularization.
2026-05-26 MuCon Replaces Muon's polar direction, which maps all singular values to one, with singular-value-clipped updates under a SpectralP scaling recipe, using a spectral-norm-ball projection to retain controlled magnitude information.
2026-05-25 EMA-Nesterov Reinterprets Nesterov acceleration as trajectory extrapolation and replaces noisy one-step lookahead with an EMA of parameter updates, yielding a wrapper that stabilizes acceleration for stochastic nonconvex deep-learning training.
2026-05-23 Muon in Vision Transformers Benchmarks Muon against AdamW for ViT training on ImageNet-100 and Pl@ntNet-300K, showing recipe-dependent gains and linking heavy augmentation to healthier matrix-gradient spectra and less late-training mode collapse.
2026-05-22 Regularized Muon Flow Interprets smoothed Muon orthogonalization as the gradient of a Fenchel-dual nuclear-norm smoothing, recasting Muon as a mirror/prox update and deriving Hamiltonian probability-gradient-flow dynamics for mean-field training views.
2026-05-21 AMUSE Explains Muon through a river-valley loss-landscape picture where orthogonalization speeds flat-direction progress but amplifies dominant-direction oscillations; combines Muon with schedule-free iterate averaging for anytime stable gradient evaluation.
2026-05-21 Layerwise Learning Rates (LLR) Uses Heavy-Tailed Self-Regularization estimates from layer weight spectra to assign larger learning rates to weakly heavy-tailed layers and smaller rates to strongly heavy-tailed layers, reporting faster AdamW and Muon LLM pretraining.
2026-05-19 LionMuon Alternates cheap Lion/sign-style steps with expensive Muon spectral steps using a shared dual-EMA buffer, matching Lion memory while reducing average cost and reporting Pareto gains over Muon, Lion, Signum, and AdamW.
2026-05-19 Schatten-p Adaptive Optimization Derives a data-driven criterion from gradient and activation statistics to choose layer-wise Schatten-p LMO geometries, interpolating between SGD, Muon, Adam, and MuAdam-style updates instead of fixing the optimizer geometry.
2026-05-19 MiMuon Analyzes Muon generalization via algorithmic stability, then mixes Muon orthogonalized gradients with momentum SGD to improve generalization while retaining fast matrix-aware convergence for large models.
2026-05-19 High-Pass Pion Identifies failure modes of full Muon whitening in VLA training and RLVR, where low-rank or low-SNR gradients make tail amplification harmful; proposes Pion as a high-pass spectral filter that promotes useful directions while suppressing noise.
2026-05-18 Distance-Aware Muon Develops adaptive step-scaling rules for normalized Muon directions, using trajectory distance, scale calibration, or descent certificates to set trust-region radii and reduce manual global step-size tuning.
2026-05-18 Ringmaster LMO Extends Muon/LMO-style momentum to heterogeneous distributed training by allowing asynchronous delayed gradients, aiming to avoid straggler bottlenecks while preserving LMO update geometry.
2026-05-18 Symmetry-Compatible Optimizers Assigns row-norm, spectral, or hybrid optimizer geometries according to parameter symmetry groups, extending Muon-style equivariance so each block receives an update compatible with its transformation structure.
2026-05-18 AMO Adapts how often each matrix is orthogonalized by estimating Newton-Schulz difficulty from layer geometry, spending Muon computation where it matters and reducing unnecessary matrix-multiply cost elsewhere.
2026-05-16 DynMuon Generalizes Muon from the polar factor U V^T to dynamic spectral shaping U Sigma^p V^T, tuning the exponent during training so updates can vary between gradient-preserving and spectrum-flattening behavior.
2026-05-13 Muon Spectral Flattening Analyzes Muon as a spectral-flattening operation that enlarges stable learning-rate ranges and accelerates convergence by redistributing update energy across singular directions rather than following raw gradient magnitudes.
2026-05-13 DP-Muon Combines differentially private per-example clipping and Gaussian noise with momentum and Newton-Schulz orthogonalization, adapting Muon to private training while studying privacy, utility, and spectral-update stability.
2026-05-12 Spectral Preconditioning Formulates constrained stochastic spectral preconditioning as a proximal extension of Muon and Scion, giving theory for heavy-tailed noise and matrix-norm geometries where spectral updates can outperform Euclidean ones.
2026-05-12 Pion Uses non-additive left and right orthogonal transforms to alter Muon-style update spectra while preserving singular values, separating directional rotation from full polar flattening in matrix optimization.
2026-05-12 MuonQ Code Compresses Muon optimizer states to 4-bit precision by optimizing directional fidelity, targeting memory reduction while preserving the orthogonalized update directions needed for training quality.
2026-05-11 Error Whitening Frames Gauss-Newton improvements as whitening prediction errors in function space, comparing the induced dynamics with Newton, Adam, and Muon to explain when curvature-aware whitening can help.
2026-05-11 Freon/Kaon Tests Muon-like optimizers with Schatten and randomized spectra, arguing that update alignment and descent potential can matter more than exactly matching a particular matrix geometry.
2026-05-11 Muon Fine-tuning Transfer Studies optimizer mismatch when switching Adam-pretrained models to Muon for fine-tuning, showing forgetting and instability correlate with update strength and proposing transfer procedures for safer Muon fine-tuning.
2026-05-11 Optimizer-Induced Mode Connectivity Shows same-optimizer solution sets can be connected while AdamW and Muon regions may be separated by loss barriers, using theory and GPT-2 pretraining paths to expose optimizer-dependent implicit regularization.
2026-05-11 Muown Diagnoses Muon spectral-norm drift as row-magnitude growth rather than row-coherence change, then treats row magnitudes as explicit optimizer variables while applying Muon to the remaining direction component.
2026-05-11 SODA Unifies optimizers including Muon, Lion, AdEMAMix, and NAdam as optimistic dual-averaging methods, then proposes a SODA wrapper with a 1/k weight-decay schedule to reduce weight-decay tuning.
2026-05-10 Muon Phase Analysis Derives deterministic dynamics for stochastic SignSVD/Muon-like spectral optimizers on matrix least squares, identifying batch-size phases where Muon preconditions covariance spectra and when it degenerates toward SGD-like behavior.
2026-05-10 Dimension-Free Muon Escape Analyzes Muon saddle-point escape in high-dimensional landscapes and proves its nonlinear spectral shaping can avoid dimension-dependent trapping that affects element-wise adaptive optimizers such as AdamW.
2026-05-10 Intrinsic Muon Lifts Muon-style linear minimization oracles to Riemannian matrix manifolds by defining intrinsic tangent-space norms from the metric, preserving quotient symmetries for low-rank, orthogonal, and SPD parameters.
2026-05-09 ZO Partial Orthogonalization Adapts spectral optimization to zeroth-order LLM fine-tuning, showing full orthogonalization is too noisy and proposing power-iteration-based partial orthogonalization to exploit weak spectral directions safely.
2026-05-09 Muon Non-Convergence Proves Muon fails to converge on convex Lipschitz functions for any learning-rate schedule, then shows error feedback restores theoretical convergence even though it can hurt empirical image-classification performance.
2026-05-09 Muon-OGD Builds a continual-learning projection method using Muon-style spectral-norm geometry, replacing Frobenius orthogonal-gradient projection with spectral-aware updates for matrix-valued LLM parameters.
2026-05-09 Group Muon Compares full-matrix, head-wise, and grouped Muon for attention projections, deriving a trade-off between group-wise whitening gain and grouping-induced norm cost and tuning group size as an optimizer hyperparameter.
2026-05-08 PolarAdamW Applies Muon's Newton-Schulz polar map to AdamW-preconditioned directions, separating polar spectral control from Schur gauge-equivariance and testing the hybrid on DeiT-Tiny vision training.
2026-05-08 OrScale-LM Adds a layer-wise trust ratio to Muon using the Frobenius norm of the actual parameter-space update direction, avoiding failure modes of naive Muon-LAMB hybrids and calibrating language-model layers at initialization.
2026-05-07 Orth-Dion Shows Dion's column-normalized low-rank approximation misses the rank-r polar factor targeted by Muon, then orthogonalizes the compressed factor to remove geometric mismatch in distributed spectral optimization.
2026-05-07 Nesterov Muon Develops convergence theory for practical Muon with Nesterov momentum, heavy-tailed stochastic gradients, and inexact/randomized polar decomposition, quantifying how approximation errors propagate.
2026-05-07 SignSGD/Muon Lower Bounds Code Uses l1 stationarity, l-infinity smoothness, and separable noise assumptions to derive matching lower and upper bounds explaining when sign-based methods such as SignSGD and Muon can beat SGD.
2026-05-07 Pro-KLShampoo Observes spike-and-flat spectra in KL-Shampoo Kronecker preconditioners and projects the structured preconditioned direction through orthogonalization, recovering whitening in a Muon-adjacent optimizer.
2026-05-07 Implicit Gradient Transport Introduces LMO-IGT, using implicit gradient transport to accelerate LMO-based optimizers like Lion and Muon without extra gradient evaluations, alongside a regularized support-function stationarity measure.
2026-05-05 Aurora Code Diagnoses Muon row-leverage anisotropy on tall matrices as a cause of dead MLP neurons, then adds leverage-aware row normalization/equilibration so rectangular-matrix updates keep useful row mass while preserving polar precision.
2026-05-05 Nora Projects row-wise momentum onto the orthogonal complement of weights to respect scale invariance, aiming to combine Muon-like matrix preconditioning, stable norm/angular dynamics, and O(mn) update cost.
2026-05-04 SignMuon Combines Muon polar directions with signSGD-style 1-bit majority-vote communication: workers orthogonalize local momentum, transmit entrywise signs, and optionally apply a local polar correction for bandwidth-efficient distributed training.
2026-04-27 SUDA-Muon Shows decentralized Muon is hard because matrix-sign orthogonalization does not commute with gossip averaging, then separates primal-dual communication from the nonlinear Muon step using a SUDA template with convergence guarantees.
2026-04-16 CLion Adds cautious-update masking to Lion and studies generalization via algorithmic stability, giving theory for Lion-style sign momentum and empirical evidence that the cautious variant improves robustness/generalization.
2026-04-12 Federated Gluon Adapts Gluon/Muon-style LMO optimization to federated learning with unbiased or contraction compressors plus SARAH-style variance reduction, targeting communication-efficient non-Euclidean training.
2026-04-11 Muon^2 Applies Adam-style second-moment preconditioning before Muon orthogonalization so the Newton-Schulz polar approximation starts from a better-conditioned matrix, improving both update quality and iteration efficiency.
2026-04-10 APT for MTL Studies why multi-task-learning gradient balancing can be weakened by advanced optimizer momentum, then adjusts the interaction with optimizers such as Muon so task de-conflicting affects the actual update direction.
2026-04-09 Adam-HNAG Reformulates full-batch Adam through variable/operator splitting and curvature-aware gradient correction, yielding Adam-HNAG flows and discrete variants with Lyapunov-based accelerated convergence guarantees.
2026-04-06 Muon Spectral Wasserstein Flow Derives continuous-time mean-field dynamics for normalized matrix flows, defining Spectral Wasserstein distances where the operator-norm case captures Muon geometry and Schatten norms interpolate with classical W2.
2026-04-06 Muon-Accelerated Tensor GLM Applies Muon-style orthogonalized acceleration to low-separation-rank tensor generalized linear models, preserving tensor structure while improving the block-coordinate estimation used in LSR tensor regression.
2026-04-05 SIFT/Subspace Control Frames constrained model steering as spectral subspace-control optimization, using subspace orthogonalization to reduce interference between the primary objective and safety/privacy/task constraints.
2026-04-01 Newton-Muon Derives an optimizer from a quadratic surrogate involving the gradient, output-space curvature, and input data matrix, adding input-side Newton preconditioning to Muon-style orthogonalized updates.
2026-03-30 HyperP Builds hypersphere parameterization for Muon-style Frobenius-norm-constrained training, transferring optimal learning rates across width, depth, token budget, and MoE granularity under a fixed-norm parameterization.
2026-03-30 MuonEq Balances the momentum matrix before Newton-Schulz using row, column, or two-sided lightweight equilibration, improving the singular-value geometry seen by finite-step Muon orthogonalization.
2026-03-27 Sharp Capacity Scaling Uses linear associative memory as a tractable factual-recall model to compare one-step recovery rates for Muon, SGD, and Newton, characterizing when spectral optimizers recover overcomplete associations faster.
2026-03-20 RMNP Code Replaces Muon's Newton-Schulz preconditioner with row-momentum normalization, targeting O(mn) matrix preconditioning that keeps much of Muon's benefit with lower wall-clock overhead.
2026-03-18 MUD Substitutes Muon's repeated polar-factor matrix multiplications with a triangular Cholesky/Gauss-Seidel-inspired whitening surrogate, aiming to decorrelate momentum faster on transformer matrices.
2026-03-16 Hyperparameter Scaling Laws Uses convergence bounds for LMO-based optimizers, including normalized SGD, signSGD/Adam proxies, and Muon, to derive scaling rules for batch size and training horizon beyond model-size-only transfer.
2026-03-16 Muon Heavy-Tailed Convergence Proves Muon convergence for nonconvex Holder-smooth empirical risk under bounded heavy-tailed stochastic noise, weakening the usual bounded-variance assumptions common in optimizer theory.
2026-03-15 SPECTRA Adds post-update spectral clipping and optional pre-filtering to control large update spectral norms and sparse spectral noise spikes, making spectral structure explicit even for AdamW-style optimizers.
2026-03-10 HTMuon Uses Heavy-Tailed Self-Regularization theory to modify Muon so updates preserve heavier-tailed spectra instead of over-flattening noise directions, improving LLM pretraining and image-classification performance in experiments.
2026-03-10 Mousse Adds curvature-aware preconditioning to Muon so the optimizer does not apply uniform spectral steps across highly anisotropic curvature directions, reducing high-curvature instability and flat-direction underprogress.
2026-03-10 MOGA Interprets AdamW, Muon, and related methods as steepest descent under mean-normalized matrix operator norms, deriving row/column-normalized updates for width-stable hyperparameter transfer.
2026-03-04 NuMuon Adds a nuclear-norm constraint to Muon to encourage compressible low-rank weight structure while preserving Muon-style full-rank training benefits and favorable convergence behavior.
2026-02-28 Muon Simplicity Bias Investigates downside biases introduced by Muon's speed-oriented spectral updates, analyzing cases where faster optimization can trade off with the simplicity bias often associated with generalization.
2026-02-28 MuonRec Code Adapts Muon orthogonalized momentum to scalable recommendation training, challenging AdamW defaults and reporting fewer training steps plus improved ranking quality in generative recommender models.
2026-02-27 LoRA-Pre Code Reinterprets optimizer momentum EMAs as online linear regressors, then stores Adam/Muon momentum in a low-rank LoRA-style subspace to reduce optimizer-state memory for pretraining and fine-tuning.
2026-02-26 LITE Code Uses a Riemannian dynamics view to show Muon and SOAP can be too conservative in flat directions, then increases flat-direction damping/learning rates to accelerate LLM pretraining.
2026-02-26 FlashOptim Code Reduces mixed-precision training memory by combining optimizer-state quantization with master-weight splitting while preserving API compatibility and model quality for large-model training.
2026-02-25 MUON+ Code Identifies post-polar row/column imbalance after practical Newton-Schulz Muon steps and adds one normalization step to improve blockwise descent without changing Muon's core orthogonalized-momentum design.
2026-02-24 Spectral Conditions for muP Extends maximal-update-parameterization ideas to spectral optimizers by deriving conditions for feature-learning hyperparameter transfer across Shampoo, Muon, and related matrix methods.
2026-02-19 ZO-Muon Combines zeroth-order finite-difference gradient estimation with subspace projection and Muon-style orthogonalization, reducing query variance while exploiting low-rank update structure in memory-efficient fine-tuning.
2026-02-19 NAMO Integrates Muon orthogonalized momentum with Adam-type noise adaptation through a norm-based adaptive stepsize, including a diagonal NAMO-D variant for additional stochastic stability.
2026-02-18 Adam/Muon Implicit Bias Shows momentum steepest-descent methods such as Muon, Signum, and MomentumGD follow approximate norm-specific steepest-descent trajectories and converge toward KKT points of corresponding max-margin problems.
2026-02-18 SpecMuon Adds spectral guidance and mode-wise RSAV step control to Muon for physics-informed neural networks and operators, tempering unit-singular-value steps in stiff multi-scale scientific-learning losses.
2026-02-17 Magma Finds random update masking can regularize adaptive optimizers through curvature-dependent smoothing, then introduces momentum-aligned gradient masking as a dense-optimizer alternative that outperforms Adam/Muon in reported LLM pretraining.
2026-02-13 TrasMuon Keeps Muon's near-isometric orthogonalized direction but restores magnitude control through global RMS calibration and energy-based trust-region clipping to reduce step-size sensitivity and high-energy bursts.
2026-02-12 Muon Quadratic Insights Uses simple strongly convex quadratic examples to show local one-step proxies and worst-case polar-error bounds miss key Muon behavior, motivating a dynamical view of spectral optimizer trajectories.
2026-02-12 Mini-batch Steepest Descent Bias Characterizes mini-batch stochastic steepest descent under entry-wise and Schatten-p norms, showing how batch size, momentum, and variance reduction shape max-margin implicit bias for SignSGD- and Muon-like methods.
2026-02-10 Clarifying Shampoo Decomposes Shampoo updates into an adapted Muon-like spectral step and shows its stochastic/trajectory adaptation explains why Shampoo can be more token-efficient than plain Muon, paralleling Adam versus Signum.
2026-02-09 Pion/Leon Builds adaptive operator-norm matrix online-learning algorithms by smoothing nuclear-norm potentials, yielding Pion/Leon-style methods with regret guarantees and nonsmooth nonconvex optimization rates.
2026-02-08 TSR-Adam Introduces two-sided low-rank synchronization for Adam-family distributed training, communicating a compact U^T G V core to reduce bandwidth and memory relative to one-sided low-rank approaches.
2026-02-07 Sign-Based Heavy-Tail Optimizers Explains empirical gains of Lion, Muon, and other sign-based optimizers through heavy-tailed gradient noise, giving generalized noise conditions under which sign updates outperform variance-adaptive methods.
2026-02-06 Unified Vector/Matrix Adaptivity Decomposes AdaGrad into variance adaptation and scale-invariant update factors, using that split to bridge Adam-style vector adaptivity with Muon-style matrix spectral optimization.
2026-02-06 Muon LoRA Spectral Growth Analyzes Muon/SpecGD dynamics in LoRA-style matrix factorization, showing LoRA product singular values grow nearly uniformly even when orthogonalization is applied separately to the low-rank factors.
2026-02-05 Norm-Constrained Warm-Up Derives warm-up-then-decay schedules for norm-constrained optimizers such as Muon and Lion from a generalized smoothness assumption where curvature falls with suboptimality, replacing heuristic warm-up with adaptive scheduling theory.
2026-02-05 Muon Associative Memory Studies Muon in a softmax linear associative-memory model with hierarchical frequency structure, showing spectral updates reduce frequency-dependent learning imbalance that slows gradient descent.
2026-02-05 ADANA Introduces logarithmic-time schedules for AdamW momentum and weight decay, letting gradient-memory horizons grow with training time and using damping to stabilize improved language-model scaling.
2026-02-04 Canzona Provides an asynchronous, load-balanced framework for distributed matrix optimizers such as Shampoo, Muon, and SOAP, reconciling holistic matrix updates with tensor sharding in Megatron-style training.
2026-02-04 BeyondMuon Code Views Muon as the p=0 endpoint of U Sigma^p V^T spectral transformations and evaluates intermediate RMS/spectral variants to clarify how Muon relates to Adam-style adaptivity.
2026-02-03 PRISM Spectral Shaping Adds low-rank quasi-second-order information to first-order spectral descent through innovation-augmented polar decomposition, suppressing high-variance subspaces while preserving signal-dominated directions with little overhead.
2026-02-03 Non-Euclidean GNS Generalizes gradient-noise-scale batch-size adaptation to signSGD/Signum and spectral-descent/Muon geometries, replacing Euclidean SGD assumptions with norm-matched stochastic noise estimates.
2026-02-01 OLion Combines Lion-style sign momentum with approximate Newton-Schulz orthogonalization and a final entrywise sign step, approximating steepest descent over intersected spectral and l-infinity constraints.
2026-01-30 Spectra Identifies persistent spike-tail anisotropy in LLM gradients and proposes spike-aware spectral suppression so dominant low-rank directions do not throttle tail learning, stability, and downstream quality.
2026-01-30 TEON Generalizes layer-wise Muon by treating network gradients as structured higher-order tensors and orthonormalizing across tensor modes, with convergence arguments and LLM pretraining experiments.
2026-01-30 Spectral GD Phase Retrieval Uses anisotropic phase retrieval as a model problem to show spectral gradient updates can avoid misalignment caused by dominant covariance directions that distract ordinary gradient descent.
2026-01-30 Mano Revisits manifold optimization for LLMs by projecting momentum onto tangent spaces and constraining updates on rotational oblique manifolds, seeking less memory/compute than AdamW while preserving curvature structure better than Muon.
2026-01-29 PRISM Matrix Functions Accelerates matrix square roots, inverse roots, and orthogonalization used by Shampoo/Muon through adaptive polynomial fitting plus randomized iterative sketching, avoiding eigendecomposition while improving GPU efficiency.
2026-01-29 FISMO Combines Fisher-structured adaptivity with Muon-style momentum orthogonalization so update spectra retain curvature information instead of being forced into strict isotropic singular values.
2026-01-29 MCSD/SPEL Extends norm-constrained LMO optimizers such as spectral descent and Muon to manifold constraints with a single-loop method: choose a norm-induced Riemannian steepest direction and project back to the manifold.
2026-01-27 Muon Convergence Rates Provides a simpler direct analysis of Muon for nonconvex optimization, improving convergence-rate guarantees without relying on restrictive assumptions about the orthogonalized update rule.
2026-01-27 Muon with Newton-Schulz Analyzes practical Muon with finite Newton-Schulz orthogonalization, proving it matches ideal SVD-polar Muon convergence up to a constant that improves doubly exponentially with the number of NS steps.
2026-01-21 Variance-Adaptive Muon Introduces Muon-NSR and Muon-VS, applying variance-adaptive normalization to momentum before orthogonalization to combine Adam-like stochastic robustness with Muon matrix geometry.
2026-01-20 Muon Spectral Orthogonalization Studies simplified Muon in matrix factorization and linear-transformer in-context learning, giving end-to-end explanations of how spectral orthogonalization acts as useful preconditioning.
2026-01-18 IFNSO Collapses iterative Newton-Schulz orthogonalization into an iteration-free polynomial formulation by analyzing matrix-power contributions, reducing repeated high-dimensional matmul overhead in Muon/Stiefel settings.
2026-01-13 Spectral Sphere Optimizer (SSO) Code Constrains both module weights and updates on a spectral sphere, aligning Muon-style update control with muP-like activation stability so weights cannot drift while updates remain norm-controlled.
2026-01-04 Principled Muon under muP Studies how to maintain muP spectral conditions throughout training with matrix optimizers such as Muon, aiming to preserve width-independent dynamics and hyperparameter transfer in practical runs.
2025-12-18 MVR-Gluon Adds momentum variance reduction to the Gluon framework, giving one theory path that covers Muon, Scion, and other LMO optimizers and proves faster stochastic convergence than vanilla momentum.
2025-12-15 HCM-LMO Injects Hessian-corrected momentum into arbitrary-norm LMO optimizers such as Muon, Scion, and Gluon, using second-order information to improve rates beyond standard stochastic momentum bounds.
2025-12-10 Fanions Constructs Muon-like optimizers from duals of Ky-Fan norms and their Frobenius/l-infinity combinations, yielding Fanions, F-Muon, and S-Muon families that interpolate matrix-update geometries.
2025-12-05 Matrix-Preconditioned Hyperparameter Transfer Studies learning-rate and weight-decay scaling for matrix-preconditioned optimizers such as Shampoo, SOAP, and Muon, using hyperparameter transfer to make gains consistent across Llama model sizes.
2025-12-04 Turbo-Muon Preconditions the matrix before Newton-Schulz orthogonalization so Muon reaches useful polar accuracy with fewer matrix multiplications, keeping the added preconditioning overhead nearly negligible.
2025-12-03 Spectral Gradient Conditions Derives a layer-wise diagnostic comparing gradient nuclear/Frobenius ratio with activation stable rank to predict when a Muon-style spectral step should beat a Euclidean gradient step.
2025-10-07 NorMuon Code Combines Muon orthogonalization with neuron-wise normalization and second-order statistics, targeting better scalability and efficiency by jointly leveraging Adam-like adaptivity and Muon matrix geometry.
2025-10-06 DP-Adam-AC Adds adaptive clipping to differentially private Adam fine-tuning for localizable LLMs, improving the privacy-utility trade-off when task-specific models must be fine-tuned under privacy constraints.
2025-10-04 Hill-ADAM Augments Adam with deterministic state-space exploration through alternating minimization/maximization phases, aiming to escape local minima rather than settling at the first basin visited.
2025-09-29 Conda Combines Adam coordinate adaptivity with column-normalized updates, addressing Adam's low-rank/spectrally poor update structure while retaining per-coordinate variance scaling for faster LLM training.
2025-09-19 AdaGrad++ Proposes a simpler parameter-free AdaGrad variant with convergence guarantees, removing manual learning-rate tuning while preserving AdaGrad-like rates in convex optimization.
2025-09-19 Adam++ Proposes a parameter-free Adam variant with convergence guarantees that match Adam-style rates without assuming preselected learning-rate conditions, reducing manual tuning burden.
2025-09-03 KL-Shampoo / KL-SOAP Code Recasts Shampoo/SOAP second-moment estimation as KL-divergence covariance estimation rather than Frobenius fitting, producing KL preconditioners that reduce Adam grafting or Adam-in-eigenbasis overhead.
2025-09-01 ZO Fine-Tuner Learns a compact perturbation strategy for zeroth-order LLM fine-tuning, training the optimizer once per foundation model and reusing it across downstream tasks to beat hand-designed ZO baselines.
2025-09 SRON Uses row-wise gradient normalization for state-free LLM training to reduce optimizer-state overhead while stabilizing matrix updates.
2025-07-15 AdaMuon Applies element-wise second-moment scaling to Muon's orthogonalized directions, with sign-stabilized momentum and RMS-aligned rescaling to combine variance adaptivity with stable matrix geometry.
2025-06-20 SCALE Code Finds that adding column-wise gradient normalization and last-layer momentum to SGD is sufficient for competitive LLM pretraining, matching Adam with much lower optimizer-state memory.
2025-06-08 SPlus Code Stabilizes Shampoo-style whitening by combining historical eigenbases with instantaneous normalization and shape-aware scaling, reducing divergence from stale inverse caches and improving wall-clock efficiency.
2025-05-27 PolarGrad Unifies matrix-aware preconditioned optimizers and introduces polar-decomposition update rules that explain links among Adam/Shampoo/Muon-style methods while improving convergence in reported experiments.
2025-05-19 Gluon Bridges theory and practice for LMO-based optimizers by generalizing Muon and Scion into a layer-wise framework with better memory efficiency, hyperparameter transfer, and LLM training performance.
2025-04-07 Dion Code Replaces Muon's dense Newton-Schulz orthogonalization with distributed amortized power iteration plus error feedback, enabling low-rank orthonormalized updates compatible with sharded LLM training.
2025-02-24 COSMOS Code Splits matrix-gradient subspaces between SOAP and Muon: it applies richer adaptive preconditioning to leading eigendirections and cheaper Muon-style updates to the remaining space for memory-efficient LLM training.
2025-02-24 D-Muon Code Scales Muon to larger LLM training by adding weight decay and per-parameter update-scale adjustment, reporting about 2x compute efficiency over AdamW under compute-optimal scaling and releasing the Moonlight recipe.
2025-02-11 Scion Code Introduces stochastic LMO-over-norm-ball optimizers for unconstrained deep learning, choosing norms that improve memory efficiency and hyperparameter transfer while giving nanoGPT speedups.
2024-12-08 Muon Code Defines Muon as SGD/Nesterov momentum followed by Newton-Schulz orthogonalization of 2D hidden-layer updates, usually paired with AdamW for embeddings, heads, and non-matrix parameters.
2024-12-02 PROFIT Designs an optimizer specifically for deep fine-tuning of already-converged models, using temporal gradient orthogonalization and assumptions about pretrained weights to regularize task adaptation beyond generic SGD/Adam.
2024-11-25 Cautious Optimizers Code Adds a one-line cautious mask to momentum optimizers, preserving Adam-style Lyapunov convergence while empirically improving transformer training for C-AdamW, C-Lion, and related variants.
2024-11-15 MARS Code Brings variance-reduction ideas into large-model optimization by correcting adaptive/sign optimizer updates with gradient-difference signals, producing MARS variants that improve GPT-style training efficiency.
2024-11-11 Subset-Norm + Subspace-Momentum Code Combines Subset-Norm step sharing, which reduces AdaGrad memory from O(d) to O(sqrt(d)), with Subspace-Momentum to lower optimizer-state cost while retaining convergence guarantees.
2024-11-05 ADOPT Code Modifies Adam so it converges at the optimal O(1/sqrt(T)) rate for any beta2 without bounded-gradient-noise assumptions, addressing Adam non-convergence while keeping adaptive-gradient practicality.
2024-10-25 COAT Code Compresses optimizer states and activations into FP8 training through dynamic range expansion and related memory techniques, reducing memory footprint beyond FP8 linear-layer-only frameworks.
2024-10-21 LDAdam Code Performs Adam-style adaptivity in changing low-dimensional projected subspaces, using projection-aware state updates and generalized error feedback to lower memory while still exploring full parameter space.
2024-09-17 SOAP Code Shows Shampoo with half-power preconditioning is equivalent to Adafactor in Shampoo's eigenbasis, then stabilizes and simplifies Shampoo by combining it with Adam-style moment updates.
2024-09-05 AdEMAMix Code Adds a second, slower EMA of older gradients to AdamW-style momentum so the optimizer can use both recent directions and longer-term gradient memory, improving token efficiency.
2024-06-24 Adam-mini Code Cuts Adam memory by replacing most per-parameter second-moment learning rates with block-level rates chosen from Hessian-structure principles, retaining AdamW-like performance at about half the memory.
2024-05-24 SF-AdamW (Schedule-Free) Code Removes explicit learning-rate schedules by using schedule-free momentum/iterate-averaging theory, matching scheduled training performance without knowing the stopping step in advance.
2024-05-24 MicroAdam Code Compresses gradients before feeding them into Adam states and uses compressed error feedback, reducing optimizer-state memory while preserving convergence guarantees.
2024-05-21 FAdam Code Interprets Adam as a diagonal empirical-Fisher natural-gradient method, identifies approximation flaws in standard Adam, and proposes corrections to momentum, bias correction, and epsilon handling.
2023-12-04 AGD Code Builds an auto-switchable preconditioner from stepwise gradient differences, using Hessian-related diagonal information to choose between adaptive and non-adaptive behavior during training.
2023-10-16 AdaLOMO Code Adds adaptive learning rates to low-memory full-parameter LLM fine-tuning, keeping LOMO-style memory savings while closing much of the convergence gap to AdamW.
2023-09-05 AdaPlus Code Combines AdamW, NAdam, and AdaBelief ideas by adding Nesterov momentum and precise step-size adjustment without extra hyperparameters, improving language-model and vision training baselines.
2023-07-28 CoRe Code Benchmarks Continual Resilient optimization as a robust all-in-one first-order optimizer, emphasizing smooth convergence, low tuning burden, and broad applicability across machine-learning tasks.
2023-07-18 Adam+CM Code Augments Adam with a memory buffer of critical momentum terms that intentionally overshoot narrow minima, encouraging exploration toward flatter basins and better generalization.
2023-07-05 CAME Code Uses confidence-guided memory-efficient second-moment estimation to reduce instability in Adafactor-like optimizers, aiming for Adam/LAMB quality with much lower auxiliary memory.
2023-06-16 LOMO Code Fuses gradient computation and parameter updates so full-parameter LLM fine-tuning can run under limited GPU memory, avoiding the optimizer-state cost of AdamW-style training.
2023-06-09 Prodigy Code Improves D-Adaptation by estimating the distance-to-solution parameter needed for optimal learning rates, yielding a parameter-free optimizer that matches tuned methods across many tasks.
2023-05-25 WSAM Code Recasts SAM sharpness as an explicit weighted regularization term, deriving generalization bounds and improving or matching standard optimizers on benchmark datasets.
2023-05-25 DoWG Code Introduces Distance-over-Weighted-Gradients, a parameter-free optimizer that adapts to smooth and nonsmooth convex problems using a distance-weighted gradient accumulator.
2023-05-23 Sophia Code Uses a lightweight diagonal Hessian estimate as a stochastic second-order preconditioner with clipping, reducing language-model pretraining cost relative to Adam-family methods.
2023-05-09 UAdam Provides a unified Adam-type framework covering Adam, NAdam, AMSGrad, AdaBound, AdaFom, and Adan through a general second-moment form with nonconvex stochastic convergence analysis.
2023-02-16 FOSI Code Wraps any first-order optimizer with low-dimensional second-order corrections by splitting the objective into orthogonal subspaces, using curvature where useful and the base optimizer elsewhere.
2023-02-13 Lion Code Discovers a memory-efficient sign-momentum optimizer through symbolic program search, keeping only momentum state and using update signs rather than Adam-style second moments.
2023-02-08 DoG Code Sets SGD step sizes dynamically from distance-to-initial-point and gradient norms, removing the learning-rate hyperparameter while matching tuned SGD on vision and language transfer tasks.
2023-01-18 D-Adaptation Code Automatically estimates the learning-rate scale for SGD/Adam/AdaGrad variants without line searches or extra gradients, matching hand-tuned rates across diverse convex and ML experiments.
2022-11-17 VeLO Code Meta-trains a neural-network optimizer at large scale across many tasks, producing a versatile learned optimizer that transfers with little tuning and exhibits non-hand-designed update behavior.
2022-10-21 Amos Code Adds model-oriented adaptive learning-rate decay and weight decay to Adam-style optimization, improving BERT/T5 pretraining speed while using less slot-variable memory than AdamW.
2022-10-12 AdaNorm Code Corrects each iteration's gradient norm using adaptive history so SGD/momentum-style optimizers maintain representative update magnitudes and improve CNN convergence.
2022-08-13 Adan Code Derives an adaptive Nesterov momentum estimator that avoids extra extrapolation gradients and estimates first/second moments for faster, robust deep-model training.
2022-06-14 GradaGrad Modifies AdaGrad's monotonically shrinking denominator so the effective learning rate can both grow and shrink, preserving similar convergence rates while reducing tuning sensitivity.
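
To make a few of the update rules summarized in the table above more concrete (the Muon, Lion, and Cautious Optimizers rows), here is a minimal, dependency-free Python sketch. It is not taken from any of the linked repositories: the function names, default hyperparameters, and list-of-rows matrix layout are assumptions for illustration. The orthogonalization uses the classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X with plain heavy-ball momentum, whereas the Muon row mentions SGD/Nesterov momentum and actual implementations may use a differently tuned polynomial. The cautious mask assumes the rule "keep coordinates where update and gradient share a sign, then rescale by the kept fraction".

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=10, eps=1e-7):
    """Approximate the polar factor U V^T of g (cubic Newton-Schulz, an assumed variant)."""
    # Dividing by the Frobenius norm puts all singular values in [0, 1],
    # which lies inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xr, cr)]
             for xr, cr in zip(x, xxtx)]
    return x


def muon_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """Heavy-ball momentum followed by an orthogonalized update (hypothetical defaults)."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    update = newton_schulz_orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)]
              for wr, ur in zip(weight, update)]
    return weight, momentum


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def lion_step(weight, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """Sign-momentum update on flat lists; momentum is the only optimizer state."""
    blended = [beta1 * m + (1 - beta1) * g for m, g in zip(momentum, grad)]
    weight = [w - lr * (sign(c) + weight_decay * w) for w, c in zip(weight, blended)]
    momentum = [beta2 * m + (1 - beta2) * g for m, g in zip(momentum, grad)]
    return weight, momentum


def cautious_mask(update, grad, eps=1e-3):
    """Zero coordinates where update and gradient disagree in sign, then rescale."""
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    kept_fraction = max(sum(mask) / len(mask), eps)
    return [u * m / kept_fraction for u, m in zip(update, mask)]
```

In this sketch, `muon_like_step` expects 2D parameters as nested lists, while `lion_step` and `cautious_mask` work on flat lists. The mask is meant to be applied to an optimizer's proposed update just before it is subtracted from the weights.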

Weakly Related / Adjacent Muon Papers

These papers mention, compare, or rely on Muon-style optimization, but their main contribution is broader than a standalone Muon-family optimizer.

Date Paper Relation
2026-06-14 Schattor Introduces a Schatten-norm optimizer family that unifies SGD and Muon-like matrix updates with dimension-free stationarity theory; it is adjacent because the contribution is a broader optimizer framework, not a Muon variant.
2026-06-13 When to use Schatten-p Norm Analyzes which Schatten-p geometries are optimal under different scaling/noise regimes, explaining when Muon-like Schatten-infinity updates help and when smaller-p geometries are preferable.
2026-06-12 ZO Parameter-free LMO Combines zeroth-order fine-tuning with parameter-free LMO methods to reduce memory and tuning burden, borrowing Muon-style geometry but focusing on ZO/PF optimization broadly.
2026-06-12 Zeta Diagnoses coordinate-scale heterogeneity in matrices before Newton-Schulz, then proposes coordinate-adaptive dual whitening; Muon appears mainly as a matrix-aware baseline and motivation.
2026-06-11 Different Layers Different Manifolds Tests module-wise manifold assignments for GPT-2 training, finding Stiefel constraints suit attention while DGram suits MLPs; Muon is used through a Manifold Muon lens rather than modified directly.
2026-06-09 Overcoming Rank Collapse in Feedback Alignment Studies why feedback alignment fails in deeper networks and uses orthogonalized/Muon-style updates to increase feedback-signal rank, but the target is biologically plausible learning, not optimizer design.
2026-06-04 PC Layer Adds a polynomial weight-preconditioning layer that reshapes singular spectra during LLM pretraining and can be merged away at inference; Muon is one optimizer tested with the architectural preconditioner.
2026-06-04 Double Preconditioning Optimizes for rollout/test-time performance under feedback mismatch rather than validation loss, treating Muon as one possible gradient-wise preconditioner inside a broader DoPr framework.
2026-06-02 Ultralytics YOLO26 Presents a real-time vision model family with NMS-free heads and training changes; Muon-SGD appears as part of the recipe, but the main contribution is architecture/system design.
2026-06-01 WALL-WM Builds event-grounded world-action-model pretraining for VLA/video-action learning; Muon is part of the large-scale training stack rather than the object of study.
2026-05-30 Exploiting Weight-Space Symmetries for Approximating Curvature Uses weight-space symmetry averaging to construct tractable curvature approximations from single gradients, recovering Shampoo/Muon-like structures as cases of a broader symmetry framework.
2026-05-29 Mellum2 Technical Report Reports a 12B MoE code/general model and mentions Muon in the FP8 hybrid-precision training recipe; it is model-report evidence of Muon use, not an optimizer contribution.
2026-05-28 On the Optimizer Dependence of Neural Scaling Laws Shows scaling-law exponents vary with optimizer choice in random-feature regression and related settings, using Matrix-Sign/Muon-like methods as one optimizer family under comparison.
2026-05-27 Parallax Introduces parameterized local linear attention for language modeling and uses Muon to stabilize/scale training; the optimizer is enabling infrastructure, not the paper's main method.
2026-05-26 How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks Compares Adam and Muon across equivariant/geometric models, finding Muon often improves optimization; it analyzes optimizer effects rather than proposing a new optimizer.
2026-05-26 The Stability of Singular Distribution Identifies early stabilization of trace-normalized singular spectra during LLM pretraining across schedules and optimizers including Muon, connecting spectral dynamics to two-phase loss curves.
2026-05-23 Momentum Streams for Optimizer-Inspired Transformers Interprets Transformer residual updates as optimizer steps and designs architectures inspired by momentum, Adam, Muon, and SOAP; the contribution is architectural, not an optimizer update rule.
2026-05-20 Same Architecture, Different Capacity Measures FFN representation spectra and shows AdamW and Muon produce different spectral scaling laws at fixed architecture, framing optimizer choice as a capacity-scaling axis.
2026-05-18 Scale-Invariant Neural Network Optimization Develops theory for scale-invariant optimizers under norm geometry and heavy-tailed noise, including Muon and Scion as examples within a broader optimization principle.
2026-05-09 Navigating LLM Valley Surveys LLM optimizer design from AdamW through memory-efficient and matrix-based methods, positioning Muon within the broader optimizer landscape rather than introducing a method.
2026-05-07 Optimizer-Model Consistency Shows full fine-tuning with the same optimizer used in pretraining can reduce forgetting, comparing Muon and AdamW behavior as evidence for optimizer-shaped model states.
2026-04-16 Benchmarking Optimizers for MLPs in Tabular Deep Learning Benchmarks 15 optimizers for tabular MLPs and finds Muon strong against AdamW; it is empirical optimizer evaluation rather than new algorithm design.
2026-04-02 Normalization-Optimizer Coupling Tests normalization-layer choices with AdamW and Muon at 1B scale, showing Dynamic Erf interacts badly with Muon while RMSNorm/DyT behave differently.
2026-03-30 Spectral Edge Dynamics Uses rolling-window spectra of parameter updates to analyze phase transitions such as grokking and loss plateaus; optimizer spectra are diagnostic tools, not new update rules.
2026-02-25 veScale-FSDP Improves FSDP/ZeRO infrastructure for block-structured computations and non-element-wise optimizers such as Shampoo and Muon, focusing on sharding/runtime support.
2026-02-07 Robust Scaling Laws for Optimizers Studies Chinchilla-style and optimizer-specific scaling laws across AdamW, Muon, Shampoo, SOAP, and others, asking how optimizer choice changes compute/data/model scaling.
2026-01-31 Data Distribution as an Optimizer Lever Analyzes whether changing training data distribution can steer optimizer generalization behavior, comparing GD and SAM; it is adjacent to optimizer choice but not Muon-specific.
2026-01-14 Muon-Optimized Distillation and Quantization Combines GPTQ quantization, LoRA, distillation, and Muon-based fine-tuning in a deployment pipeline for compressed LLMs, using Muon as a component rather than proposing it.
2026-01-08 Learnable Multipliers Adds learnable scalar multipliers to matrix layers to escape weight-decay/noise equilibrium norms, validating the scaling idea under Adam and Muon.
2025-12-16 Optimizing Rank for INRs Argues vanilla MLP INRs fail at high frequencies due to stable-rank degradation and uses Muon-like high-rank updates/rank regularization to preserve representation rank.
