Curated optimizer-design papers from 2022+, ordered by exact publication or submission date in reverse chronological order.
- CSV: data/optimizers.csv
- Weakly related Muon papers: data/muon_weakly_related.csv
- Date source: arXiv `published` date for arXiv papers; source-page submission date for OpenReview/blog entries.
- Note: `SRON` is a legacy unsourced row without a paper URL, so its date remains month-level until the source paper is verified.

| Date | Optimizer Name | Summary |
|---|---|---|
| 2026-06-16 | MGUP Code | Selects a fixed fraction of parameters for larger momentum-gradient-aligned steps while giving the rest smaller nonzero updates; works as a plug-in wrapper for AdamW, Lion, and Muon with convergence guarantees and LLM training experiments. |
| 2026-06-15 | Hyperball Optimization | Constrains both weight-matrix and optimizer-update Frobenius norms to fixed constants around a base optimizer such as Adam or Muon; reports 20-30% token-equivalent speedups on Qwen3-style models up to 1.2B and better learning-rate transfer. |
| 2026-06-15 | CacheMuon | Caches temporally correlated polar-factor information from previous steps to precondition Muon's Newton-Schulz computation, reducing repeated orthogonalization cost while exposing a quality-efficiency trade-off. |
| 2026-06-12 | Free Heavy-Tailed Lunch for Muon | Gives heavy-tailed nonconvex theory showing matrix-valued non-Euclidean optimizers such as Muon and Scion can avoid dimension-dependent costs and achieve stronger stationarity guarantees than Euclidean methods. |
| 2026-06-11 | Muon^p | Interpolates between gradient descent and Muon by using fractional spectral-power updates U S^p V^T; derives practical low-degree bivariate matrix recurrences because fixed univariate polynomial iterations cannot compute the required powers. |
| 2026-06-11 | LoRA-Muon | Derives a Muon-style spectral steepest-descent rule on the low-rank LoRA manifold, paired with split weight decay, to reduce initialization and stepsize sensitivity and improve rank/width/depth learning-rate transfer. |
| 2026-06-09 | FOGO | Frames forgetting as step-level gradient interference, then spectrally orthogonalizes momentum updates and keeps a compact codebook of past directions so dominant minibatch gradients do not erase rare useful directions. |
| 2026-06-08 | Muon Robust Transfer | Evaluates pretrained models under corrupted image/text shifts and layer-wise probes, finding Muon-trained features more robust and transferable than Adam or SGD features across transformer and CNN settings. |
| 2026-06-07 | OptMuon | Combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm coefficient schedule, turning orthogonalized momentum into a closed-loop method that calibrates update magnitudes from observed optimization history. |
| 2026-06-07 | Muon Spectral Dynamics | Analyzes Muon's polar update as a flat-spectrum, entropy-maximizing bias under alignment assumptions, deriving singular-value dynamics and showing how the update geometry changes noise behavior rather than merely rescaling gradients. |
| 2026-06-03 | Why Muon Outperforms Adam: A Curvature Perspective | Uses second-order Taylor decomposition to show Muon and Adam have similar first-order gain at matched validation loss, while Muon pays a smaller curvature penalty through lower normalized directional sharpness. |
| 2026-06-02 | Spectral Scaling Laws of Muon | Measures singular-value quantiles of Muon momentum matrices across model sizes and layers, identifying which directions finite Newton-Schulz misses and turning the spectra into layer-aware orthogonalization guidance. |
| 2026-06-02 | Denoise First, Orthogonalize Later | Models Muon momentum as a spectral filter that suppresses perturbation modes before orthogonalization, increasing the signal-perturbation gap and stabilizing the singular subspaces passed into Newton-Schulz/polar updates. |
| 2026-06-01 | A Note on Stability for Orthogonalized Matrix Momentum | Proves finite-round generalization bounds for client-sampled distributed optimization with orthogonalized matrix momentum, explicitly tracking heterogeneous client sampling and finite-step Newton-Schulz effects. |
| 2026-05-29 | How Much Orthogonalization Does Muon Need? | Studies relaxed low-precision Newton-Schulz schedules for Muon and shows cheaper partial orthogonalization can preserve training behavior, separating the need for useful spectral shaping from exact polar accuracy. |
| 2026-05-29 | Softsign / SoftMuon | Introduces SoftSignum, replacing hard sign updates with a temperature-controlled soft-sign map so updates can move between sign-like and magnitude-sensitive behavior; extends the same relaxation to matrix optimizers as SoftMuon with quantile temperature scheduling. |
| 2026-05-26 | Entry-Wise Clipping for Muon | Models language-model gradient noise as entry-wise heavy-tailed contamination, derives an entry-wise clipping surrogate that can control spectral noise, and positions it as a cheaper structural alternative or complement to Muon-style spectral normalization. |
| 2026-05-26 | Spectral Descent (SD/TSD) | Studies simplified Muon-like Spectral Descent and Truncated Spectral Descent under non-smooth convex objectives, proving linear convergence under sharpness and connecting regularized variants to Frank-Wolfe-style sublinear guarantees. |
| 2026-05-26 | Muon Adversarial Training | Tests whether Muon-style orthogonalized matrix updates improve adversarial training, deriving a spectral-norm stability ceiling and evaluating robustness across architectures and heterogeneous threat models. |
| 2026-05-26 | MONA | Adds a Nesterov-like acceleration term from an EMA of gradient differences directly into Muon before orthogonalization, with convergence analysis arguing it helps escape sharp minima while preserving Muon spectral regularization. |
| 2026-05-26 | MuCon | Replaces Muon's polar direction, which maps all singular values to one, with singular-value-clipped updates under a SpectralP scaling recipe, using a spectral-norm-ball projection to retain controlled magnitude information. |
| 2026-05-25 | EMA-Nesterov | Reinterprets Nesterov acceleration as trajectory extrapolation and replaces noisy one-step lookahead with an EMA of parameter updates, yielding a wrapper that stabilizes acceleration for stochastic nonconvex deep-learning training. |
| 2026-05-23 | Muon in Vision Transformers | Benchmarks Muon against AdamW for ViT training on ImageNet-100 and Pl@ntNet-300K, showing recipe-dependent gains and linking heavy augmentation to healthier matrix-gradient spectra and less late-training mode collapse. |
| 2026-05-22 | Regularized Muon Flow | Interprets smoothed Muon orthogonalization as the gradient of a Fenchel-dual nuclear-norm smoothing, recasting Muon as a mirror/prox update and deriving Hamiltonian probability-gradient-flow dynamics for mean-field training views. |
| 2026-05-21 | AMUSE | Explains Muon through a river-valley loss-landscape picture where orthogonalization speeds flat-direction progress but amplifies dominant-direction oscillations; combines Muon with schedule-free iterate averaging for anytime stable gradient evaluation. |
| 2026-05-21 | Layerwise Learning Rates (LLR) | Uses Heavy-Tailed Self-Regularization estimates from layer weight spectra to assign larger learning rates to weakly heavy-tailed layers and smaller rates to strongly heavy-tailed layers, reporting faster AdamW and Muon LLM pretraining. |
| 2026-05-19 | LionMuon | Alternates cheap Lion/sign-style steps with expensive Muon spectral steps using a shared dual-EMA buffer, matching Lion memory while reducing average cost and reporting Pareto gains over Muon, Lion, Signum, and AdamW. |
| 2026-05-19 | Schatten-p Adaptive Optimization | Derives a data-driven criterion from gradient and activation statistics to choose layer-wise Schatten-p LMO geometries, interpolating between SGD, Muon, Adam, and MuAdam-style updates instead of fixing the optimizer geometry. |
| 2026-05-19 | MiMuon | Analyzes Muon generalization via algorithmic stability, then mixes Muon orthogonalized gradients with momentum SGD to improve generalization while retaining fast matrix-aware convergence for large models. |
| 2026-05-19 | High-Pass Pion | Identifies failure modes of full Muon whitening in VLA training and RLVR, where low-rank or low-SNR gradients make tail amplification harmful; proposes Pion as a high-pass spectral filter that promotes useful directions while suppressing noise. |
| 2026-05-18 | Distance-Aware Muon | Develops adaptive step-scaling rules for normalized Muon directions, using trajectory distance, scale calibration, or descent certificates to set trust-region radii and reduce manual global step-size tuning. |
| 2026-05-18 | Ringmaster LMO | Extends Muon/LMO-style momentum to heterogeneous distributed training by allowing asynchronous delayed gradients, aiming to avoid straggler bottlenecks while preserving LMO update geometry. |
| 2026-05-18 | Symmetry-Compatible Optimizers | Assigns row-norm, spectral, or hybrid optimizer geometries according to parameter symmetry groups, extending Muon-style equivariance so each block receives an update compatible with its transformation structure. |
| 2026-05-18 | AMO | Adapts how often each matrix is orthogonalized by estimating Newton-Schulz difficulty from layer geometry, spending Muon computation where it matters and reducing unnecessary matrix-multiply cost elsewhere. |
| 2026-05-16 | DynMuon | Generalizes Muon from the polar factor U V^T to dynamic spectral shaping U Sigma^p V^T, tuning the exponent during training so updates can vary between gradient-preserving and spectrum-flattening behavior. |
| 2026-05-13 | Muon Spectral Flattening | Analyzes Muon as a spectral-flattening operation that enlarges stable learning-rate ranges and accelerates convergence by redistributing update energy across singular directions rather than following raw gradient magnitudes. |
| 2026-05-13 | DP-Muon | Combines differentially private per-example clipping and Gaussian noise with momentum and Newton-Schulz orthogonalization, adapting Muon to private training while studying privacy, utility, and spectral-update stability. |
| 2026-05-12 | Spectral Preconditioning | Formulates constrained stochastic spectral preconditioning as a proximal extension of Muon and Scion, giving theory for heavy-tailed noise and matrix-norm geometries where spectral updates can outperform Euclidean ones. |
| 2026-05-12 | Pion | Uses non-additive left and right orthogonal transforms to alter Muon-style update spectra while preserving singular values, separating directional rotation from full polar flattening in matrix optimization. |
| 2026-05-12 | MuonQ Code | Compresses Muon optimizer states to 4-bit precision by optimizing directional fidelity, targeting memory reduction while preserving the orthogonalized update directions needed for training quality. |
| 2026-05-11 | Error Whitening | Frames Gauss-Newton improvements as whitening prediction errors in function space, comparing the induced dynamics with Newton, Adam, and Muon to explain when curvature-aware whitening can help. |
| 2026-05-11 | Freon/Kaon | Tests Muon-like optimizers with Schatten and randomized spectra, arguing that update alignment and descent potential can matter more than exactly matching a particular matrix geometry. |
| 2026-05-11 | Muon Fine-tuning Transfer | Studies optimizer mismatch when switching Adam-pretrained models to Muon for fine-tuning, showing forgetting and instability correlate with update strength and proposing transfer procedures for safer Muon fine-tuning. |
| 2026-05-11 | Optimizer-Induced Mode Connectivity | Shows same-optimizer solution sets can be connected while AdamW and Muon regions may be separated by loss barriers, using theory and GPT-2 pretraining paths to expose optimizer-dependent implicit regularization. |
| 2026-05-11 | Muown | Diagnoses Muon spectral-norm drift as row-magnitude growth rather than row-coherence change, then treats row magnitudes as explicit optimizer variables while applying Muon to the remaining direction component. |
| 2026-05-11 | SODA | Unifies optimizers including Muon, Lion, AdEMAMix, and NAdam as optimistic dual-averaging methods, then proposes a SODA wrapper with a 1/k weight-decay schedule to reduce weight-decay tuning. |
| 2026-05-10 | Muon Phase Analysis | Derives deterministic dynamics for stochastic SignSVD/Muon-like spectral optimizers on matrix least squares, identifying batch-size phases where Muon preconditions covariance spectra and when it degenerates toward SGD-like behavior. |
| 2026-05-10 | Dimension-Free Muon Escape | Analyzes Muon saddle-point escape in high-dimensional landscapes and proves its nonlinear spectral shaping can avoid dimension-dependent trapping that affects element-wise adaptive optimizers such as AdamW. |
| 2026-05-10 | Intrinsic Muon | Lifts Muon-style linear minimization oracles to Riemannian matrix manifolds by defining intrinsic tangent-space norms from the metric, preserving quotient symmetries for low-rank, orthogonal, and SPD parameters. |
| 2026-05-09 | ZO Partial Orthogonalization | Adapts spectral optimization to zeroth-order LLM fine-tuning, showing full orthogonalization is too noisy and proposing power-iteration-based partial orthogonalization to exploit weak spectral directions safely. |
| 2026-05-09 | Muon Non-Convergence | Proves Muon fails to converge on convex Lipschitz functions for any learning-rate schedule, then shows error feedback restores theoretical convergence even though it can hurt empirical image-classification performance. |
| 2026-05-09 | Muon-OGD | Builds a continual-learning projection method using Muon-style spectral-norm geometry, replacing Frobenius orthogonal-gradient projection with spectral-aware updates for matrix-valued LLM parameters. |
| 2026-05-09 | Group Muon | Compares full-matrix, head-wise, and grouped Muon for attention projections, deriving a trade-off between group-wise whitening gain and grouping-induced norm cost and tuning group size as an optimizer hyperparameter. |
| 2026-05-08 | PolarAdamW | Applies Muon's Newton-Schulz polar map to AdamW-preconditioned directions, separating polar spectral control from Schur gauge-equivariance and testing the hybrid on DeiT-Tiny vision training. |
| 2026-05-08 | OrScale-LM | Adds a layer-wise trust ratio to Muon using the Frobenius norm of the actual parameter-space update direction, avoiding failure modes of naive Muon-LAMB hybrids and calibrating language-model layers at initialization. |
| 2026-05-07 | Orth-Dion | Shows Dion's column-normalized low-rank approximation misses the rank-r polar factor targeted by Muon, then orthogonalizes the compressed factor to remove geometric mismatch in distributed spectral optimization. |
| 2026-05-07 | Nesterov Muon | Develops convergence theory for practical Muon with Nesterov momentum, heavy-tailed stochastic gradients, and inexact/randomized polar decomposition, quantifying how approximation errors propagate. |
| 2026-05-07 | SignSGD/Muon Lower Bounds Code | Uses l1 stationarity, l-infinity smoothness, and separable noise assumptions to derive matching lower and upper bounds explaining when sign-based methods such as SignSGD and Muon can beat SGD. |
| 2026-05-07 | Pro-KLShampoo | Observes spike-and-flat spectra in KL-Shampoo Kronecker preconditioners and projects the structured preconditioned direction through orthogonalization, recovering whitening in a Muon-adjacent optimizer. |
| 2026-05-07 | Implicit Gradient Transport | Introduces LMO-IGT, using implicit gradient transport to accelerate LMO-based optimizers like Lion and Muon without extra gradient evaluations, alongside a regularized support-function stationarity measure. |
| 2026-05-05 | Aurora Code | Diagnoses Muon row-leverage anisotropy on tall matrices as a cause of dead MLP neurons, then adds leverage-aware row normalization/equilibration so rectangular-matrix updates keep useful row mass while preserving polar precision. |
| 2026-05-05 | Nora | Projects row-wise momentum onto the orthogonal complement of weights to respect scale invariance, aiming to combine Muon-like matrix preconditioning, stable norm/angular dynamics, and O(mn) update cost. |
| 2026-05-04 | SignMuon | Combines Muon polar directions with signSGD-style 1-bit majority-vote communication: workers orthogonalize local momentum, transmit entrywise signs, and optionally apply a local polar correction for bandwidth-efficient distributed training. |
| 2026-04-27 | SUDA-Muon | Shows decentralized Muon is hard because matrix-sign orthogonalization does not commute with gossip averaging, then separates primal-dual communication from the nonlinear Muon step using a SUDA template with convergence guarantees. |
| 2026-04-16 | CLion | Adds cautious-update masking to Lion and studies generalization via algorithmic stability, giving theory for Lion-style sign momentum and empirical evidence that the cautious variant improves robustness/generalization. |
| 2026-04-12 | Federated Gluon | Adapts Gluon/Muon-style LMO optimization to federated learning with unbiased or contraction compressors plus SARAH-style variance reduction, targeting communication-efficient non-Euclidean training. |
| 2026-04-11 | Muon^2 | Applies Adam-style second-moment preconditioning before Muon orthogonalization so the Newton-Schulz polar approximation starts from a better-conditioned matrix, improving both update quality and iteration efficiency. |
| 2026-04-10 | APT for MTL | Studies why multi-task-learning gradient balancing can be weakened by advanced optimizer momentum, then adjusts the interaction with optimizers such as Muon so task de-conflicting affects the actual update direction. |
| 2026-04-09 | Adam-HNAG | Reformulates full-batch Adam through variable/operator splitting and curvature-aware gradient correction, yielding Adam-HNAG flows and discrete variants with Lyapunov-based accelerated convergence guarantees. |
| 2026-04-06 | Muon Spectral Wasserstein Flow | Derives continuous-time mean-field dynamics for normalized matrix flows, defining Spectral Wasserstein distances where the operator-norm case captures Muon geometry and Schatten norms interpolate with classical W2. |
| 2026-04-06 | Muon-Accelerated Tensor GLM | Applies Muon-style orthogonalized acceleration to low-separation-rank tensor generalized linear models, preserving tensor structure while improving the block-coordinate estimation used in LSR tensor regression. |
| 2026-04-05 | SIFT/Subspace Control | Frames constrained model steering as spectral subspace-control optimization, using subspace orthogonalization to reduce interference between the primary objective and safety/privacy/task constraints. |
| 2026-04-01 | Newton-Muon | Derives an optimizer from a quadratic surrogate involving the gradient, output-space curvature, and input data matrix, adding input-side Newton preconditioning to Muon-style orthogonalized updates. |
| 2026-03-30 | HyperP | Builds hypersphere parameterization for Muon-style Frobenius-norm-constrained training, transferring optimal learning rates across width, depth, token budget, and MoE granularity under a fixed-norm parameterization. |
| 2026-03-30 | MuonEq | Balances the momentum matrix before Newton-Schulz using row, column, or two-sided lightweight equilibration, improving the singular-value geometry seen by finite-step Muon orthogonalization. |
| 2026-03-27 | Sharp Capacity Scaling | Uses linear associative memory as a tractable factual-recall model to compare one-step recovery rates for Muon, SGD, and Newton, characterizing when spectral optimizers recover overcomplete associations faster. |
| 2026-03-20 | RMNP Code | Replaces Muon's Newton-Schulz preconditioner with row-momentum normalization, targeting O(mn) matrix preconditioning that keeps much of Muon's benefit with lower wall-clock overhead. |
| 2026-03-18 | MUD | Substitutes Muon's repeated polar-factor matrix multiplications with a triangular Cholesky/Gauss-Seidel-inspired whitening surrogate, aiming to decorrelate momentum faster on transformer matrices. |
| 2026-03-16 | Hyperparameter Scaling Laws | Uses convergence bounds for LMO-based optimizers, including normalized SGD, signSGD/Adam proxies, and Muon, to derive scaling rules for batch size and training horizon beyond model-size-only transfer. |
| 2026-03-16 | Muon Heavy-Tailed Convergence | Proves Muon convergence for nonconvex Holder-smooth empirical risk under bounded heavy-tailed stochastic noise, weakening the usual bounded-variance assumptions common in optimizer theory. |
| 2026-03-15 | SPECTRA | Adds post-update spectral clipping and optional pre-filtering to control large update spectral norms and sparse spectral noise spikes, making spectral structure explicit even for AdamW-style optimizers. |
| 2026-03-10 | HTMuon | Uses Heavy-Tailed Self-Regularization theory to modify Muon so updates preserve heavier-tailed spectra instead of over-flattening noise directions, improving LLM pretraining and image-classification performance in experiments. |
| 2026-03-10 | Mousse | Adds curvature-aware preconditioning to Muon so the optimizer does not apply uniform spectral steps across highly anisotropic curvature directions, reducing high-curvature instability and flat-direction underprogress. |
| 2026-03-10 | MOGA | Interprets AdamW, Muon, and related methods as steepest descent under mean-normalized matrix operator norms, deriving row/column-normalized updates for width-stable hyperparameter transfer. |
| 2026-03-04 | NuMuon | Adds a nuclear-norm constraint to Muon to encourage compressible low-rank weight structure while preserving Muon-style full-rank training benefits and favorable convergence behavior. |
| 2026-02-28 | Muon Simplicity Bias | Investigates the downsides of Muon's speed-oriented spectral updates, analyzing cases where faster optimization trades off against the simplicity bias often associated with generalization. |
| 2026-02-28 | MuonRec Code | Adapts Muon orthogonalized momentum to scalable recommendation training, challenging AdamW defaults and reporting fewer training steps plus improved ranking quality in generative recommender models. |
| 2026-02-27 | LoRA-Pre Code | Reinterprets optimizer momentum EMAs as online linear regressors, then stores Adam/Muon momentum in a low-rank LoRA-style subspace to reduce optimizer-state memory for pretraining and fine-tuning. |
| 2026-02-26 | LITE Code | Uses a Riemannian dynamics view to show Muon and SOAP can be too conservative in flat directions, then increases flat-direction damping/learning rates to accelerate LLM pretraining. |
| 2026-02-26 | FlashOptim Code | Reduces mixed-precision training memory by combining optimizer-state quantization with master-weight splitting while preserving API compatibility and model quality for large-model training. |
| 2026-02-25 | MUON+ Code | Identifies post-polar row/column imbalance after practical Newton-Schulz Muon steps and adds one normalization step to improve blockwise descent without changing Muon's core orthogonalized-momentum design. |
| 2026-02-24 | Spectral Conditions for muP | Extends maximal-update-parameterization ideas to spectral optimizers by deriving conditions for feature-learning hyperparameter transfer across Shampoo, Muon, and related matrix methods. |
| 2026-02-19 | ZO-Muon | Combines zeroth-order finite-difference gradient estimation with subspace projection and Muon-style orthogonalization, reducing query variance while exploiting low-rank update structure in memory-efficient fine-tuning. |
| 2026-02-19 | NAMO | Integrates Muon orthogonalized momentum with Adam-type noise adaptation through a norm-based adaptive stepsize, including a diagonal NAMO-D variant for additional stochastic stability. |
| 2026-02-18 | Adam/Muon Implicit Bias | Shows momentum steepest-descent methods such as Muon, Signum, and MomentumGD follow approximate norm-specific steepest-descent trajectories and converge toward KKT points of corresponding max-margin problems. |
| 2026-02-18 | SpecMuon | Adds spectral guidance and mode-wise RSAV step control to Muon for physics-informed neural networks and operators, tempering unit-singular-value steps in stiff multi-scale scientific-learning losses. |
| 2026-02-17 | Magma | Finds random update masking can regularize adaptive optimizers through curvature-dependent smoothing, then introduces momentum-aligned gradient masking as a dense-optimizer alternative that outperforms Adam/Muon in reported LLM pretraining. |
| 2026-02-13 | TrasMuon | Keeps Muon's near-isometric orthogonalized direction but restores magnitude control through global RMS calibration and energy-based trust-region clipping to reduce step-size sensitivity and high-energy bursts. |
| 2026-02-12 | Muon Quadratic Insights | Uses simple strongly convex quadratic examples to show local one-step proxies and worst-case polar-error bounds miss key Muon behavior, motivating a dynamical view of spectral optimizer trajectories. |
| 2026-02-12 | Mini-batch Steepest Descent Bias | Characterizes mini-batch stochastic steepest descent under entry-wise and Schatten-p norms, showing how batch size, momentum, and variance reduction shape max-margin implicit bias for SignSGD- and Muon-like methods. |
| 2026-02-10 | Clarifying Shampoo | Decomposes Shampoo updates into an adapted Muon-like spectral step and shows its stochastic/trajectory adaptation explains why Shampoo can be more token-efficient than plain Muon, paralleling Adam versus Signum. |
| 2026-02-09 | Pion/Leon | Builds adaptive operator-norm matrix online-learning algorithms by smoothing nuclear-norm potentials, yielding Pion/Leon-style methods with regret guarantees and nonsmooth nonconvex optimization rates. |
| 2026-02-08 | TSR-Adam | Introduces two-sided low-rank synchronization for Adam-family distributed training, communicating a compact U^T G V core to reduce bandwidth and memory relative to one-sided low-rank approaches. |
| 2026-02-07 | Sign-Based Heavy-Tail Optimizers | Explains empirical gains of Lion, Muon, and other sign-based optimizers through heavy-tailed gradient noise, giving generalized noise conditions under which sign updates outperform variance-adaptive methods. |
| 2026-02-06 | Unified Vector/Matrix Adaptivity | Decomposes AdaGrad into variance adaptation and scale-invariant update factors, using that split to bridge Adam-style vector adaptivity with Muon-style matrix spectral optimization. |
| 2026-02-06 | Muon LoRA Spectral Growth | Analyzes Muon/SpecGD dynamics in LoRA-style matrix factorization, showing LoRA product singular values grow nearly uniformly even when orthogonalization is applied separately to the low-rank factors. |
| 2026-02-05 | Norm-Constrained Warm-Up | Derives warm-up-then-decay schedules for norm-constrained optimizers such as Muon and Lion from a generalized smoothness assumption where curvature falls with suboptimality, replacing heuristic warm-up with adaptive scheduling theory. |
| 2026-02-05 | Muon Associative Memory | Studies Muon in a softmax linear associative-memory model with hierarchical frequency structure, showing spectral updates reduce frequency-dependent learning imbalance that slows gradient descent. |
| 2026-02-05 | ADANA | Introduces logarithmic-time schedules for AdamW momentum and weight decay, letting gradient-memory horizons grow with training time and using damping to stabilize improved language-model scaling. |
| 2026-02-04 | Canzona | Provides an asynchronous, load-balanced framework for distributed matrix optimizers such as Shampoo, Muon, and SOAP, reconciling holistic matrix updates with tensor sharding in Megatron-style training. |
| 2026-02-04 | BeyondMuon Code | Views Muon as the p=0 endpoint of U Sigma^p V^T spectral transformations and evaluates intermediate RMS/spectral variants to clarify how Muon relates to Adam-style adaptivity. |
| 2026-02-03 | PRISM Spectral Shaping | Adds low-rank quasi-second-order information to first-order spectral descent through innovation-augmented polar decomposition, suppressing high-variance subspaces while preserving signal-dominated directions with little overhead. |
| 2026-02-03 | Non-Euclidean GNS | Generalizes gradient-noise-scale batch-size adaptation to signSGD/Signum and spectral-descent/Muon geometries, replacing Euclidean SGD assumptions with norm-matched stochastic noise estimates. |
| 2026-02-01 | OLion | Combines Lion-style sign momentum with approximate Newton-Schulz orthogonalization and a final entrywise sign step, approximating steepest descent over intersected spectral and l-infinity constraints. |
| 2026-01-30 | Spectra | Identifies persistent spike-tail anisotropy in LLM gradients and proposes spike-aware spectral suppression so dominant low-rank directions do not throttle tail learning, stability, and downstream quality. |
| 2026-01-30 | TEON | Generalizes layer-wise Muon by treating network gradients as structured higher-order tensors and orthonormalizing across tensor modes, with convergence arguments and LLM pretraining experiments. |
| 2026-01-30 | Spectral GD Phase Retrieval | Uses anisotropic phase retrieval as a model problem to show spectral gradient updates can avoid misalignment caused by dominant covariance directions that distract ordinary gradient descent. |
| 2026-01-30 | Mano | Revisits manifold optimization for LLMs by projecting momentum onto tangent spaces and constraining updates on rotational oblique manifolds, seeking less memory/compute than AdamW while preserving curvature structure better than Muon. |
| 2026-01-29 | PRISM Matrix Functions | Accelerates matrix square roots, inverse roots, and orthogonalization used by Shampoo/Muon through adaptive polynomial fitting plus randomized iterative sketching, avoiding eigendecomposition while improving GPU efficiency. |
| 2026-01-29 | FISMO | Combines Fisher-structured adaptivity with Muon-style momentum orthogonalization so update spectra retain curvature information instead of being forced into strict isotropic singular values. |
| 2026-01-29 | MCSD/SPEL | Extends norm-constrained LMO optimizers such as spectral descent and Muon to manifold constraints with a single-loop method: choose a norm-induced Riemannian steepest direction and project back to the manifold. |
| 2026-01-27 | Muon Convergence Rates | Provides a simpler direct analysis of Muon for nonconvex optimization, improving convergence-rate guarantees without relying on restrictive assumptions about the orthogonalized update rule. |
| 2026-01-27 | Muon with Newton-Schulz | Analyzes practical Muon with finite Newton-Schulz orthogonalization, proving it matches ideal SVD-polar Muon convergence up to a constant that improves doubly exponentially with the number of NS steps. |
| 2026-01-21 | Variance-Adaptive Muon | Introduces Muon-NSR and Muon-VS, applying variance-adaptive normalization to momentum before orthogonalization to combine Adam-like stochastic robustness with Muon matrix geometry. |
| 2026-01-20 | Muon Spectral Orthogonalization | Studies simplified Muon in matrix factorization and linear-transformer in-context learning, giving end-to-end explanations of how spectral orthogonalization acts as useful preconditioning. |
| 2026-01-18 | IFNSO | Collapses iterative Newton-Schulz orthogonalization into an iteration-free polynomial formulation by analyzing matrix-power contributions, reducing repeated high-dimensional matmul overhead in Muon/Stiefel settings. |
| 2026-01-13 | Spectral Sphere Optimizer (SSO) Code | Constrains both module weights and updates on a spectral sphere, aligning Muon-style update control with muP-like activation stability so weights cannot drift while updates remain norm-controlled. |
| 2026-01-04 | Principled Muon under muP | Studies how to maintain muP spectral conditions throughout training with matrix optimizers such as Muon, aiming to preserve width-independent dynamics and hyperparameter transfer in practical runs. |
| 2025-12-18 | MVR-Gluon | Adds momentum variance reduction to the Gluon framework, giving a single analysis that covers Muon, Scion, and other LMO optimizers and proves faster stochastic convergence than vanilla momentum. |
| 2025-12-15 | HCM-LMO | Injects Hessian-corrected momentum into arbitrary-norm LMO optimizers such as Muon, Scion, and Gluon, using second-order information to improve rates beyond standard stochastic momentum bounds. |
| 2025-12-10 | Fanions | Constructs Muon-like optimizers from duals of Ky-Fan norms and their Frobenius/l-infinity combinations, yielding Fanions, F-Muon, and S-Muon families that interpolate matrix-update geometries. |
| 2025-12-05 | Matrix-Preconditioned Hyperparameter Transfer | Studies learning-rate and weight-decay scaling for matrix-preconditioned optimizers such as Shampoo, SOAP, and Muon, using hyperparameter transfer to make gains consistent across Llama model sizes. |
| 2025-12-04 | Turbo-Muon | Preconditions the matrix before Newton-Schulz orthogonalization so Muon reaches useful polar accuracy with fewer matrix multiplications, so the acceleration itself adds nearly negligible overhead. |
| 2025-12-03 | Spectral Gradient Conditions | Derives a layer-wise diagnostic comparing gradient nuclear/Frobenius ratio with activation stable rank to predict when a Muon-style spectral step should beat a Euclidean gradient step. |
| 2025-10-07 | NorMuon Code | Combines Muon orthogonalization with neuron-wise normalization and second-order statistics, targeting better scalability and efficiency by jointly leveraging Adam-like adaptivity and Muon matrix geometry. |
| 2025-10-06 | DP-Adam-AC | Adds adaptive clipping to differentially private Adam fine-tuning for localizable LLMs, improving the privacy-utility trade-off when task-specific models must be fine-tuned under privacy constraints. |
| 2025-10-04 | Hill-ADAM | Augments Adam with deterministic state-space exploration through alternating minimization/maximization phases, aiming to escape local minima rather than settling at the first basin visited. |
| 2025-09-29 | Conda | Combines Adam coordinate adaptivity with column-normalized updates, addressing Adam's low-rank/spectrally poor update structure while retaining per-coordinate variance scaling for faster LLM training. |
| 2025-09-19 | AdaGrad++ | Proposes a simpler parameter-free AdaGrad variant with convergence guarantees, removing manual learning-rate tuning while preserving AdaGrad-like rates in convex optimization. |
| 2025-09-19 | Adam++ | Proposes a parameter-free Adam variant with convergence guarantees that match Adam-style rates without assuming preselected learning-rate conditions, reducing manual tuning burden. |
| 2025-09-03 | KL-Shampoo / KL-SOAP Code | Recasts Shampoo/SOAP second-moment estimation as KL-divergence covariance estimation rather than Frobenius fitting, producing KL preconditioners that reduce Adam grafting or Adam-in-eigenbasis overhead. |
| 2025-09-01 | ZO Fine-Tuner | Learns a compact perturbation strategy for zeroth-order LLM fine-tuning, training the optimizer once per foundation model and reusing it across downstream tasks to beat hand-designed ZO baselines. |
| 2025-09 | SRON | Uses row-wise gradient normalization for state-free LLM training to reduce optimizer-state overhead while stabilizing matrix updates. |
| 2025-07-15 | AdaMuon | Applies element-wise second-moment scaling to Muon's orthogonalized directions, with sign-stabilized momentum and RMS-aligned rescaling to combine variance adaptivity with stable matrix geometry. |
| 2025-06-20 | SCALE Code | Finds that column-wise gradient normalization plus last-layer momentum are sufficient minimalist changes to SGD for competitive LLM pretraining, matching Adam with much lower optimizer-state memory. |
| 2025-06-08 | SPlus Code | Stabilizes Shampoo-style whitening by combining historical eigenbases with instantaneous normalization and shape-aware scaling, reducing divergence from stale inverse caches and improving wall-clock efficiency. |
| 2025-05-27 | PolarGrad | Unifies matrix-aware preconditioned optimizers and introduces polar-decomposition update rules that explain links among Adam/Shampoo/Muon-style methods while improving convergence in reported experiments. |
| 2025-05-19 | Gluon | Bridges theory and practice for LMO-based optimizers by generalizing Muon and Scion into a layer-wise framework with better memory efficiency, hyperparameter transfer, and LLM training performance. |
| 2025-04-07 | Dion Code | Replaces Muon's dense Newton-Schulz orthogonalization with distributed amortized power iteration plus error feedback, enabling low-rank orthonormalized updates compatible with sharded LLM training. |
| 2025-02-24 | COSMOS Code | Splits matrix-gradient subspaces between SOAP and Muon: it applies richer adaptive preconditioning to leading eigendirections and cheaper Muon-style updates to the remaining space for memory-efficient LLM training. |
| 2025-02-24 | D-Muon Code | Scales Muon to larger LLM training by adding weight decay and per-parameter update-scale adjustment, reporting about 2x compute efficiency over AdamW under compute-optimal scaling and releasing the Moonlight recipe. |
| 2025-02-11 | Scion Code | Introduces stochastic LMO-over-norm-ball optimizers for unconstrained deep learning, choosing norms that improve memory efficiency and hyperparameter transfer while giving nanoGPT speedups. |
| 2024-12-08 | Muon Code | Defines Muon as SGD/Nesterov momentum followed by Newton-Schulz orthogonalization of 2D hidden-layer updates, usually paired with AdamW for embeddings, heads, and non-matrix parameters. |
| 2024-12-02 | PROFIT | Designs an optimizer specifically for fine-tuning already-converged deep models, using temporal gradient orthogonalization and assumptions about pretrained weights to regularize task adaptation beyond generic SGD/Adam. |
| 2024-11-25 | Cautious Optimizers Code | Adds a one-line cautious mask to momentum optimizers, preserving Adam-style Lyapunov convergence while empirically improving transformer training for C-AdamW, C-Lion, and related variants. |
| 2024-11-15 | MARS Code | Brings variance-reduction ideas into large-model optimization by correcting adaptive/sign optimizer updates with gradient-difference signals, producing MARS variants that improve GPT-style training efficiency. |
| 2024-11-11 | Subset-Norm + Subspace-Momentum Code | Combines Subset-Norm step sharing, which reduces AdaGrad memory from O(d) to O(sqrt(d)), with Subspace-Momentum to lower optimizer-state cost while retaining convergence guarantees. |
| 2024-11-05 | ADOPT Code | Modifies Adam so it converges at the optimal O(1/sqrt(T)) rate for any beta2 without bounded-gradient-noise assumptions, addressing Adam non-convergence while keeping adaptive-gradient practicality. |
| 2024-10-25 | COAT Code | Compresses optimizer states and activations to FP8 during training through dynamic range expansion and related memory techniques, reducing memory footprint beyond FP8 linear-layer-only frameworks. |
| 2024-10-21 | LDAdam Code | Performs Adam-style adaptivity in changing low-dimensional projected subspaces, using projection-aware state updates and generalized error feedback to lower memory while still exploring full parameter space. |
| 2024-09-17 | SOAP Code | Shows Shampoo with half-power preconditioning is equivalent to Adafactor in Shampoo's eigenbasis, then stabilizes and simplifies Shampoo by combining it with Adam-style moment updates. |
| 2024-09-05 | AdEMAMix Code | Adds a second, slower EMA of older gradients to AdamW-style momentum so the optimizer can use both recent directions and longer-term gradient memory, improving token efficiency. |
| 2024-06-24 | Adam-mini Code | Cuts Adam memory by replacing most per-parameter second-moment learning rates with block-level rates chosen from Hessian-structure principles, retaining AdamW-like performance at about half memory. |
| 2024-05-24 | SF-AdamW (Schedule-Free) Code | Removes explicit learning-rate schedules by using schedule-free momentum/iterate-averaging theory, matching scheduled training performance without knowing the stopping step in advance. |
| 2024-05-24 | MicroAdam Code | Compresses gradients before feeding them into Adam states and uses compressed error feedback, reducing optimizer-state memory while preserving convergence guarantees. |
| 2024-05-21 | FAdam Code | Interprets Adam as a diagonal empirical-Fisher natural-gradient method, identifies approximation flaws in standard Adam, and proposes corrections to momentum, bias correction, and epsilon handling. |
| 2023-12-04 | AGD Code | Builds an auto-switchable preconditioner from stepwise gradient differences, using Hessian-related diagonal information to choose between adaptive and non-adaptive behavior during training. |
| 2023-10-16 | AdaLOMO Code | Adds adaptive learning rates to low-memory full-parameter LLM fine-tuning, keeping LOMO-style memory savings while closing much of the convergence gap to AdamW. |
| 2023-09-05 | AdaPlus Code | Combines AdamW, NAdam, and AdaBelief ideas by adding Nesterov momentum and precise step-size adjustment without extra hyperparameters, improving language-model and vision training baselines. |
| 2023-07-28 | CoRe Code | Benchmarks Continual Resilient optimization as a robust all-in-one first-order optimizer, emphasizing smooth convergence, low tuning burden, and broad applicability across machine-learning tasks. |
| 2023-07-18 | Adam+CM Code | Augments Adam with a memory buffer of critical momentum terms that intentionally overshoot narrow minima, encouraging exploration toward flatter basins and better generalization. |
| 2023-07-05 | CAME Code | Uses confidence-guided memory-efficient second-moment estimation to reduce instability in Adafactor-like optimizers, aiming for Adam/LAMB quality with much lower auxiliary memory. |
| 2023-06-16 | LOMO Code | Fuses gradient computation and parameter updates so full-parameter LLM fine-tuning can run under limited GPU memory, avoiding the optimizer-state cost of AdamW-style training. |
| 2023-06-09 | Prodigy Code | Improves D-Adaptation by estimating the distance-to-solution parameter needed for optimal learning rates, yielding a parameter-free optimizer that matches tuned methods across many tasks. |
| 2023-05-25 | WSAM Code | Recasts SAM sharpness as an explicit weighted regularization term, deriving generalization bounds and improving or matching standard optimizers on benchmark datasets. |
| 2023-05-25 | DoWG Code | Introduces Distance-over-Weighted-Gradients, a parameter-free optimizer that adapts to smooth and nonsmooth convex problems using a distance-weighted gradient accumulator. |
| 2023-05-23 | Sophia Code | Uses a lightweight diagonal Hessian estimate as a stochastic second-order preconditioner with clipping, reducing language-model pretraining cost relative to Adam-family methods. |
| 2023-05-09 | UAdam | Provides a unified Adam-type framework covering Adam, NAdam, AMSGrad, AdaBound, AdaFom, and Adan through a general second-moment form with nonconvex stochastic convergence analysis. |
| 2023-02-16 | FOSI Code | Wraps any first-order optimizer with low-dimensional second-order corrections by splitting the objective into orthogonal subspaces, using curvature where useful and the base optimizer elsewhere. |
| 2023-02-13 | Lion Code | Discovers a memory-efficient sign-momentum optimizer through symbolic program search, keeping only momentum state and using update signs rather than Adam-style second moments. |
| 2023-02-08 | DoG Code | Sets SGD step sizes dynamically from distance-to-initial-point and gradient norms, removing the learning-rate hyperparameter while matching tuned SGD on vision and language transfer tasks. |
| 2023-01-18 | D-Adaptation Code | Automatically estimates the learning-rate scale for SGD/Adam/AdaGrad variants without line searches or extra gradients, matching hand-tuned rates across diverse convex and ML experiments. |
| 2022-11-17 | VeLO Code | Meta-trains a neural-network optimizer at large scale across many tasks, producing a versatile learned optimizer that transfers with little tuning and exhibits non-hand-designed update behavior. |
| 2022-10-21 | Amos Code | Adds model-oriented adaptive learning-rate decay and weight decay to Adam-style optimization, improving BERT/T5 pretraining speed while using less slot-variable memory than AdamW. |
| 2022-10-12 | AdaNorm Code | Corrects each iteration's gradient norm using adaptive history so SGD/momentum-style optimizers maintain representative update magnitudes and improve CNN convergence. |
| 2022-08-13 | Adan Code | Derives an adaptive Nesterov momentum estimator that avoids extra extrapolation gradients and estimates first/second moments for faster, robust deep-model training. |
| 2022-06-14 | GradaGrad | Modifies AdaGrad's monotonically shrinking denominator so the effective learning rate can both grow and shrink, preserving similar convergence rates while reducing tuning sensitivity. |
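
Many rows above build on the step the Muon entry describes: momentum followed by Newton-Schulz orthogonalization of a 2D update. The following is a minimal NumPy sketch of that idea under stated assumptions, not the reference implementation. It swaps in the classic cubic Newton-Schulz iteration (rather than any tuned polynomial a real Muon implementation may use) and plain heavy-ball momentum (rather than Nesterov). The names `newton_schulz_orthogonalize` and `muon_like_step`, and the default `lr` and `beta` values, are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=30, eps=1e-7):
    """Approximate the polar factor U V^T of m (assumption: classic cubic iteration)."""
    # Frobenius normalization keeps every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    x = m / (np.linalg.norm(m) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work on the wide orientation so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which has fixed point 1.
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative update: heavy-ball momentum, then orthogonalize the buffer."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return weight - lr * update, momentum_buf


# Usage on a small random "gradient" for a 4x3 weight matrix.
rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))
weight, buf = muon_like_step(np.zeros((4, 3)), grad, np.zeros((4, 3)))
```

The orthogonalized update has all singular values close to 1, so every direction in the momentum matrix contributes equally to the step. Variants listed above change how this factor is computed or what is applied before or after it.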

These papers mention, compare, or rely on Muon-style optimization, but their main contribution is broader than a standalone Muon-family optimizer.

| Date | Paper | Relation |
|---|---|---|
| 2026-06-14 | Schattor | Introduces a Schatten-norm optimizer family that unifies SGD and Muon-like matrix updates with dimension-free stationarity theory; it is adjacent because the contribution is a broader optimizer framework, not a Muon variant. |
| 2026-06-13 | When to use Schatten-p Norm | Analyzes which Schatten-p geometries are optimal under different scaling/noise regimes, explaining when Muon-like Schatten-infinity updates help and when smaller-p geometries are preferable. |
| 2026-06-12 | ZO Parameter-free LMO | Combines zeroth-order fine-tuning with parameter-free LMO methods to reduce memory and tuning burden, borrowing Muon-style geometry but focusing on ZO/PF optimization broadly. |
| 2026-06-12 | Zeta | Diagnoses coordinate-scale heterogeneity in matrices before Newton-Schulz, then proposes coordinate-adaptive dual whitening; Muon appears mainly as a matrix-aware baseline and motivation. |
| 2026-06-11 | Different Layers Different Manifolds | Tests module-wise manifold assignments for GPT-2 training, finding Stiefel constraints suit attention while DGram suits MLPs; Muon is used through a Manifold Muon lens rather than modified directly. |
| 2026-06-09 | Overcoming Rank Collapse in Feedback Alignment | Studies why feedback alignment fails in deeper networks and uses orthogonalized/Muon-style updates to increase feedback-signal rank, but the target is biologically plausible learning, not optimizer design. |
| 2026-06-04 | PC Layer | Adds a polynomial weight-preconditioning layer that reshapes singular spectra during LLM pretraining and can be merged away at inference; Muon is one optimizer tested with the architectural preconditioner. |
| 2026-06-04 | Double Preconditioning | Optimizes for rollout/test-time performance under feedback mismatch rather than validation loss, treating Muon as one possible gradient-wise preconditioner inside a broader DoPr framework. |
| 2026-06-02 | Ultralytics YOLO26 | Presents a real-time vision model family with NMS-free heads and training changes; Muon-SGD appears as part of the recipe, but the main contribution is architecture/system design. |
| 2026-06-01 | WALL-WM | Builds event-grounded world-action-model pretraining for VLA/video-action learning; Muon is part of the large-scale training stack rather than the object of study. |
| 2026-05-30 | Exploiting Weight-Space Symmetries for Approximating Curvature | Uses weight-space symmetry averaging to construct tractable curvature approximations from single gradients, recovering Shampoo/Muon-like structures as cases of a broader symmetry framework. |
| 2026-05-29 | Mellum2 Technical Report | Reports a 12B MoE code/general model and mentions Muon in the FP8 hybrid-precision training recipe; it is model-report evidence of Muon use, not an optimizer contribution. |
| 2026-05-28 | On the Optimizer Dependence of Neural Scaling Laws | Shows scaling-law exponents vary with optimizer choice in random-feature regression and related settings, using Matrix-Sign/Muon-like methods as one optimizer family under comparison. |
| 2026-05-27 | Parallax | Introduces parameterized local linear attention for language modeling and uses Muon to stabilize/scale training; the optimizer is enabling infrastructure, not the paper's main method. |
| 2026-05-26 | How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks | Compares Adam and Muon across equivariant/geometric models, finding Muon often improves optimization; it analyzes optimizer effects rather than proposing a new optimizer. |
| 2026-05-26 | The Stability of Singular Distribution | Identifies early stabilization of trace-normalized singular spectra during LLM pretraining across schedules and optimizers including Muon, connecting spectral dynamics to two-phase loss curves. |
| 2026-05-23 | Momentum Streams for Optimizer-Inspired Transformers | Interprets Transformer residual updates as optimizer steps and designs architectures inspired by momentum, Adam, Muon, and SOAP; the contribution is architectural, not an optimizer update rule. |
| 2026-05-20 | Same Architecture, Different Capacity | Measures FFN representation spectra and shows AdamW and Muon produce different spectral scaling laws at fixed architecture, framing optimizer choice as a capacity-scaling axis. |
| 2026-05-18 | Scale-Invariant Neural Network Optimization | Develops theory for scale-invariant optimizers under norm geometry and heavy-tailed noise, including Muon and Scion as examples within a broader optimization principle. |
| 2026-05-09 | Navigating LLM Valley | Surveys LLM optimizer design from AdamW through memory-efficient and matrix-based methods, positioning Muon within the broader optimizer landscape rather than introducing a method. |
| 2026-05-07 | Optimizer-Model Consistency | Shows full fine-tuning with the same optimizer used in pretraining can reduce forgetting, comparing Muon and AdamW behavior as evidence for optimizer-shaped model states. |
| 2026-04-16 | Benchmarking Optimizers for MLPs in Tabular Deep Learning | Benchmarks 15 optimizers for tabular MLPs and finds Muon strong against AdamW; it is empirical optimizer evaluation rather than new algorithm design. |
| 2026-04-02 | Normalization-Optimizer Coupling | Tests normalization-layer choices with AdamW and Muon at 1B scale, showing Dynamic Erf interacts badly with Muon while RMSNorm/DyT behave differently. |
| 2026-03-30 | Spectral Edge Dynamics | Uses rolling-window spectra of parameter updates to analyze phase transitions such as grokking and loss plateaus; optimizer spectra are diagnostic tools, not new update rules. |
| 2026-02-25 | veScale-FSDP | Improves FSDP/ZeRO infrastructure for block-structured computations and non-element-wise optimizers such as Shampoo and Muon, focusing on sharding/runtime support. |
| 2026-02-07 | Robust Scaling Laws for Optimizers | Studies Chinchilla-style and optimizer-specific scaling laws across AdamW, Muon, Shampoo, SOAP, and others, asking how optimizer choice changes compute/data/model scaling. |
| 2026-01-31 | Data Distribution as an Optimizer Lever | Analyzes whether changing training data distribution can steer optimizer generalization behavior, comparing GD and SAM; it is adjacent to optimizer choice but not Muon-specific. |
| 2026-01-14 | Muon-Optimized Distillation and Quantization | Combines GPTQ quantization, LoRA, distillation, and Muon-based fine-tuning in a deployment pipeline for compressed LLMs, using Muon as a component rather than proposing it. |
| 2026-01-08 | Learnable Multipliers | Adds learnable scalar multipliers to matrix layers to escape weight-decay/noise equilibrium norms, validating the scaling idea under Adam and Muon. |
| 2025-12-16 | Optimizing Rank for INRs | Argues vanilla MLP INRs fail at high frequencies due to stable-rank degradation and uses Muon-like high-rank updates/rank regularization to preserve representation rank. |
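
Several entries above refer to Schatten-p norms, with Muon-like updates associated with the Schatten-infinity (spectral) case. As a small illustration, the sketch below computes a Schatten-p norm from the singular values of a matrix. The function name `schatten_norm` is hypothetical and is not taken from any of the listed papers.

```python
import numpy as np


def schatten_norm(m, p):
    """Schatten-p norm: the vector p-norm of the singular values of m."""
    s = np.linalg.svd(m, compute_uv=False)
    if np.isinf(p):
        return float(s.max())  # p = inf: spectral norm (largest singular value)
    return float((s ** p).sum() ** (1.0 / p))  # p = 1: nuclear, p = 2: Frobenius


# Usage: a diagonal matrix whose singular values are 4 and 3.
m = np.diag([3.0, 4.0])
nuclear = schatten_norm(m, 1)
frobenius = schatten_norm(m, 2)
spectral = schatten_norm(m, np.inf)
```

For this matrix the three values should be approximately 7, 5, and 4, which shows how the choice of p changes how much weight the smaller singular values receive.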