Awesome Optimizers List

Curated optimizer-design papers from 2022 onward, listed in reverse chronological order by exact publication or submission date.

CSV: data/optimizers.csv
Weakly related Muon papers: data/muon_weakly_related.csv
Date source: arXiv published date for arXiv papers; source-page submission date for OpenReview/blog entries.
Note: SRON is a legacy unsourced row without a paper URL, so its date remains month-level until the source paper is verified.
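
The ordering rule above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's tooling: the helper names `date_key` and `load_sorted` are hypothetical, the sample summaries are shortened, and the sketch assumes the tab-separated header shown in the table below (the actual column layout of data/optimizers.csv is not stated here). It also assumes a month-level date such as SRON's sorts after every day-level date in the same month, which matches where that row sits in the list.

```python
import csv
import io

# Sample rows modeled on the table (summaries shortened for the example).
# Assumption: tab-separated columns named Date, Optimizer Name, Summary.
SAMPLE = (
    "Date\tOptimizer Name\tSummary\n"
    "2025-09\tSRON\tRow-wise gradient normalization.\n"
    "2025-09-19\tAdam++\tParameter-free Adam variant.\n"
    "2026-06-16\tMGUP\tPlug-in wrapper for AdamW, Lion, and Muon.\n"
    "2025-09-01\tZO Fine-Tuner\tLearned zeroth-order perturbation strategy.\n"
)


def date_key(date_text):
    """Turn 'YYYY-MM-DD' or 'YYYY-MM' into a sortable (year, month, day) tuple.

    A missing day becomes 0, so a month-level date ranks below every
    day-level date in that month (an assumption of this sketch).
    """
    parts = [int(piece) for piece in date_text.split("-")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def load_sorted(text):
    """Parse tab-separated rows and return them newest first."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    return sorted(rows, key=lambda row: date_key(row["Date"]), reverse=True)


rows = load_sorted(SAMPLE)
names = [row["Optimizer Name"] for row in rows]
print(names)
```

Padding the missing day with 0 keeps the key a plain tuple of integers, so no date library is needed and month-level rows never raise a parsing error.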

Date	Optimizer Name	Summary
2026-06-16	MGUP Code	Selects a fixed fraction of parameters for larger momentum-gradient-aligned steps while giving the rest smaller nonzero updates; works as a plug-in wrapper for AdamW, Lion, and Muon with convergence guarantees and LLM training experiments.
2026-06-15	Hyperball Optimization	Constrains both weight-matrix and optimizer-update Frobenius norms to fixed constants around a base optimizer such as Adam or Muon; reports 20-30% token-equivalent speedups on Qwen3-style models up to 1.2B and better learning-rate transfer.
2026-06-15	CacheMuon	Caches temporally correlated polar-factor information from previous steps to precondition Muon's Newton-Schulz computation, reducing repeated orthogonalization cost while exposing a quality-efficiency trade-off.
2026-06-12	Free Heavy-Tailed Lunch for Muon	Gives heavy-tailed nonconvex theory showing matrix-valued non-Euclidean optimizers such as Muon and Scion can avoid dimension-dependent costs and achieve stronger stationarity guarantees than Euclidean methods.
2026-06-11	Muon^p	Interpolates between gradient descent and Muon by using fractional spectral-power updates U Sigma^p V^T; derives practical low-degree bivariate matrix recurrences because fixed univariate polynomial iterations cannot compute the required powers.
2026-06-11	LoRA-Muon	Derives a Muon-style spectral steepest-descent rule on the low-rank LoRA manifold, paired with split weight decay, to reduce initialization and stepsize sensitivity and improve rank/width/depth learning-rate transfer.
2026-06-09	FOGO	Frames forgetting as step-level gradient interference, then spectrally orthogonalizes momentum updates and keeps a compact codebook of past directions so dominant minibatch gradients do not erase rare useful directions.
2026-06-08	Muon Robust Transfer	Evaluates pretrained models under corrupted image/text shifts and layer-wise probes, finding Muon-trained features more robust and transferable than Adam or SGD features across transformer and CNN settings.
2026-06-07	OptMuon	Combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm coefficient schedule, turning orthogonalized momentum into a closed-loop method that calibrates update magnitudes from observed optimization history.
2026-06-07	Muon Spectral Dynamics	Analyzes Muon's polar update as a flat-spectrum, entropy-maximizing bias under alignment assumptions, deriving singular-value dynamics and showing how the update geometry changes noise behavior rather than merely rescaling gradients.
2026-06-03	Why Muon Outperforms Adam: A Curvature Perspective	Uses second-order Taylor decomposition to show Muon and Adam have similar first-order gain at matched validation loss, while Muon pays a smaller curvature penalty through lower normalized directional sharpness.
2026-06-02	Spectral Scaling Laws of Muon	Measures singular-value quantiles of Muon momentum matrices across model sizes and layers, identifying which directions finite Newton-Schulz misses and turning the spectra into layer-aware orthogonalization guidance.
2026-06-02	Denoise First, Orthogonalize Later	Models Muon momentum as a spectral filter that suppresses perturbation modes before orthogonalization, increasing the signal-perturbation gap and stabilizing the singular subspaces passed into Newton-Schulz/polar updates.
2026-06-01	A Note on Stability for Orthogonalized Matrix Momentum	Proves finite-round generalization bounds for client-sampled distributed optimization with orthogonalized matrix momentum, explicitly tracking heterogeneous client sampling and finite-step Newton-Schulz effects.
2026-05-29	How Much Orthogonalization Does Muon Need?	Studies relaxed low-precision Newton-Schulz schedules for Muon and shows cheaper partial orthogonalization can preserve training behavior, separating the need for useful spectral shaping from exact polar accuracy.
2026-05-29	Softsign / SoftMuon	Introduces SoftSignum, replacing hard sign updates with a temperature-controlled soft-sign map so updates can move between sign-like and magnitude-sensitive behavior; extends the same relaxation to matrix optimizers as SoftMuon with quantile temperature scheduling.
2026-05-26	Entry-Wise Clipping for Muon	Models language-model gradient noise as entry-wise heavy-tailed contamination, derives an entry-wise clipping surrogate that can control spectral noise, and positions it as a cheaper structural alternative or complement to Muon-style spectral normalization.
2026-05-26	Spectral Descent (SD/TSD)	Studies simplified Muon-like Spectral Descent and Truncated Spectral Descent under non-smooth convex objectives, proving linear convergence under sharpness and connecting regularized variants to Frank-Wolfe-style sublinear guarantees.
2026-05-26	Muon Adversarial Training	Tests whether Muon-style orthogonalized matrix updates improve adversarial training, deriving a spectral-norm stability ceiling and evaluating robustness across architectures and heterogeneous threat models.
2026-05-26	MONA	Adds a Nesterov-like acceleration term from an EMA of gradient differences directly into Muon before orthogonalization, with convergence analysis arguing it helps escape sharp minima while preserving Muon spectral regularization.
2026-05-26	MuCon	Replaces Muon's polar direction, which maps all singular values to one, with singular-value-clipped updates under a SpectralP scaling recipe, using a spectral-norm-ball projection to retain controlled magnitude information.
2026-05-25	EMA-Nesterov	Reinterprets Nesterov acceleration as trajectory extrapolation and replaces noisy one-step lookahead with an EMA of parameter updates, yielding a wrapper that stabilizes acceleration for stochastic nonconvex deep-learning training.
2026-05-23	Muon in Vision Transformers	Benchmarks Muon against AdamW for ViT training on ImageNet-100 and Pl@ntNet-300K, showing recipe-dependent gains and linking heavy augmentation to healthier matrix-gradient spectra and less late-training mode collapse.
2026-05-22	Regularized Muon Flow	Interprets smoothed Muon orthogonalization as the gradient of a Fenchel-dual nuclear-norm smoothing, recasting Muon as a mirror/prox update and deriving Hamiltonian probability-gradient-flow dynamics for mean-field training views.
2026-05-21	AMUSE	Explains Muon through a river-valley loss-landscape picture where orthogonalization speeds flat-direction progress but amplifies dominant-direction oscillations; combines Muon with schedule-free iterate averaging for anytime stable gradient evaluation.
2026-05-21	Layerwise Learning Rates (LLR)	Uses Heavy-Tailed Self-Regularization estimates from layer weight spectra to assign larger learning rates to weakly heavy-tailed layers and smaller rates to strongly heavy-tailed layers, reporting faster AdamW and Muon LLM pretraining.
2026-05-19	LionMuon	Alternates cheap Lion/sign-style steps with expensive Muon spectral steps using a shared dual-EMA buffer, matching Lion memory while reducing average cost and reporting Pareto gains over Muon, Lion, Signum, and AdamW.
2026-05-19	Schatten-p Adaptive Optimization	Derives a data-driven criterion from gradient and activation statistics to choose layer-wise Schatten-p LMO geometries, interpolating between SGD, Muon, Adam, and MuAdam-style updates instead of fixing the optimizer geometry.
2026-05-19	MiMuon	Analyzes Muon generalization via algorithmic stability, then mixes Muon orthogonalized gradients with momentum SGD to improve generalization while retaining fast matrix-aware convergence for large models.
2026-05-19	High-Pass Pion	Identifies failure modes of full Muon whitening in VLA training and RLVR, where low-rank or low-SNR gradients make tail amplification harmful; proposes Pion as a high-pass spectral filter that promotes useful directions while suppressing noise.
2026-05-18	Distance-Aware Muon	Develops adaptive step-scaling rules for normalized Muon directions, using trajectory distance, scale calibration, or descent certificates to set trust-region radii and reduce manual global step-size tuning.
2026-05-18	Ringmaster LMO	Extends Muon/LMO-style momentum to heterogeneous distributed training by allowing asynchronous delayed gradients, aiming to avoid straggler bottlenecks while preserving LMO update geometry.
2026-05-18	Symmetry-Compatible Optimizers	Assigns row-norm, spectral, or hybrid optimizer geometries according to parameter symmetry groups, extending Muon-style equivariance so each block receives an update compatible with its transformation structure.
2026-05-18	AMO	Adapts how often each matrix is orthogonalized by estimating Newton-Schulz difficulty from layer geometry, spending Muon computation where it matters and reducing unnecessary matrix-multiply cost elsewhere.
2026-05-16	DynMuon	Generalizes Muon from the polar factor U V^T to dynamic spectral shaping U Sigma^p V^T, tuning the exponent during training so updates can vary between gradient-preserving and spectrum-flattening behavior.
2026-05-13	Muon Spectral Flattening	Analyzes Muon as a spectral-flattening operation that enlarges stable learning-rate ranges and accelerates convergence by redistributing update energy across singular directions rather than following raw gradient magnitudes.
2026-05-13	DP-Muon	Combines differentially private per-example clipping and Gaussian noise with momentum and Newton-Schulz orthogonalization, adapting Muon to private training while studying privacy, utility, and spectral-update stability.
2026-05-12	Spectral Preconditioning	Formulates constrained stochastic spectral preconditioning as a proximal extension of Muon and Scion, giving theory for heavy-tailed noise and matrix-norm geometries where spectral updates can outperform Euclidean ones.
2026-05-12	Pion	Uses non-additive left and right orthogonal transforms to alter Muon-style update spectra while preserving singular values, separating directional rotation from full polar flattening in matrix optimization.
2026-05-12	MuonQ Code	Compresses Muon optimizer states to 4-bit precision by optimizing directional fidelity, targeting memory reduction while preserving the orthogonalized update directions needed for training quality.
2026-05-11	Error Whitening	Frames Gauss-Newton improvements as whitening prediction errors in function space, comparing the induced dynamics with Newton, Adam, and Muon to explain when curvature-aware whitening can help.
2026-05-11	Freon/Kaon	Tests Muon-like optimizers with Schatten and randomized spectra, arguing that update alignment and descent potential can matter more than exactly matching a particular matrix geometry.
2026-05-11	Muon Fine-tuning Transfer	Studies optimizer mismatch when switching Adam-pretrained models to Muon for fine-tuning, showing forgetting and instability correlate with update strength and proposing transfer procedures for safer Muon fine-tuning.
2026-05-11	Optimizer-Induced Mode Connectivity	Shows same-optimizer solution sets can be connected while AdamW and Muon regions may be separated by loss barriers, using theory and GPT-2 pretraining paths to expose optimizer-dependent implicit regularization.
2026-05-11	Muown	Diagnoses Muon spectral-norm drift as row-magnitude growth rather than row-coherence change, then treats row magnitudes as explicit optimizer variables while applying Muon to the remaining direction component.
2026-05-11	SODA	Unifies optimizers including Muon, Lion, AdEMAMix, and NAdam as optimistic dual-averaging methods, then proposes a SODA wrapper with a 1/k weight-decay schedule to reduce weight-decay tuning.
2026-05-10	Muon Phase Analysis	Derives deterministic dynamics for stochastic SignSVD/Muon-like spectral optimizers on matrix least squares, identifying batch-size phases where Muon preconditions covariance spectra and when it degenerates toward SGD-like behavior.
2026-05-10	Dimension-Free Muon Escape	Analyzes Muon saddle-point escape in high-dimensional landscapes and proves its nonlinear spectral shaping can avoid dimension-dependent trapping that affects element-wise adaptive optimizers such as AdamW.
2026-05-10	Intrinsic Muon	Lifts Muon-style linear minimization oracles to Riemannian matrix manifolds by defining intrinsic tangent-space norms from the metric, preserving quotient symmetries for low-rank, orthogonal, and SPD parameters.
2026-05-09	ZO Partial Orthogonalization	Adapts spectral optimization to zeroth-order LLM fine-tuning, showing full orthogonalization is too noisy and proposing power-iteration-based partial orthogonalization to exploit weak spectral directions safely.
2026-05-09	Muon Non-Convergence	Proves Muon can fail to converge on convex Lipschitz functions under any learning-rate schedule, then shows error feedback restores theoretical convergence even though it can hurt empirical image-classification performance.
2026-05-09	Muon-OGD	Builds a continual-learning projection method using Muon-style spectral-norm geometry, replacing Frobenius orthogonal-gradient projection with spectral-aware updates for matrix-valued LLM parameters.
2026-05-09	Group Muon	Compares full-matrix, head-wise, and grouped Muon for attention projections, deriving a trade-off between group-wise whitening gain and grouping-induced norm cost and tuning group size as an optimizer hyperparameter.
2026-05-08	PolarAdamW	Applies Muon's Newton-Schulz polar map to AdamW-preconditioned directions, separating polar spectral control from Schur gauge-equivariance and testing the hybrid on DeiT-Tiny vision training.
2026-05-08	OrScale-LM	Adds a layer-wise trust ratio to Muon using the Frobenius norm of the actual parameter-space update direction, avoiding failure modes of naive Muon-LAMB hybrids and calibrating language-model layers at initialization.
2026-05-07	Orth-Dion	Shows Dion's column-normalized low-rank approximation misses the rank-r polar factor targeted by Muon, then orthogonalizes the compressed factor to remove geometric mismatch in distributed spectral optimization.
2026-05-07	Nesterov Muon	Develops convergence theory for practical Muon with Nesterov momentum, heavy-tailed stochastic gradients, and inexact/randomized polar decomposition, quantifying how approximation errors propagate.
2026-05-07	SignSGD/Muon Lower Bounds Code	Uses l1 stationarity, l-infinity smoothness, and separable noise assumptions to derive matching lower and upper bounds explaining when sign-based methods such as SignSGD and Muon can beat SGD.
2026-05-07	Pro-KLShampoo	Observes spike-and-flat spectra in KL-Shampoo Kronecker preconditioners and projects the structured preconditioned direction through orthogonalization, recovering whitening in a Muon-adjacent optimizer.
2026-05-07	Implicit Gradient Transport	Introduces LMO-IGT, using implicit gradient transport to accelerate LMO-based optimizers like Lion and Muon without extra gradient evaluations, alongside a regularized support-function stationarity measure.
2026-05-05	Aurora Code	Diagnoses Muon row-leverage anisotropy on tall matrices as a cause of dead MLP neurons, then adds leverage-aware row normalization/equilibration so rectangular-matrix updates keep useful row mass while preserving polar precision.
2026-05-05	Nora	Projects row-wise momentum onto the orthogonal complement of weights to respect scale invariance, aiming to combine Muon-like matrix preconditioning, stable norm/angular dynamics, and O(mn) update cost.
2026-05-04	SignMuon	Combines Muon polar directions with signSGD-style 1-bit majority-vote communication: workers orthogonalize local momentum, transmit entrywise signs, and optionally apply a local polar correction for bandwidth-efficient distributed training.
2026-04-27	SUDA-Muon	Shows decentralized Muon is hard because matrix-sign orthogonalization does not commute with gossip averaging, then separates primal-dual communication from the nonlinear Muon step using a SUDA template with convergence guarantees.
2026-04-16	CLion	Adds cautious-update masking to Lion and studies generalization via algorithmic stability, giving theory for Lion-style sign momentum and empirical evidence that the cautious variant improves robustness/generalization.
2026-04-12	Federated Gluon	Adapts Gluon/Muon-style LMO optimization to federated learning with unbiased or contraction compressors plus SARAH-style variance reduction, targeting communication-efficient non-Euclidean training.
2026-04-11	Muon^2	Applies Adam-style second-moment preconditioning before Muon orthogonalization so the Newton-Schulz polar approximation starts from a better-conditioned matrix, improving both update quality and iteration efficiency.
2026-04-10	APT for MTL	Studies why multi-task-learning gradient balancing can be weakened by advanced optimizer momentum, then adjusts the interaction with optimizers such as Muon so task de-conflicting affects the actual update direction.
2026-04-09	Adam-HNAG	Reformulates full-batch Adam through variable/operator splitting and curvature-aware gradient correction, yielding Adam-HNAG flows and discrete variants with Lyapunov-based accelerated convergence guarantees.
2026-04-06	Muon Spectral Wasserstein Flow	Derives continuous-time mean-field dynamics for normalized matrix flows, defining Spectral Wasserstein distances where the operator-norm case captures Muon geometry and Schatten norms interpolate with classical W2.
2026-04-06	Muon-Accelerated Tensor GLM	Applies Muon-style orthogonalized acceleration to low-separation-rank tensor generalized linear models, preserving tensor structure while improving the block-coordinate estimation used in LSR tensor regression.
2026-04-05	SIFT/Subspace Control	Frames constrained model steering as spectral subspace-control optimization, using subspace orthogonalization to reduce interference between the primary objective and safety/privacy/task constraints.
2026-04-01	Newton-Muon	Derives an optimizer from a quadratic surrogate involving the gradient, output-space curvature, and input data matrix, adding input-side Newton preconditioning to Muon-style orthogonalized updates.
2026-03-30	HyperP	Builds hypersphere parameterization for Muon-style Frobenius-norm-constrained training, transferring optimal learning rates across width, depth, token budget, and MoE granularity under a fixed-norm parameterization.
2026-03-30	MuonEq	Balances the momentum matrix before Newton-Schulz using row, column, or two-sided lightweight equilibration, improving the singular-value geometry seen by finite-step Muon orthogonalization.
2026-03-27	Sharp Capacity Scaling	Uses linear associative memory as a tractable factual-recall model to compare one-step recovery rates for Muon, SGD, and Newton, characterizing when spectral optimizers recover overcomplete associations faster.
2026-03-20	RMNP Code	Replaces Muon's Newton-Schulz preconditioner with row-momentum normalization, targeting O(mn) matrix preconditioning that keeps much of Muon's benefit with lower wall-clock overhead.
2026-03-18	MUD	Substitutes Muon's repeated polar-factor matrix multiplications with a triangular Cholesky/Gauss-Seidel-inspired whitening surrogate, aiming to decorrelate momentum faster on transformer matrices.
2026-03-16	Hyperparameter Scaling Laws	Uses convergence bounds for LMO-based optimizers, including normalized SGD, signSGD/Adam proxies, and Muon, to derive scaling rules for batch size and training horizon beyond model-size-only transfer.
2026-03-16	Muon Heavy-Tailed Convergence	Proves Muon convergence for nonconvex Holder-smooth empirical risk under bounded heavy-tailed stochastic noise, weakening the usual bounded-variance assumptions common in optimizer theory.
2026-03-15	SPECTRA	Adds post-update spectral clipping and optional pre-filtering to control large update spectral norms and sparse spectral noise spikes, making spectral structure explicit even for AdamW-style optimizers.
2026-03-10	HTMuon	Uses Heavy-Tailed Self-Regularization theory to modify Muon so updates preserve heavier-tailed spectra instead of over-flattening noise directions, improving LLM pretraining and image-classification performance in experiments.
2026-03-10	Mousse	Adds curvature-aware preconditioning to Muon so the optimizer does not apply uniform spectral steps across highly anisotropic curvature directions, reducing high-curvature instability and flat-direction underprogress.
2026-03-10	MOGA	Interprets AdamW, Muon, and related methods as steepest descent under mean-normalized matrix operator norms, deriving row/column-normalized updates for width-stable hyperparameter transfer.
2026-03-04	NuMuon	Adds a nuclear-norm constraint to Muon to encourage compressible low-rank weight structure while preserving Muon-style full-rank training benefits and favorable convergence behavior.
2026-02-28	Muon Simplicity Bias	Investigates downside biases introduced by Muon's speed-oriented spectral updates, analyzing cases where faster optimization can trade off with the simplicity bias often associated with generalization.
2026-02-28	MuonRec Code	Adapts Muon orthogonalized momentum to scalable recommendation training, challenging AdamW defaults and reporting fewer training steps plus improved ranking quality in generative recommender models.
2026-02-27	LoRA-Pre Code	Reinterprets optimizer momentum EMAs as online linear regressors, then stores Adam/Muon momentum in a low-rank LoRA-style subspace to reduce optimizer-state memory for pretraining and fine-tuning.
2026-02-26	LITE Code	Uses a Riemannian dynamics view to show Muon and SOAP can be too conservative in flat directions, then increases flat-direction damping/learning rates to accelerate LLM pretraining.
2026-02-26	FlashOptim Code	Reduces mixed-precision training memory by combining optimizer-state quantization with master-weight splitting while preserving API compatibility and model quality for large-model training.
2026-02-25	MUON+ Code	Identifies post-polar row/column imbalance after practical Newton-Schulz Muon steps and adds one normalization step to improve blockwise descent without changing Muon's core orthogonalized-momentum design.
2026-02-24	Spectral Conditions for muP	Extends maximal-update-parameterization ideas to spectral optimizers by deriving conditions for feature-learning hyperparameter transfer across Shampoo, Muon, and related matrix methods.
2026-02-19	ZO-Muon	Combines zeroth-order finite-difference gradient estimation with subspace projection and Muon-style orthogonalization, reducing query variance while exploiting low-rank update structure in memory-efficient fine-tuning.
2026-02-19	NAMO	Integrates Muon orthogonalized momentum with Adam-type noise adaptation through a norm-based adaptive stepsize, including a diagonal NAMO-D variant for additional stochastic stability.
2026-02-18	Adam/Muon Implicit Bias	Shows momentum steepest-descent methods such as Muon, Signum, and MomentumGD follow approximate norm-specific steepest-descent trajectories and converge toward KKT points of corresponding max-margin problems.
2026-02-18	SpecMuon	Adds spectral guidance and mode-wise RSAV step control to Muon for physics-informed neural networks and operators, tempering unit-singular-value steps in stiff multi-scale scientific-learning losses.
2026-02-17	Magma	Finds random update masking can regularize adaptive optimizers through curvature-dependent smoothing, then introduces momentum-aligned gradient masking as a dense-optimizer alternative that reportedly outperforms Adam and Muon in LLM pretraining.
2026-02-13	TrasMuon	Keeps Muon's near-isometric orthogonalized direction but restores magnitude control through global RMS calibration and energy-based trust-region clipping to reduce step-size sensitivity and high-energy bursts.
2026-02-12	Muon Quadratic Insights	Uses simple strongly convex quadratic examples to show local one-step proxies and worst-case polar-error bounds miss key Muon behavior, motivating a dynamical view of spectral optimizer trajectories.
2026-02-12	Mini-batch Steepest Descent Bias	Characterizes mini-batch stochastic steepest descent under entry-wise and Schatten-p norms, showing how batch size, momentum, and variance reduction shape max-margin implicit bias for SignSGD- and Muon-like methods.
2026-02-10	Clarifying Shampoo	Decomposes Shampoo updates into an adapted Muon-like spectral step and shows its stochastic/trajectory adaptation explains why Shampoo can be more token-efficient than plain Muon, paralleling Adam versus Signum.
2026-02-09	Pion/Leon	Builds adaptive operator-norm matrix online-learning algorithms by smoothing nuclear-norm potentials, yielding Pion/Leon-style methods with regret guarantees and nonsmooth nonconvex optimization rates.
2026-02-08	TSR-Adam	Introduces two-sided low-rank synchronization for Adam-family distributed training, communicating a compact U^T G V core to reduce bandwidth and memory relative to one-sided low-rank approaches.
2026-02-07	Sign-Based Heavy-Tail Optimizers	Explains empirical gains of Lion, Muon, and other sign-based optimizers through heavy-tailed gradient noise, giving generalized noise conditions under which sign updates outperform variance-adaptive methods.
2026-02-06	Unified Vector/Matrix Adaptivity	Decomposes AdaGrad into variance adaptation and scale-invariant update factors, using that split to bridge Adam-style vector adaptivity with Muon-style matrix spectral optimization.
2026-02-06	Muon LoRA Spectral Growth	Analyzes Muon/SpecGD dynamics in LoRA-style matrix factorization, showing LoRA product singular values grow nearly uniformly even when orthogonalization is applied separately to the low-rank factors.
2026-02-05	Norm-Constrained Warm-Up	Derives warm-up-then-decay schedules for norm-constrained optimizers such as Muon and Lion from a generalized smoothness assumption where curvature falls with suboptimality, replacing heuristic warm-up with adaptive scheduling theory.
2026-02-05	Muon Associative Memory	Studies Muon in a softmax linear associative-memory model with hierarchical frequency structure, showing spectral updates reduce frequency-dependent learning imbalance that slows gradient descent.
2026-02-05	ADANA	Introduces logarithmic-time schedules for AdamW momentum and weight decay, letting gradient-memory horizons grow with training time and using damping to stabilize improved language-model scaling.
2026-02-04	Canzona	Provides an asynchronous, load-balanced framework for distributed matrix optimizers such as Shampoo, Muon, and SOAP, reconciling holistic matrix updates with tensor sharding in Megatron-style training.
2026-02-04	BeyondMuon Code	Views Muon as the p=0 endpoint of U Sigma^p V^T spectral transformations and evaluates intermediate RMS/spectral variants to clarify how Muon relates to Adam-style adaptivity.
2026-02-03	PRISM Spectral Shaping	Adds low-rank quasi-second-order information to first-order spectral descent through innovation-augmented polar decomposition, suppressing high-variance subspaces while preserving signal-dominated directions with little overhead.
2026-02-03	Non-Euclidean GNS	Generalizes gradient-noise-scale batch-size adaptation to signSGD/Signum and spectral-descent/Muon geometries, replacing Euclidean SGD assumptions with norm-matched stochastic noise estimates.
2026-02-01	OLion	Combines Lion-style sign momentum with approximate Newton-Schulz orthogonalization and a final entrywise sign step, approximating steepest descent over intersected spectral and l-infinity constraints.
2026-01-30	Spectra	Identifies persistent spike-tail anisotropy in LLM gradients and proposes spike-aware spectral suppression so dominant low-rank directions do not throttle tail learning, stability, and downstream quality.
2026-01-30	TEON	Generalizes layer-wise Muon by treating network gradients as structured higher-order tensors and orthonormalizing across tensor modes, with convergence arguments and LLM pretraining experiments.
2026-01-30	Spectral GD Phase Retrieval	Uses anisotropic phase retrieval as a model problem to show spectral gradient updates can avoid misalignment caused by dominant covariance directions that distract ordinary gradient descent.
2026-01-30	Mano	Revisits manifold optimization for LLMs by projecting momentum onto tangent spaces and constraining updates on rotational oblique manifolds, seeking less memory/compute than AdamW while preserving curvature structure better than Muon.
2026-01-29	PRISM Matrix Functions	Accelerates matrix square roots, inverse roots, and orthogonalization used by Shampoo/Muon through adaptive polynomial fitting plus randomized iterative sketching, avoiding eigendecomposition while improving GPU efficiency.
2026-01-29	FISMO	Combines Fisher-structured adaptivity with Muon-style momentum orthogonalization so update spectra retain curvature information instead of being forced into strict isotropic singular values.
2026-01-29	MCSD/SPEL	Extends norm-constrained LMO optimizers such as spectral descent and Muon to manifold constraints with a single-loop method: choose a norm-induced Riemannian steepest direction and project back to the manifold.
2026-01-27	Muon Convergence Rates	Provides a simpler direct analysis of Muon for nonconvex optimization, improving convergence-rate guarantees without relying on restrictive assumptions about the orthogonalized update rule.
2026-01-27	Muon with Newton-Schulz	Analyzes practical Muon with finite Newton-Schulz orthogonalization, proving it matches ideal SVD-polar Muon convergence up to a constant that improves doubly exponentially with the number of NS steps.
2026-01-21	Variance-Adaptive Muon	Introduces Muon-NSR and Muon-VS, applying variance-adaptive normalization to momentum before orthogonalization to combine Adam-like stochastic robustness with Muon matrix geometry.
2026-01-20	Muon Spectral Orthogonalization	Studies simplified Muon in matrix factorization and linear-transformer in-context learning, giving end-to-end explanations of how spectral orthogonalization acts as useful preconditioning.
2026-01-18	IFNSO	Collapses iterative Newton-Schulz orthogonalization into an iteration-free polynomial formulation by analyzing matrix-power contributions, reducing repeated high-dimensional matmul overhead in Muon/Stiefel settings.
2026-01-13	Spectral Sphere Optimizer (SSO) Code	Constrains both module weights and updates on a spectral sphere, aligning Muon-style update control with muP-like activation stability so weights cannot drift while updates remain norm-controlled.
2026-01-04	Principled Muon under muP	Studies how to maintain muP spectral conditions throughout training with matrix optimizers such as Muon, aiming to preserve width-independent dynamics and hyperparameter transfer in practical runs.
2025-12-18	MVR-Gluon	Adds momentum variance reduction to the Gluon framework, giving one theory path that covers Muon, Scion, and other LMO optimizers and proves faster stochastic convergence than vanilla momentum.
2025-12-15	HCM-LMO	Injects Hessian-corrected momentum into arbitrary-norm LMO optimizers such as Muon, Scion, and Gluon, using second-order information to improve rates beyond standard stochastic momentum bounds.
2025-12-10	Fanions	Constructs Muon-like optimizers from duals of Ky-Fan norms and their Frobenius/l-infinity combinations, yielding Fanions, F-Muon, and S-Muon families that interpolate matrix-update geometries.
2025-12-05	Matrix-Preconditioned Hyperparameter Transfer	Studies learning-rate and weight-decay scaling for matrix-preconditioned optimizers such as Shampoo, SOAP, and Muon, using hyperparameter transfer to make gains consistent across Llama model sizes.
2025-12-04	Turbo-Muon	Preconditions the matrix before Newton-Schulz orthogonalization so Muon reaches useful polar accuracy with fewer matrix multiplications, making the overhead of acceleration nearly negligible.
2025-12-03	Spectral Gradient Conditions	Derives a layer-wise diagnostic comparing gradient nuclear/Frobenius ratio with activation stable rank to predict when a Muon-style spectral step should beat a Euclidean gradient step.
2025-10-07	NorMuon Code	Combines Muon orthogonalization with neuron-wise normalization and second-order statistics, targeting better scalability and efficiency by jointly leveraging Adam-like adaptivity and Muon matrix geometry.
2025-10-06	DP-Adam-AC	Adds adaptive clipping to differentially private Adam fine-tuning for localizable LLMs, improving the privacy-utility trade-off when task-specific models must be fine-tuned under privacy constraints.
2025-10-04	Hill-ADAM	Augments Adam with deterministic state-space exploration through alternating minimization/maximization phases, aiming to escape local minima rather than settling at the first basin visited.
2025-09-29	Conda	Combines Adam coordinate adaptivity with column-normalized updates, addressing Adam's low-rank/spectrally poor update structure while retaining per-coordinate variance scaling for faster LLM training.
2025-09-19	AdaGrad++	Proposes a simpler parameter-free AdaGrad variant with convergence guarantees, removing manual learning-rate tuning while preserving AdaGrad-like rates in convex optimization.
2025-09-19	Adam++	Proposes a parameter-free Adam variant with convergence guarantees that match Adam-style rates without assuming preselected learning-rate conditions, reducing manual tuning burden.
2025-09-03	KL-Shampoo / KL-SOAP Code	Recasts Shampoo/SOAP second-moment estimation as KL-divergence covariance estimation rather than Frobenius fitting, producing KL preconditioners that reduce Adam grafting or Adam-in-eigenbasis overhead.
2025-09-01	ZO Fine-Tuner	Learns a compact perturbation strategy for zeroth-order LLM fine-tuning, training the optimizer once per foundation model and reusing it across downstream tasks to beat hand-designed ZO baselines.
2025-09	SRON	Uses row-wise gradient normalization for state-free LLM training to reduce optimizer-state overhead while stabilizing matrix updates.
2025-07-15	AdaMuon	Applies element-wise second-moment scaling to Muon's orthogonalized directions, with sign-stabilized momentum and RMS-aligned rescaling to combine variance adaptivity with stable matrix geometry.
2025-06-20	SCALE Code	Finds column-wise gradient normalization plus last-layer momentum are sufficient minimalist changes to SGD for competitive LLM pretraining, matching Adam with much lower optimizer-state memory.
2025-06-08	SPlus Code	Stabilizes Shampoo-style whitening by combining historical eigenbases with instantaneous normalization and shape-aware scaling, reducing divergence from stale inverse caches and improving wall-clock efficiency.
2025-05-27	PolarGrad	Unifies matrix-aware preconditioned optimizers and introduces polar-decomposition update rules that explain links among Adam/Shampoo/Muon-style methods while improving convergence in reported experiments.
2025-05-19	Gluon	Bridges theory and practice for LMO-based optimizers by generalizing Muon and Scion into a layer-wise framework with better memory efficiency, hyperparameter transfer, and LLM training performance.
2025-04-07	Dion Code	Replaces Muon's dense Newton-Schulz orthogonalization with distributed amortized power iteration plus error feedback, enabling low-rank orthonormalized updates compatible with sharded LLM training.
2025-02-24	COSMOS Code	Splits matrix-gradient subspaces between SOAP and Muon: it applies richer adaptive preconditioning to leading eigendirections and cheaper Muon-style updates to the remaining space for memory-efficient LLM training.
2025-02-24	D-Muon Code	Scales Muon to larger LLM training by adding weight decay and per-parameter update-scale adjustment, reporting about 2x compute efficiency over AdamW under compute-optimal scaling and releasing the Moonlight recipe.
2025-02-11	Scion Code	Introduces stochastic LMO-over-norm-ball optimizers for unconstrained deep learning, choosing norms that improve memory efficiency and hyperparameter transfer while giving nanoGPT speedups.
2024-12-08	Muon Code	Defines Muon as SGD/Nesterov momentum followed by Newton-Schulz orthogonalization of 2D hidden-layer updates, usually paired with AdamW for embeddings, heads, and non-matrix parameters.
2024-12-02	PROFIT	Designs an optimizer specifically for deep fine-tuning converged models, using temporal gradient orthogonalization and assumptions about pretrained weights to regularize task adaptation beyond generic SGD/Adam.
2024-11-25	Cautious Optimizers Code	Adds a one-line cautious mask to momentum optimizers, preserving Adam-style Lyapunov convergence while empirically improving transformer training for C-AdamW, C-Lion, and related variants.
2024-11-15	MARS Code	Brings variance-reduction ideas into large-model optimization by correcting adaptive/sign optimizer updates with gradient-difference signals, producing MARS variants that improve GPT-style training efficiency.
2024-11-11	Subset-Norm + Subspace-Momentum Code	Combines Subset-Norm step sharing, which reduces AdaGrad memory from O(d) to O(sqrt(d)), with Subspace-Momentum to lower optimizer-state cost while retaining convergence guarantees.
2024-11-05	ADOPT Code	Modifies Adam so it converges at the optimal O(1/sqrt(T)) rate for any beta2 without bounded-gradient-noise assumptions, addressing Adam non-convergence while keeping adaptive-gradient practicality.
2024-10-25	COAT Code	Compresses optimizer states and activations into FP8 training through dynamic range expansion and related memory techniques, reducing memory footprint beyond FP8 linear-layer-only frameworks.
2024-10-21	LDAdam Code	Performs Adam-style adaptivity in changing low-dimensional projected subspaces, using projection-aware state updates and generalized error feedback to lower memory while still exploring full parameter space.
2024-09-17	SOAP Code	Shows Shampoo with half-power preconditioning is equivalent to Adafactor in Shampoo's eigenbasis, then stabilizes and simplifies Shampoo by combining it with Adam-style moment updates.
2024-09-05	AdEMAMix Code	Adds a second, slower EMA of older gradients to AdamW-style momentum so the optimizer can use both recent directions and longer-term gradient memory, improving token efficiency.
2024-06-24	Adam-mini Code	Cuts Adam memory by replacing most per-parameter second-moment learning rates with block-level rates chosen from Hessian-structure principles, retaining AdamW-like performance at about half memory.
2024-05-24	SF-AdamW (Schedule-Free) Code	Removes explicit learning-rate schedules by using schedule-free momentum/iterate-averaging theory, matching scheduled training performance without knowing the stopping step in advance.
2024-05-24	MicroAdam Code	Compresses gradients before feeding them into Adam states and uses compressed error feedback, reducing optimizer-state memory while preserving convergence guarantees.
2024-05-21	FAdam Code	Interprets Adam as a diagonal empirical-Fisher natural-gradient method, identifies approximation flaws in standard Adam, and proposes corrections to momentum, bias correction, and epsilon handling.
2023-12-04	AGD Code	Builds an auto-switchable preconditioner from stepwise gradient differences, using Hessian-related diagonal information to choose between adaptive and non-adaptive behavior during training.
2023-10-16	AdaLOMO Code	Adds adaptive learning rates to low-memory full-parameter LLM fine-tuning, keeping LOMO-style memory savings while closing much of the convergence gap to AdamW.
2023-09-05	AdaPlus Code	Combines AdamW, NAdam, and AdaBelief ideas by adding Nesterov momentum and precise step-size adjustment without extra hyperparameters, improving language-model and vision training baselines.
2023-07-28	CoRe Code	Benchmarks Continual Resilient optimization as a robust all-in-one first-order optimizer, emphasizing smooth convergence, low tuning burden, and broad applicability across machine-learning tasks.
2023-07-18	Adam+CM Code	Augments Adam with a memory buffer of critical momentum terms that intentionally overshoot narrow minima, encouraging exploration toward flatter basins and better generalization.
2023-07-05	CAME Code	Uses confidence-guided memory-efficient second-moment estimation to reduce instability in Adafactor-like optimizers, aiming for Adam/LAMB quality with much lower auxiliary memory.
2023-06-16	LOMO Code	Fuses gradient computation and parameter updates so full-parameter LLM fine-tuning can run under limited GPU memory, avoiding the optimizer-state cost of AdamW-style training.
2023-06-09	Prodigy Code	Improves D-Adaptation by estimating the distance-to-solution parameter needed for optimal learning rates, yielding a parameter-free optimizer that matches tuned methods across many tasks.
2023-05-25	WSAM Code	Recasts SAM sharpness as an explicit weighted regularization term, deriving generalization bounds and improving or matching standard optimizers on benchmark datasets.
2023-05-25	DoWG Code	Introduces Distance-over-Weighted-Gradients, a parameter-free optimizer that adapts to smooth and nonsmooth convex problems using a distance-weighted gradient accumulator.
2023-05-23	Sophia Code	Uses a lightweight diagonal Hessian estimate as a stochastic second-order preconditioner with clipping, reducing language-model pretraining cost relative to Adam-family methods.
2023-05-09	UAdam	Provides a unified Adam-type framework covering Adam, NAdam, AMSGrad, AdaBound, AdaFom, and Adan through a general second-moment form with nonconvex stochastic convergence analysis.
2023-02-16	FOSI Code	Wraps any first-order optimizer with low-dimensional second-order corrections by splitting the objective into orthogonal subspaces, using curvature where useful and the base optimizer elsewhere.
2023-02-13	Lion Code	Discovers a memory-efficient sign-momentum optimizer through symbolic program search, keeping only momentum state and using update signs rather than Adam-style second moments.
2023-02-08	DoG Code	Sets SGD step sizes dynamically from distance-to-initial-point and gradient norms, removing the learning-rate hyperparameter while matching tuned SGD on vision and language transfer tasks.
2023-01-18	D-Adaptation Code	Automatically estimates the learning-rate scale for SGD/Adam/AdaGrad variants without line searches or extra gradients, matching hand-tuned rates across diverse convex and ML experiments.
2022-11-17	VeLO Code	Meta-trains a neural-network optimizer at large scale across many tasks, producing a versatile learned optimizer that transfers with little tuning and exhibits non-hand-designed update behavior.
2022-10-21	Amos Code	Adds model-oriented adaptive learning-rate decay and weight decay to Adam-style optimization, improving BERT/T5 pretraining speed while using less slot-variable memory than AdamW.
2022-10-12	AdaNorm Code	Corrects each iteration's gradient norm using adaptive history so SGD/momentum-style optimizers maintain representative update magnitudes and improve CNN convergence.
2022-08-13	Adan Code	Derives an adaptive Nesterov momentum estimator that avoids extra extrapolation gradients and estimates first/second moments for faster, robust deep-model training.
2022-06-14	GradaGrad	Modifies AdaGrad's monotonically shrinking denominator so the effective learning rate can both grow and shrink, preserving similar convergence rates while reducing tuning sensitivity.
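
Many rows above build on the Muon recipe (2024-12-08 entry): momentum followed by Newton-Schulz orthogonalization of a 2D update. The sketch below illustrates that idea in dependency-free Python. It is an assumption-laden illustration, not the reference implementation: it uses the classic cubic Newton-Schulz iteration rather than Muon's tuned polynomial, omits Nesterov momentum and shape-dependent scaling, and the names `matmul`, `transpose`, `newton_schulz_orthogonalize`, and `muon_like_step` as well as the default `lr`, `beta`, and `steps` values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the polar factor U V^T of g with the classic cubic iteration."""
    # Frobenius normalization puts every singular value in [0, 1], inside the
    # convergence region (0, sqrt(3)) of the cubic iteration.
    # The `or 1.0` guard leaves an all-zero matrix unchanged.
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes each nonzero singular value toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rb)] for rx, rb in zip(x, xxt_x)]
    return x


def muon_like_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One simplified momentum-then-orthogonalize update for a 2D weight matrix."""
    # Plain heavy-ball accumulation (assumption: no Nesterov correction here).
    momentum = [[beta * m + gr for m, gr in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    update = newton_schulz_orthogonalize(momentum)
    w = [[wi - lr * u for wi, u in zip(rw, ru)] for rw, ru in zip(w, update)]
    return w, momentum


if __name__ == "__main__":
    grad = [[2.0, 1.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    weights, buf = muon_like_step(zeros, grad, zeros)
    print(weights)
```

Because the orthogonalized update has all singular values near 1, the step size is set by `lr` alone rather than by the gradient's scale, which is the property many of the Muon variants listed above modify or accelerate.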

Weakly Related / Adjacent Muon Papers

These papers mention, compare, or rely on Muon-style optimization, but their main contribution is broader than a standalone Muon-family optimizer.

CSV: data/muon_weakly_related.csv

Date	Paper	Relation
2026-06-14	Schattor	Introduces a Schatten-norm optimizer family that unifies SGD and Muon-like matrix updates with dimension-free stationarity theory; it is adjacent because the contribution is a broader optimizer framework, not a Muon variant.
2026-06-13	When to use Schatten-p Norm	Analyzes which Schatten-p geometries are optimal under different scaling/noise regimes, explaining when Muon-like Schatten-infinity updates help and when smaller-p geometries are preferable.
2026-06-12	ZO Parameter-free LMO	Combines zeroth-order fine-tuning with parameter-free LMO methods to reduce memory and tuning burden, borrowing Muon-style geometry but focusing on ZO/PF optimization broadly.
2026-06-12	Zeta	Diagnoses coordinate-scale heterogeneity in matrices before Newton-Schulz, then proposes coordinate-adaptive dual whitening; Muon appears mainly as a matrix-aware baseline and motivation.
2026-06-11	Different Layers Different Manifolds	Tests module-wise manifold assignments for GPT-2 training, finding Stiefel constraints suit attention while DGram suits MLPs; Muon is used through a Manifold Muon lens rather than modified directly.
2026-06-09	Overcoming Rank Collapse in Feedback Alignment	Studies why feedback alignment fails in deeper networks and uses orthogonalized/Muon-style updates to increase feedback-signal rank, but the target is biologically plausible learning, not optimizer design.
2026-06-04	PC Layer	Adds a polynomial weight-preconditioning layer that reshapes singular spectra during LLM pretraining and can be merged away at inference; Muon is one optimizer tested with the architectural preconditioner.
2026-06-04	Double Preconditioning	Optimizes for rollout/test-time performance under feedback mismatch rather than validation loss, treating Muon as one possible gradient-wise preconditioner inside a broader DoPr framework.
2026-06-02	Ultralytics YOLO26	Presents a real-time vision model family with NMS-free heads and training changes; Muon-SGD appears as part of the recipe, but the main contribution is architecture/system design.
2026-06-01	WALL-WM	Builds event-grounded world-action-model pretraining for VLA/video-action learning; Muon is part of the large-scale training stack rather than the object of study.
2026-05-30	Exploiting Weight-Space Symmetries for Approximating Curvature	Uses weight-space symmetry averaging to construct tractable curvature approximations from single gradients, recovering Shampoo/Muon-like structures as cases of a broader symmetry framework.
2026-05-29	Mellum2 Technical Report	Reports a 12B MoE code/general model and mentions Muon in the FP8 hybrid-precision training recipe; it is model-report evidence of Muon use, not an optimizer contribution.
2026-05-28	On the Optimizer Dependence of Neural Scaling Laws	Shows scaling-law exponents vary with optimizer choice in random-feature regression and related settings, using Matrix-Sign/Muon-like methods as one optimizer family under comparison.
2026-05-27	Parallax	Introduces parameterized local linear attention for language modeling and uses Muon to stabilize/scale training; the optimizer is enabling infrastructure, not the paper's main method.
2026-05-26	How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks	Compares Adam and Muon across equivariant/geometric models, finding Muon often improves optimization; it analyzes optimizer effects rather than proposing a new optimizer.
2026-05-26	The Stability of Singular Distribution	Identifies early stabilization of trace-normalized singular spectra during LLM pretraining across schedules and optimizers including Muon, connecting spectral dynamics to two-phase loss curves.
2026-05-23	Momentum Streams for Optimizer-Inspired Transformers	Interprets Transformer residual updates as optimizer steps and designs architectures inspired by momentum, Adam, Muon, and SOAP; the contribution is architectural, not an optimizer update rule.
2026-05-20	Same Architecture, Different Capacity	Measures FFN representation spectra and shows AdamW and Muon produce different spectral scaling laws at fixed architecture, framing optimizer choice as a capacity-scaling axis.
2026-05-18	Scale-Invariant Neural Network Optimization	Develops theory for scale-invariant optimizers under norm geometry and heavy-tailed noise, including Muon and Scion as examples within a broader optimization principle.
2026-05-09	Navigating LLM Valley	Surveys LLM optimizer design from AdamW through memory-efficient and matrix-based methods, positioning Muon within the broader optimizer landscape rather than introducing a method.
2026-05-07	Optimizer-Model Consistency	Shows full fine-tuning with the same optimizer used in pretraining can reduce forgetting, comparing Muon and AdamW behavior as evidence for optimizer-shaped model states.
2026-04-16	Benchmarking Optimizers for MLPs in Tabular Deep Learning	Benchmarks 15 optimizers for tabular MLPs and finds Muon strong against AdamW; it is empirical optimizer evaluation rather than new algorithm design.
2026-04-02	Normalization-Optimizer Coupling	Tests normalization-layer choices with AdamW and Muon at 1B scale, showing Dynamic Erf interacts badly with Muon while RMSNorm/DyT behave differently.
2026-03-30	Spectral Edge Dynamics	Uses rolling-window spectra of parameter updates to analyze phase transitions such as grokking and loss plateaus; optimizer spectra are diagnostic tools, not new update rules.
2026-02-25	veScale-FSDP	Improves FSDP/ZeRO infrastructure for block-structured computations and non-element-wise optimizers such as Shampoo and Muon, focusing on sharding/runtime support.
2026-02-07	Robust Scaling Laws for Optimizers	Studies Chinchilla-style and optimizer-specific scaling laws across AdamW, Muon, Shampoo, SOAP, and others, asking how optimizer choice changes compute/data/model scaling.
2026-01-31	Data Distribution as an Optimizer Lever	Analyzes whether changing training data distribution can steer optimizer generalization behavior, comparing GD and SAM; it is adjacent to optimizer choice but not Muon-specific.
2026-01-14	Muon-Optimized Distillation and Quantization	Combines GPTQ quantization, LoRA, distillation, and Muon-based fine-tuning in a deployment pipeline for compressed LLMs, using Muon as a component rather than proposing it.
2026-01-08	Learnable Multipliers	Adds learnable scalar multipliers to matrix layers to escape weight-decay/noise equilibrium norms, validating the scaling idea under Adam and Muon.
2025-12-16	Optimizing Rank for INRs	Argues vanilla MLP INRs fail at high frequencies due to stable-rank degradation and uses Muon-like high-rank updates/rank regularization to preserve representation rank.
