Thank you for sharing this exciting work! I particularly agree with your approach to combining quality and speedy modes.
In XDLM, we have also explored combining T2T and M2T via a stationary noise kernel and observed the same phenomena of unmasking, refinement, and remasking.
The left part of the image below shows how XDLM combines the noise kernels of UDLM (u) and MDLM (m) to achieve a favorable trade-off between the two methods. [NORMAL] denotes standard tokens, while [MASK] represents the mask token.
The right part illustrates the trade-off between understanding capability (zero-shot perplexity) and generation capability (generation perplexity at 32 sampling steps). The proposed XDLM with a mixing ratio of 0.1 achieves the optimal balance, labeled as the Sweet Spot.
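As a rough illustration of what mixing the two kernels means (a minimal sketch under our own simplifying assumptions, not the code in the XDLM repository): assume the stationary distribution is a convex combination of a uniform distribution over normal tokens and a point mass on [MASK], and that the forward marginal interpolates between the clean token and that stationary distribution. The function names, the symbol `k`, and the choice that `k` weights the uniform component are hypothetical and used here only for illustration.

```python
def mixed_stationary_dist(vocab_size, mask_id, k):
    """Hypothetical stationary distribution: k * uniform(normal tokens) + (1 - k) * delta([MASK]).

    Assumption: k = 0 recovers a pure-mask (MDLM-like) kernel and
    k = 1 a pure-uniform (UDLM-like) kernel.
    """
    n_normal = vocab_size - 1  # every token except [MASK]
    return [1.0 - k if i == mask_id else k / n_normal for i in range(vocab_size)]


def forward_marginal(x0, alpha_t, pi):
    """Assumed marginal q(x_t | x_0) = alpha_t * onehot(x0) + (1 - alpha_t) * pi."""
    return [
        alpha_t * (1.0 if i == x0 else 0.0) + (1.0 - alpha_t) * p
        for i, p in enumerate(pi)
    ]


# Toy vocabulary of 4 normal tokens plus [MASK] at index 4.
pi = mixed_stationary_dist(vocab_size=5, mask_id=4, k=0.1)
q = forward_marginal(x0=2, alpha_t=0.7, pi=pi)
```

In this toy setup a noised token can land on [MASK] or on another normal token, which is what makes unmasking, refinement, and remasking all possible in the reverse process.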
With the unified stationary noise kernel, we derived the posterior probability and KL divergence along with its limiting case:
The figure below shows part of the step-wise evolution of a generated sequence (T = [8/32]). XDLM exhibits three distinct transition dynamics inherent to the hybrid noise process: Green marks new tokens generated from masks (unmasking); Blue marks lexical refinement, where one token is replaced by another; and Red marks re-masking, where previously generated tokens are rejected and reverted to [MASK].
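To make the three cases concrete, here is a small sketch that labels each position by comparing two consecutive sampling steps. The helper names and the string labels are hypothetical and chosen only for this example; they are not taken from the XDLM code.

```python
MASK = "[MASK]"


def classify_transition(prev_tok, next_tok, mask=MASK):
    """Label one position's change between two consecutive sampling steps."""
    if prev_tok == next_tok:
        return "unchanged"
    if prev_tok == mask:
        return "unmask"  # Green: new token generated from a mask
    if next_tok == mask:
        return "remask"  # Red: generated token rejected and reverted to [MASK]
    return "refine"      # Blue: one normal token replaced by another


def classify_step(prev_seq, next_seq):
    """Apply classify_transition position-wise to two equal-length sequences."""
    return [classify_transition(p, n) for p, n in zip(prev_seq, next_seq)]


labels = classify_step(
    [MASK, "cat", "sat", MASK],
    ["the", "dog", MASK, MASK],
)
# labels == ["unmask", "refine", "remask", "unchanged"]
```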
This phenomenon is also observed in image generation:
When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 on MBPP in just 32 steps, effectively doubling the baseline performance. The results below show LLaDA-XDLM with a sampling budget of 32. Evaluation of adapting LLaDA-8B to our XDLM formulation (LLaDA-XDLM): (a) LLaDA-XDLM consistently outperforms baselines across diverse benchmarks with 32 sampling steps; (b) improvements are particularly pronounced in code generation (MBPP), where the model substantially reduces generation failures.
Please refer to https://github.com/MzeroMiko/XDLM and https://github.com/MzeroMiko/LLaDA-XDLM for details.