Congratulations and introducing a similar work: XDLM

Thank you for sharing this exciting work! I particularly agree with your approach to combining `quality` and `speedy` modes.

In `XDLM`, we have also explored combining `T2T` and `M2T` via a `stationary noise kernel` and observed the same phenomena of `unmasking`, `refinement`, and `remasking`.

The left part of the image below shows how `XDLM` combines the noise kernels of `UDLM (u)` and `MDLM (m)` to achieve a favorable trade-off between the two methods. `[NORMAL]` denotes standard tokens, while `[MASK]` represents the mask token.
The right part illustrates the trade-off between understanding capability (zero-shot perplexity) and generation capability (generation perplexity at 32 sampling steps). `XDLM` with a mixing ratio of 0.1 achieves the best balance between the two, labeled as the `Sweet Spot`.
<div align=center>
<img src="https://github.com/user-attachments/assets/a8bfaab0-a07b-45c2-9f73-dcb13e739558" width="80%" />
</div>
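
To make the kernel mixing concrete, here is a minimal NumPy sketch; it is not taken from the XDLM repository. It assumes the stationary noise distribution is a convex combination of a uniform distribution over normal tokens and a point mass on `[MASK]`, and that the forward marginal has the common form `q(x_t | x_0) = alpha_t * onehot(x_0) + (1 - alpha_t) * pi`. The names `stationary_noise`, `forward_marginal`, `k`, and `alpha_t` are hypothetical, and which component the mixing ratio weights is an assumption made only for this illustration.

```python
import numpy as np


def stationary_noise(vocab_size: int, mask_id: int, k: float) -> np.ndarray:
    """Mix a uniform distribution over normal tokens with a point mass on [MASK].

    Assumption: `k` is the weight on the uniform (UDLM-like) component and
    `1 - k` is the weight on the mask (MDLM-like) component.
    """
    uniform = np.full(vocab_size, 1.0 / (vocab_size - 1))
    uniform[mask_id] = 0.0  # the uniform part covers normal tokens only
    mask = np.zeros(vocab_size)
    mask[mask_id] = 1.0
    return k * uniform + (1.0 - k) * mask


def forward_marginal(x0: int, alpha_t: float, pi: np.ndarray) -> np.ndarray:
    """q(x_t | x_0): keep x0 with weight alpha_t, otherwise draw from pi."""
    probs = (1.0 - alpha_t) * pi  # new array, so pi itself is not modified
    probs[x0] += alpha_t
    return probs


# Toy vocabulary: tokens 0..3 are normal, token 4 plays the role of [MASK].
pi = stationary_noise(vocab_size=5, mask_id=4, k=0.1)
q_t = forward_marginal(x0=2, alpha_t=0.5, pi=pi)
```

Under these assumptions, `k = 0` gives a pure mask kernel and `k = 1` gives a pure uniform kernel over normal tokens, with intermediate values interpolating between the two.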

With the unified `stationary noise kernel`, we derived the `posterior probability` and `KL divergence` along with its `limiting case`:
<div align=center>
<img src="https://github.com/user-attachments/assets/4d637d9f-c1f8-4be4-ac7b-57f603b97db0" width="100%" />
</div>
<div align=center>
<img src="https://github.com/user-attachments/assets/9286cbfa-21d4-4bfe-9a31-756158d95b5a" width="100%" />
</div>
<div align=center>
<img src="https://github.com/user-attachments/assets/40dc2182-9b38-4a52-a002-807b3d8e5e62" width="100%" />
</div>
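
As a rough numerical companion to the derivation, the sketch below computes a posterior `q(x_s | x_t, x_0)` by applying Bayes' rule directly. It is a generic construction for any stationary noise distribution `pi`, not the closed form shown in the images above, and it assumes the marginal `q(x_t | x_0) = alpha_t * onehot(x_0) + (1 - alpha_t) * pi` with step transition weight `alpha_t / alpha_s`. The function name `posterior` and all parameter names are hypothetical.

```python
import numpy as np


def posterior(x0: int, xt: int, alpha_s: float, alpha_t: float,
              pi: np.ndarray) -> np.ndarray:
    """q(x_s | x_t, x_0) for s < t, proportional to q(x_t | x_s) * q(x_s | x_0).

    Assumes alpha_t <= alpha_s (more noise at the later time t).
    """
    a_ts = alpha_t / alpha_s  # keep-probability for the step s -> t

    # Likelihood q(x_t = xt | x_s = j) for every candidate j.
    lik = np.full(len(pi), (1.0 - a_ts) * pi[xt])
    lik[xt] += a_ts

    # Prior q(x_s = j | x_0) for every candidate j.
    prior = (1.0 - alpha_s) * pi
    prior[x0] += alpha_s

    unnorm = lik * prior
    total = unnorm.sum()  # equals q(x_t = xt | x_0)
    if total <= 0.0:
        raise ValueError("x_t has zero probability under the forward process")
    return unnorm / total


# Toy example: tokens 0..3 are normal, token 4 plays the role of [MASK].
pi = np.array([0.025, 0.025, 0.025, 0.025, 0.9])
post = posterior(x0=2, xt=4, alpha_s=0.8, alpha_t=0.4, pi=pi)
```

In this toy setting, whenever `pi` puts non-zero mass on both `[MASK]` and normal tokens, the posterior also assigns non-zero probability to `[MASK]` when `x_t` is a normal token, which is at least consistent with the remasking behavior described in this post.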

The figure below shows part of the step-wise evolution of a generated sequence (T = [8/32]). `XDLM` exhibits three distinct transition dynamics inherent to the hybrid noise process: `Green` marks new tokens generated from masks (unmasking); `Blue` marks lexical refinement, where an existing token is replaced by another; and `Red` marks the re-masking operation, where previously generated tokens are rejected and reverted to `[MASK]`.
<div align=center>
<img  src="https://github.com/user-attachments/assets/f6b14be7-6bb0-41b2-b78f-dd31e9381a26" width="80%"/>
</div>
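
The three transition types can be read off mechanically by comparing two consecutive sampling steps. The helper below is a small illustrative sketch, not code from the XDLM repository; the function name `classify_transitions`, the label strings, and the toy token ids are all hypothetical.

```python
from typing import List


def classify_transitions(prev: List[int], curr: List[int],
                         mask_id: int) -> List[str]:
    """Label each position by how its token changed between two steps."""
    labels = []
    for p, c in zip(prev, curr):
        if p == mask_id and c != mask_id:
            labels.append("unmask")   # green: new token generated from a mask
        elif p != mask_id and c == mask_id:
            labels.append("remask")   # red: token rejected, back to [MASK]
        elif p != c:
            labels.append("refine")   # blue: one normal token replaces another
        else:
            labels.append("keep")     # unchanged (still masked or same token)
    return labels


# Toy example with token id 9 standing in for [MASK].
labels = classify_transitions(prev=[9, 3, 5, 9, 7],
                              curr=[4, 3, 6, 9, 9],
                              mask_id=9)
```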

This phenomenon is also observed in image generation:
<div align=center>
<img src="https://github.com/user-attachments/assets/fb6a856b-4df7-4073-bb3d-27f5a5bdb23f" width="80%" />
</div>


When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 on MBPP in just 32 steps, effectively doubling the baseline performance. The figure below evaluates LLaDA-XDLM, obtained by adapting LLaDA-8B to our XDLM formulation, with a sampling budget of 32 steps: (a) LLaDA-XDLM consistently outperforms baselines across diverse benchmarks; (b) improvements are particularly pronounced in code generation (MBPP), where the model substantially reduces generation failures.
<div align=center>
<img  src="https://github.com/user-attachments/assets/2ed76823-6f93-458f-868b-c23f610db3d1" width="80%" />
</div>

Please refer to [https://github.com/MzeroMiko/XDLM](https://github.com/MzeroMiko/XDLM) and [https://github.com/MzeroMiko/LLaDA-XDLM](https://github.com/MzeroMiko/LLaDA-XDLM) for details.


