Fix MetaX vLLM 0.23 MTP compatibility by xzh25 · Pull Request #294 · MetaX-MACA/vLLM-metax

xzh25 · 2026-06-22T07:38:24Z

Summary

Fix MetaX plugin startup for vLLM 0.23 Qwen3.5/Qwen3 MTP speculative decoding by installing early compatibility hooks before plugin-specific patches and vLLM compilation/fusion modules are imported.

Root cause

The vLLM 0.23 MTP path imports quantization/fusion modules early, including vllm._custom_ops and vllm.compilation.passes.fusion.act_quant_fusion. On the current MetaX runtime those imports can reference _C::scaled_fp4_quant.out, _C::silu_and_mul_per_block_quant, and newer torch.accelerator helpers before the MetaX plugin has registered compatible shims. In spawned EngineCore workers this prevents Qwen3.5 MTP from reaching generation.

Changes

Add vllm_metax.compat and import it at package import time.
Register fragment schemas for the missing import-time _C custom ops.
Backfill missing torch.accelerator helpers from torch.cuda when that module exists.
Keep the legacy triton_support import path loading the same compatibility hooks.
Add focused regression tests covering act_quant_fusion import and import safety when torch.cuda is absent.

Before this fix

On the same MetaX validation environment, MTP could fail before generation:

079_qwen35_9b_mtp261_smoke.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt: AttributeError: module 'torch.accelerator' has no attribute 'empty_cache'
088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist

Validation

Remote MetaX C500 environment:

Driver: 3.8.30
MACA SDK: 3.5.3.20
Python: 3.10.10
vLLM: 0.23.0
torch: 2.8.0+metax3.5.3.9

Checks:

python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 2 passed
import vllm_metax.quant_config; import vllm.compilation.passes.fusion.act_quant_fusion -> ok
python -m py_compile ... -> ok
git diff --check -> ok

Validation matrix using the same prompt and max_tokens=96:

Case	MTP	Load s	Generate s	Output tok/s	Repeated bigram ratio	Spec decode
Qwen3.5-9B	no	90.289	3.995	24.0311	0.0125	n/a
Qwen3.5-9B	yes	93.406	3.352	28.6370	0.0125	drafts=45, draft_tokens=90, accepted=50, per_pos=[30, 20]
Qwen3.5-27B-W8A8	no	99.703	9.338	10.2801	0.0444	n/a
Qwen3.5-27B-W8A8	yes	102.532	4.434	21.6529	0.0000	drafts=35, draft_tokens=70, accepted=60, per_pos=[31, 29]

Both no-MTP and MTP paths generated normal output. The MTP cases also reported active spec-decode metrics, confirming the qwen3_next_mtp/mtp path was exercised rather than falling back to regular decoding.

Logs are archived on the validation machine under /data/vllm-metax-contribution/ops/logs/079_*, 084_*, 088_*, 092_*, and 093_*.

Notes:

The available 30B-A3B model configs on the validation machine do not contain MTP metadata, so they were not used as Qwen3.5 MTP substitutes.
Qwen3.5-27B-W8A8 has a TokenizersBackend tokenizer config that this Transformers environment cannot instantiate directly; the 27B smoke/validation runs use the local Qwen3.5-9B tokenizer workaround.

Additional 35BA3B GPTQ validation

After the original import compatibility fix, I also downloaded and tested a ModelScope 35BA3B GPTQ checkpoint because the validation machine did not initially have a runnable 35BA3B MTP model:

Downloaded: potter001/gptq-Qwen3.6-35B-A3B-4bit-group to /data/vllm-metax-contribution/models/gptq-Qwen3.6-35B-A3B-4bit-group
Model metadata: qwen3_5_moe, mtp_num_hidden_layers=1, quant_method=gptq, 7 safetensors shards, 19.27 GiB after removing Git LFS cache
Also checked but did not use for final validation:
- official FP8: MetaX rejects it with "fp8 quantization is currently not supported in maca"
- Eco-Tech W8A8: Ascend/msmodelslim format lacks vLLM quantization config and OOMs as unquantized MoE
- FlagRelease Metax BF16: ~71.9 GiB and documented for TP=2, not usable on the single C500 validation setup

The GPTQ 35BA3B path exposed one additional vLLM 0.23 compatibility issue: MoeWNA16Method.apply() read self.moe.disable_inplace, but the current FusedMoEConfig does not always define that field. This PR now falls back to False when the field is absent and adds a focused regression test.
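The fallback described above can be sketched as follows. This is only an illustration: the two config classes and `resolve_inplace` are hypothetical stand-ins, not the real `FusedMoEConfig` or `MoeWNA16Method.apply()`. Only the `getattr(..., False)` default mirrors what the PR describes.

```python
from dataclasses import dataclass


@dataclass
class ConfigWithoutField:
    """Hypothetical stand-in for a config that does not define disable_inplace."""
    num_experts: int = 8


@dataclass
class ConfigWithField:
    """Hypothetical stand-in for a config that does define disable_inplace."""
    num_experts: int = 8
    disable_inplace: bool = True


def resolve_inplace(moe_config) -> bool:
    """Return True when the MoE kernel may run in place."""
    # Fall back to False when the field is absent, which keeps the
    # existing in-place behavior instead of raising AttributeError.
    disable_inplace = getattr(moe_config, "disable_inplace", False)
    return not disable_inplace


inplace_when_missing = resolve_inplace(ConfigWithoutField())
inplace_when_disabled = resolve_inplace(ConfigWithField())
```

Reading the attribute with a default keeps the call site working against both config shapes without a version check.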

35BA3B GPTQ + MTP validation after that fix:

python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py: 3 passed
Smoke log: /data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
- loaded target + MTP drafter, LOAD_S 130.03
- generated 32 tokens, drafts=31, draft_tokens=62, accepted=0
Issue-like long check: /data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt
- generated 128 tokens, OUTPUT_TOKENS_PER_S 9.2379
- repetition ratios: bigram 0.0806, trigram 0.0492, 4-gram 0.0167
- spec-decode metrics: drafts=127, draft_tokens=254, accepted=0, accepted_tokens_per_pos=[0, 0]

The 35BA3B GPTQ output did not show the repeated corrupted loop reported in #261. The zero accepted-token count is called out explicitly because this GPTQ checkpoint plus tokenizer workaround exercises the MTP drafter path but does not demonstrate speedup.

Related to #261. This PR fixes MTP compatibility blockers found while reproducing #261. It does not claim to close the original repeated-output corruption without the exact original reproduction prompt and sampling parameters.

2026-07-02 dual-C500 FlagRelease 35BA3B validation

Validation environment:

2 x MetaX C500 64GB
MACA 3.5.3.20, driver 3.8.30
torch 2.8.0+metax3.5.3.9
vLLM 0.23.0
model: FlagRelease/Qwen3.6-35B-A3B-nomtp-metax-FlagOS
model weights: 21 safetensors, 71,903,776,768 bytes
tensor_parallel_size=2, max_model_len=2048, temperature=0.0, max_tokens=128

Before/after compatibility check:

Before this patch, the base ce08bf57 worktree fails while importing the MTP/triton-support path with RuntimeError: operator _C::scaled_fp4_quant.out does not exist.
After this patch, vllm_metax.patch.bugfix.triton_support.custom_op_schemas and vllm.compilation.passes.fusion.act_quant_fusion import successfully; _C::scaled_fp4_quant.out and _C::silu_and_mul_per_block_quant schemas are present.
Focused tests: python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py -> 3 passed.

35BA3B MTP/no-MTP comparison on the same prompt:

Case	MTP	num_speculative_tokens	output tok/s	repeat 2g	accepted
no-MTP	no	n/a	7.3670	0.0267	n/a
MTP	yes	2	12.8181	0.0267	79/98, per pos `[46, 33]`
MTP	yes	1	14.1913	0.0267	63/65, per pos `[63]`

The MTP runs initialized as SpeculativeConfig(method='mtp', num_spec_tokens=...), loaded the MTP drafter, emitted vllm:spec_decode_* metrics, and did not show the repeated corrupted-output loop in this validation prompt.

gemini-code-assist

Code Review

This pull request introduces early compatibility hooks in vllm_metax/compat.py to address runtime mismatches between vLLM and MetaX, including defining missing custom operator schemas and mapping torch.cuda attributes to torch.accelerator when applicable. It also adds a test to verify these compatibility imports. Feedback was provided to safely retrieve the torch.cuda module using getattr to prevent potential AttributeError exceptions in CPU-only or non-CUDA environments.


gemini-code-assist · 2026-06-22T07:39:37Z

+if hasattr(torch, "accelerator"):
+    for _name in (
+        "current_device",
+        "device_count",
+        "empty_cache",
+        "is_available",
+        "max_memory_allocated",
+        "mem_get_info",
+        "memory_allocated",
+        "memory_reserved",
+        "memory_stats",
+        "reset_peak_memory_stats",
+        "set_device",
+        "synchronize",
+    ):
+        if not hasattr(torch.accelerator, _name) and hasattr(torch.cuda, _name):
+            setattr(torch.accelerator, _name, getattr(torch.cuda, _name))
+    if (
+        not hasattr(torch.accelerator, "current_device_index")
+        and hasattr(torch.cuda, "current_device")
+    ):
+        torch.accelerator.current_device_index = torch.cuda.current_device


Directly accessing torch.cuda can raise an AttributeError in environments where CUDA is not available or not compiled (such as CPU-only environments or certain specialized builds). To prevent import-time failures, retrieve the cuda module safely using getattr(torch, "cuda", None) before checking or copying its attributes.

if hasattr(torch, "accelerator"):
    cuda_module = getattr(torch, "cuda", None)
    if cuda_module is not None:
        for _name in (
            "current_device",
            "device_count",
            "empty_cache",
            "is_available",
            "max_memory_allocated",
            "mem_get_info",
            "memory_allocated",
            "memory_reserved",
            "memory_stats",
            "reset_peak_memory_stats",
            "set_device",
            "synchronize",
        ):
            if not hasattr(torch.accelerator, _name) and hasattr(cuda_module, _name):
                setattr(torch.accelerator, _name, getattr(cuda_module, _name))
        if (
            not hasattr(torch.accelerator, "current_device_index")
            and hasattr(cuda_module, "current_device")
        ):
            torch.accelerator.current_device_index = cuda_module.current_device

Addressed in ff67715.

I changed the compat hook to use cuda_module = getattr(torch, "cuda", None) before copying CUDA helpers onto torch.accelerator, so importing vllm_metax.compat will not assume a torch.cuda attribute exists.
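For readers without a MetaX or CUDA build at hand, the guarded backfill can be exercised against plain stand-in objects. This is a minimal sketch under stated assumptions: `backfill_accelerator`, the shortened helper list, and the `SimpleNamespace` objects are hypothetical and do not reproduce `vllm_metax/compat.py` itself.

```python
from types import SimpleNamespace

# Subset of the helper names from the diff; the function and stand-in
# objects in this sketch are hypothetical, for illustration only.
_HELPERS = ("current_device", "device_count", "empty_cache", "synchronize")


def backfill_accelerator(torch_like):
    """Copy missing helpers from torch_like.cuda onto torch_like.accelerator."""
    accelerator = getattr(torch_like, "accelerator", None)
    cuda_module = getattr(torch_like, "cuda", None)
    if accelerator is None or cuda_module is None:
        return []  # nothing to patch, and no AttributeError on import
    copied = []
    for name in _HELPERS:
        # Never overwrite a helper the accelerator module already provides.
        if not hasattr(accelerator, name) and hasattr(cuda_module, name):
            setattr(accelerator, name, getattr(cuda_module, name))
            copied.append(name)
    # Alias current_device_index to cuda.current_device when it is missing.
    if not hasattr(accelerator, "current_device_index") and hasattr(
        cuda_module, "current_device"
    ):
        accelerator.current_device_index = cuda_module.current_device
        copied.append("current_device_index")
    return copied


# Stand-in for a build that has an accelerator module but lacks empty_cache.
fake_torch = SimpleNamespace(
    accelerator=SimpleNamespace(device_count=lambda: 1),
    cuda=SimpleNamespace(
        current_device=lambda: 0,
        device_count=lambda: 2,
        empty_cache=lambda: None,
    ),
)
copied = backfill_accelerator(fake_torch)

# Stand-in for a build with no cuda attribute at all.
no_cuda = SimpleNamespace(accelerator=SimpleNamespace())
copied_no_cuda = backfill_accelerator(no_cuda)
```

In the first case only the missing helpers are copied and the existing `device_count` is left alone. In the second case the function returns without touching anything.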

Validated on the MetaX C500 environment:

python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 1 passed

python -m py_compile vllm_metax/compat.py -> ok

git diff --check -> ok

xzh25 · 2026-06-22T08:19:02Z

Additional MetaX validation completed after moving this PR to ready-for-review.

Environment captured in /data/vllm-metax-contribution/ops/logs/092_collect_env_mtp261.txt:

GPU: MetaX C500
Driver: 3.8.30
MACA SDK: 3.5.3.20
Python: 3.10.10
vLLM: 0.23.0
torch: 2.8.0+metax3.5.3.9
vLLM-metax package baseline: 0.17.0+gd10261.d20260409.maca3.5.3.20.torch2.8

Validation matrix saved in /data/vllm-metax-contribution/ops/logs/093_mtp261_validation_matrix.txt using the same prompt and max_tokens=96:

Case	MTP	Load s	Generate s	Output tok/s	Repeated bigram ratio	Spec decode
Qwen3.5-9B	no	90.289	3.995	24.0311	0.0125	n/a
Qwen3.5-9B	yes	93.406	3.352	28.6370	0.0125	drafts=45, draft_tokens=90, accepted=50, per_pos=[30, 20]
Qwen3.5-27B-W8A8	no	99.703	9.338	10.2801	0.0444	n/a
Qwen3.5-27B-W8A8	yes	102.532	4.434	21.6529	0.0000	drafts=35, draft_tokens=70, accepted=60, per_pos=[31, 29]

Both no-MTP and MTP paths generated normal output. The MTP cases also reported active spec-decode metrics, which confirms the qwen3_next_mtp/mtp path was exercised rather than falling back to regular decoding.

Notes:

The available 30B-A3B model configs on this validation machine do not contain MTP metadata, so they were not used as substitutes for the Qwen3.5 MTP models.
Qwen3.5-27B-W8A8 has a TokenizersBackend tokenizer config that this Transformers environment cannot instantiate directly; the 27B smoke/validation runs use the local Qwen3.5-9B tokenizer workaround.

xzh25 · 2026-06-22T08:25:50Z

Added one more regression test in 0466e6f for the review feedback path:

test_compat_import_tolerates_missing_torch_cuda temporarily removes torch.cuda and reloads vllm_metax.compat, covering CPU-only/specialized torch builds where the attribute may be absent.
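The test body is not shown in this thread. A rough sketch of the same pattern, removing an attribute, running the import-time logic, then restoring it, might look like the following. `temporarily_without` and `import_hook` are hypothetical names, and the stand-in object replaces the real `torch` module. In pytest, `monkeypatch.delattr` combined with `importlib.reload` would be another way to get the same effect.

```python
import contextlib
from types import SimpleNamespace

_MISSING = object()


@contextlib.contextmanager
def temporarily_without(obj, name):
    """Remove obj.<name> for the duration of the block, then restore it."""
    saved = getattr(obj, name, _MISSING)
    if saved is not _MISSING:
        delattr(obj, name)
    try:
        yield
    finally:
        if saved is not _MISSING:
            setattr(obj, name, saved)


def import_hook(torch_like):
    """Hypothetical stand-in for the import-time logic under test."""
    cuda_module = getattr(torch_like, "cuda", None)
    return cuda_module is not None  # True when there is something to copy from


fake_torch = SimpleNamespace(cuda=SimpleNamespace())
with temporarily_without(fake_torch, "cuda"):
    patched_without_cuda = import_hook(fake_torch)  # must not raise
patched_with_cuda = import_hook(fake_torch)
```

Restoring the attribute in `finally` matters here because later tests in the same process would otherwise see the modified object.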

Revalidated on the MetaX environment:

python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 2 passed
python -m py_compile tests/patch/test_triton_custom_op_schemas.py vllm_metax/compat.py -> ok
git diff --check -> ok

I also updated the PR description with the pre-fix failure logs and the latest validation matrix.

xzh25 · 2026-06-22T08:29:44Z

I also ran a longer MTP output stress check to look specifically for the repeated/damaged-output symptom from #261.

Log: /data/vllm-metax-contribution/ops/logs/094_qwen35_27b_w8a8_mtp_long_output_check.txt

Configuration:

Model: Qwen3.5-27B-W8A8
Tokenizer workaround: local Qwen3.5-9B tokenizer
speculative_config={"method":"qwen3_next_mtp","num_speculative_tokens":2}
max_tokens=256

Result:

Generated 256 output tokens normally
Output throughput: 23.2837 tok/s
Spec decode metrics: drafts=99, draft_tokens=198, accepted=158, accepted_tokens_per_pos=[85, 73]
Repetition metrics: bigram repeat ratio 0.0853, trigram 0.0391, 4-gram 0.0157
Most repeated 4-gram was Genshin Impact Version 5.0 with count 3, which is attributable to the requested topic rather than a looped/damaged output pattern

No repeated-output loop was observed in this longer MTP run after the compatibility fix.
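The thread does not say how the repeat ratios are computed. One plausible definition, assumed here and not confirmed by the logs, is the fraction of n-gram occurrences that duplicate an earlier n-gram. The function names and sample strings below are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_ratio(tokens, n):
    """Assumed definition: 1 - (unique n-grams / total n-grams)."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def most_common_ngram(tokens, n):
    """Return (ngram, count) for the most frequent n-gram, or None."""
    grams = ngrams(tokens, n)
    return Counter(grams).most_common(1)[0] if grams else None


looping = "the cat the cat the cat the cat".split()
varied = "a quick brown fox jumps over the lazy dog".split()

looping_ratio = repeated_ngram_ratio(looping, 2)
varied_ratio = repeated_ngram_ratio(varied, 2)
top_bigram = most_common_ngram(looping, 2)
```

Under this definition a looped output pushes the ratio toward 1, while the values reported above (all below 0.1) would indicate mostly distinct n-grams.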

xzh25 · 2026-06-22T10:02:08Z

Added one more compatibility fix and validation pass for 35BA3B MTP.

New commit: 82fccb2 Handle missing MoE disable_inplace config

What changed:

MoeWNA16Method.apply() now tolerates vLLM 0.23 FusedMoEConfig objects that do not define disable_inplace, defaulting to the existing in-place behavior.
Added tests/patch/test_moe_wna16_compat.py to cover that compatibility case.

Validation on the MetaX C500 machine:

python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py -> 3 passed
Downloaded and tested potter001/gptq-Qwen3.6-35B-A3B-4bit-group from ModelScope because the machine did not initially have a runnable 35BA3B MTP model.
35BA3B GPTQ + MTP smoke after the fix: target + MTP drafter loaded, generated 32 tokens, spec metrics reported drafts=31, draft_tokens=62, accepted=0.
35BA3B GPTQ + MTP issue-like 128-token check: generated normally at 9.2379 tok/s; repeat ratios were bigram 0.0806, trigram 0.0492, 4-gram 0.0167; spec metrics reported drafts=127, draft_tokens=254, accepted=0.

Logs:

/data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
/data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt

The 35BA3B GPTQ output did not reproduce the corrupted repeated-output loop. The accepted-token count is still zero for this GPTQ checkpoint plus tokenizer workaround, so I am treating this as functional MTP-path coverage rather than a speedup claim.

xzh25 · 2026-06-22T12:01:49Z

Additional validation update:

OpenAI-compatible server API path is now covered on the MetaX C500 machine using Qwen3.5-9B + MTP:

vllm serve ... --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
/v1/models returned the served model.
/v1/completions returned a completion successfully.
/metrics exposed spec-decode counters: drafts=18, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14].
Log: /data/vllm-metax-contribution/ops/logs/105_qwen35_9b_mtp_openai_server_api.txt
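For reference, counters like these can be pulled out of the Prometheus text that `/metrics` returns with a few lines of parsing. This is a sketch, not the script used for the log above: the function name is hypothetical, the sample metric names and labels are illustrative (only the `vllm:spec_decode_` prefix is taken from this PR), and the parser assumes samples carry no trailing timestamp.

```python
def parse_spec_decode_metrics(text, prefix="vllm:spec_decode_"):
    """Collect samples whose metric name starts with prefix."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated field on the line.
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(prefix):
            samples[name_and_labels] = float(value)
    return samples


# Illustrative sample text; values match the counters reported above.
SAMPLE = """\
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="m"} 18.0
vllm:spec_decode_num_draft_tokens_total{model_name="m"} 36.0
vllm:spec_decode_num_accepted_tokens_total{model_name="m"} 31.0
vllm:num_requests_running{model_name="m"} 0.0
"""

metrics = parse_spec_decode_metrics(SAMPLE)
```

Against a live server, the text would come from an HTTP GET on the `/metrics` endpoint instead of the inline sample.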

I also checked whether the 35BA3B GPTQ accepted-token issue was just caused by num_speculative_tokens=2. With num_speculative_tokens=1, the model still generated normally but accepted tokens stayed at zero:

drafts=63, draft_tokens=63, accepted=0, accepted_tokens_per_pos=[0]
Log: /data/vllm-metax-contribution/ops/logs/106_qwen36_35ba3b_gptq_mtp_one_token_check.txt

So the server API validation gap is closed for a known-good MTP model. The remaining 35BA3B limitation appears checkpoint/tokenizer-specific rather than a simple spec-token-count parameter issue.

xzh25 · 2026-06-22T14:39:35Z

Follow-up validation on the official/FlagRelease 35BA3B dual-card path is now complete.

Environment:

2x MetaX C500 64GB, MACA 3.5.3.20
vLLM 0.23.0, tensor_parallel_size=2, CUDA_VISIBLE_DEVICES=0,1
model: FlagRelease/Qwen3.6-35B-A3B-nomtp-metax-FlagOS
model integrity: 21 safetensors shards, 71,903,776,768 safetensors bytes, 19 indexed MTP weights, no missing shards

MTP result, max_tokens=128:

target model + MTP drafter loaded successfully (Qwen3_5MoeForConditionalGeneration + Qwen3_5MoeMTP)
drafts=49, draft_tokens=98, accepted_tokens=79, accepted_tokens_per_pos=[46, 33]
GEN_S=6.293, OUTPUT_TOKENS_PER_S=20.3389

No-MTP comparison, same prompt/model/tokenizer/TP=2, max_tokens=128:

GEN_S=12.054, OUTPUT_TOKENS_PER_S=10.619

So the official/FlagRelease 35BA3B TP=2 path does show accepted-token MTP acceleration in this setup: about 1.915x output-token throughput for this prompt. This also explains the earlier single-card GPTQ 35BA3B result more narrowly: that checkpoint/tokenizer/quantization path ran functionally but had accepted_tokens=0, while the official dual-card checkpoint produces accepted tokens normally.
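The speedup and acceptance figures follow directly from the numbers reported in this comment. The short check below only restates that arithmetic; the variable names are illustrative.

```python
# Throughput figures reported above (output tokens per second).
mtp_tok_s = 20.3389
no_mtp_tok_s = 10.619
speedup = mtp_tok_s / no_mtp_tok_s

# Spec-decode counters reported above for the MTP run.
drafts = 49
draft_tokens = 98
accepted_tokens = 79
accepted_per_pos = [46, 33]

# Share of drafted tokens that the target model accepted.
acceptance_rate = accepted_tokens / draft_tokens

# Per-position acceptance: each draft proposes one token per position.
per_pos_rate = [count / drafts for count in accepted_per_pos]

print(f"speedup: {speedup:.3f}x")
print(f"acceptance rate: {acceptance_rate:.3f}")
print("per-position rate:", [round(r, 3) for r in per_pos_rate])
```

The per-position counts sum to the accepted total, and the second position is accepted less often than the first.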

Remote logs:

/data/vllm-metax-contribution/ops/logs/202_dual_c500_flagrelease_35ba3b_mtp_smoke.log
/data/vllm-metax-contribution/ops/logs/203_dual_c500_flagrelease_35ba3b_no_mtp_smoke.log
/data/vllm-metax-contribution/ops/logs/204_dual_c500_flagrelease_35ba3b_mtp_128.log
/data/vllm-metax-contribution/ops/logs/205_dual_c500_flagrelease_35ba3b_no_mtp_128.log

Contest material update pushed: https://github.com/xzh25/vllm-metax-qwen3-mtp261/commit/fd1d4a7

ILikeIneine · 2026-07-02T07:19:29Z

Sorry for the late response, and thanks for your effort! vllm-metax requires maca3.8.0.x after v0.21.0, so all work and validation on maca3.7 (and lower) has stopped. However, maca3.8 is not officially released yet, so for now only internal developers can test it with the latest vLLM.

xzh25 · 2026-07-02T08:27:16Z

Thanks for explaining the MACA 3.8 requirement for vLLM-metax after v0.21.0.

Given that MACA 3.8 is not publicly available yet, could you recommend an issue or contribution area that external contributors can realistically validate and that maintainers would be willing to review/accept?

I want to avoid spending more time on work that cannot be validated outside the internal MACA 3.8 environment. For PR #294, I can keep it scoped as a related compatibility fix and wait for internal validation if that is useful. Otherwise, I am happy to switch to a maintainer-recommended issue that satisfies these constraints:

reproducible on publicly accessible MetaX/Gitee.AI environments or released Docker images;
does not require unreleased MACA 3.8 internals;
has a clear expected behavior/test plan;
is still relevant to the current vLLM-metax roadmap and not already fixed internally.

If there is a label, branch, roadmap item, or specific issue that you prefer external contributors to work on under the current public environment constraints, please point me to it. That would help me avoid creating noise or duplicate/unmergeable work.

Fix MetaX vLLM 0.23 MTP import compatibility

045198c

gemini-code-assist Bot reviewed Jun 22, 2026


Handle missing torch.cuda in MetaX compat hooks

ff67715

xzh25 mentioned this pull request Jun 22, 2026

[Bug]: qwen 3.6 MTP output content corrupted #261

Open

xzh25 marked this pull request as ready for review June 22, 2026 08:02

Test compat import without torch.cuda

0466e6f

Handle missing MoE disable_inplace config

82fccb2

xzh25 changed the title from "Fix MetaX vLLM 0.23 MTP import compatibility" to "Fix MetaX vLLM 0.23 MTP compatibility" on Jun 22, 2026

xzh25 wants to merge 4 commits into MetaX-MACA:master from xzh25:fix-qwen3-mtp261-compat
