Fix MetaX vLLM 0.23 MTP compatibility #294
Code Review
This pull request introduces early compatibility hooks in vllm_metax/compat.py to address runtime mismatches between vLLM and MetaX, including defining missing custom operator schemas and mapping torch.cuda attributes to torch.accelerator when applicable. It also adds a test to verify these compatibility imports. Feedback was provided to safely retrieve the torch.cuda module using getattr to prevent potential AttributeError exceptions in CPU-only or non-CUDA environments.
if hasattr(torch, "accelerator"):
    for _name in (
        "current_device",
        "device_count",
        "empty_cache",
        "is_available",
        "max_memory_allocated",
        "mem_get_info",
        "memory_allocated",
        "memory_reserved",
        "memory_stats",
        "reset_peak_memory_stats",
        "set_device",
        "synchronize",
    ):
        if not hasattr(torch.accelerator, _name) and hasattr(torch.cuda, _name):
            setattr(torch.accelerator, _name, getattr(torch.cuda, _name))
    if (
        not hasattr(torch.accelerator, "current_device_index")
        and hasattr(torch.cuda, "current_device")
    ):
        torch.accelerator.current_device_index = torch.cuda.current_device
Directly accessing torch.cuda can raise an AttributeError in environments where CUDA is not available or not compiled (such as CPU-only environments or certain specialized builds). To prevent import-time failures, retrieve the cuda module safely using getattr(torch, "cuda", None) before checking or copying its attributes.
if hasattr(torch, "accelerator"):
    cuda_module = getattr(torch, "cuda", None)
    if cuda_module is not None:
        for _name in (
            "current_device",
            "device_count",
            "empty_cache",
            "is_available",
            "max_memory_allocated",
            "mem_get_info",
            "memory_allocated",
            "memory_reserved",
            "memory_stats",
            "reset_peak_memory_stats",
            "set_device",
            "synchronize",
        ):
            if not hasattr(torch.accelerator, _name) and hasattr(cuda_module, _name):
                setattr(torch.accelerator, _name, getattr(cuda_module, _name))
        if (
            not hasattr(torch.accelerator, "current_device_index")
            and hasattr(cuda_module, "current_device")
        ):
            torch.accelerator.current_device_index = cuda_module.current_device
Addressed in ff67715.
I changed the compat hook to use cuda_module = getattr(torch, "cuda", None) before copying CUDA helpers onto torch.accelerator, so importing vllm_metax.compat will not assume a torch.cuda attribute exists.
Validated on the MetaX C500 environment:
- python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 1 passed
- python -m py_compile vllm_metax/compat.py -> ok
- git diff --check -> ok
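The import-safety behaviour discussed in this thread can be exercised without a real PyTorch install. The following is a minimal sketch, not the code in vllm_metax/compat.py: `backfill_accelerator` and the `SimpleNamespace` stand-ins are hypothetical names introduced here for illustration, and only the helper list is taken from the diff above.

```python
from types import SimpleNamespace

# Helper names mirrored from the diff in the review thread.
_HELPER_NAMES = (
    "current_device",
    "device_count",
    "empty_cache",
    "is_available",
    "max_memory_allocated",
    "mem_get_info",
    "memory_allocated",
    "memory_reserved",
    "memory_stats",
    "reset_peak_memory_stats",
    "set_device",
    "synchronize",
)


def backfill_accelerator(torch_like):
    """Copy missing helpers from .cuda onto .accelerator; return the names copied.

    Hypothetical helper: `torch_like` is any object that may or may not
    expose `accelerator` and `cuda` attributes.
    """
    accelerator = getattr(torch_like, "accelerator", None)
    cuda_module = getattr(torch_like, "cuda", None)
    if accelerator is None or cuda_module is None:
        return []  # nothing to copy, and no AttributeError is raised

    copied = []
    for name in _HELPER_NAMES:
        # Never overwrite a helper the accelerator namespace already provides.
        if not hasattr(accelerator, name) and hasattr(cuda_module, name):
            setattr(accelerator, name, getattr(cuda_module, name))
            copied.append(name)

    if not hasattr(accelerator, "current_device_index") and hasattr(
        cuda_module, "current_device"
    ):
        accelerator.current_device_index = cuda_module.current_device
        copied.append("current_device_index")
    return copied


# Stand-in with an accelerator namespace but no cuda attribute at all.
no_cuda = SimpleNamespace(accelerator=SimpleNamespace())

# Stand-in where cuda offers two helpers and accelerator already has one of them.
with_cuda = SimpleNamespace(
    accelerator=SimpleNamespace(synchronize=lambda: "native"),
    cuda=SimpleNamespace(current_device=lambda: 0, synchronize=lambda: "cuda"),
)
```

Returning the list of copied names keeps the hook easy to assert on in a regression test: the no-CUDA case should copy nothing, and an existing accelerator helper should be left untouched.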
Additional MetaX validation completed after moving this PR to ready-for-review. Environment captured in
Validation matrix saved in
Both no-MTP and MTP paths generated normal output. The MTP cases also reported active spec-decode metrics, which confirms the
Notes:
Added one more regression test in
Revalidated on the MetaX environment:
I also updated the PR description with the pre-fix failure logs and the latest validation matrix.
I also ran a longer MTP output stress check to look specifically for the repeated/damaged-output symptom from #261.
Log:
Configuration:
Result:
No repeated-output loop was observed in this longer MTP run after the compatibility fix.
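A repeated-output loop of the kind described in #261 can be flagged numerically as well as by eye. The sketch below shows one common way to do that, a repeated n-gram ratio over the generated tokens. The thread does not say how its own stress check measured repetition, so `repeated_ngram_ratio` and its definition are assumptions made for illustration.

```python
from collections import Counter


def repeated_ngram_ratio(tokens, n):
    """Fraction of n-gram occurrences that duplicate an n-gram seen elsewhere.

    Assumed definition: 1 - (distinct n-grams / total n-grams).
    0.0 means every n-gram is unique; values near 1.0 suggest a loop.
    """
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # sequence too short to contain any n-gram
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return 1.0 - len(counts) / total


# Illustrative inputs only; real checks would use the generated token ids.
healthy = "the quick brown fox jumps over the lazy dog".split()
looping = ["a", "b"] * 10
```

Comparing the ratio for n = 2, 3, 4 against a no-MTP baseline on the same prompt gives a rough signal; a sharp rise under MTP would be worth inspecting by hand.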
Added one more compatibility fix and validation pass for 35BA3B MTP.
New commit:
What changed:
Validation on the MetaX C500 machine:
Logs:
The 35BA3B GPTQ output did not reproduce the corrupted repeated-output loop. The accepted-token count is still zero for this GPTQ checkpoint plus tokenizer workaround, so I am treating this as functional MTP-path coverage rather than a speedup claim.
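To make the "coverage, not speedup" distinction concrete, the spec-decode counters can be reduced to a few ratios. This is a minimal sketch: the counter values come from the PR description's 35BA3B GPTQ run (drafts=31, draft_tokens=62, accepted=0), while `spec_decode_summary` and the formulas are assumptions about how such counters are usually combined, not code from vLLM or this PR.

```python
def spec_decode_summary(drafts, draft_tokens, accepted):
    """Derive simple ratios from speculative-decoding counters.

    Assumed definitions:
      tokens_per_draft       = draft_tokens / drafts
      acceptance_rate        = accepted / draft_tokens
      mean_acceptance_length = 1 + accepted / drafts
    The last one counts the token the target model always emits per step
    plus the accepted draft tokens, so 1.0 means no gain from drafting.
    """
    tokens_per_draft = draft_tokens / drafts if drafts else 0.0
    acceptance_rate = accepted / draft_tokens if draft_tokens else 0.0
    mean_acceptance_length = 1.0 + (accepted / drafts if drafts else 0.0)
    return {
        "tokens_per_draft": tokens_per_draft,
        "acceptance_rate": acceptance_rate,
        "mean_acceptance_length": mean_acceptance_length,
    }


# Counters reported in the PR description for the 35BA3B GPTQ + MTP run.
gptq_run = spec_decode_summary(drafts=31, draft_tokens=62, accepted=0)
```

With zero accepted tokens the acceptance rate is 0 and the mean acceptance length stays at 1, which matches the reading that the drafter ran but contributed no accepted tokens.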
Additional validation update:
So the server API validation gap is closed for a known-good MTP model. The remaining 35BA3B limitation appears checkpoint/tokenizer-specific rather than a simple spec-token-count parameter issue.
Follow-up validation on the official/FlagRelease 35BA3B dual-card path is now complete.
Environment:
MTP result,
No-MTP comparison, same prompt/model/tokenizer/TP=2,
So the official/FlagRelease 35BA3B TP=2 path does show accepted-token MTP acceleration in this setup: about
Remote logs:
Contest material update pushed: https://github.com/xzh25/vllm-metax-qwen3-mtp261/commit/fd1d4a7
Sorry for the late response. First, thanks for your effort! vllm-metax requires maca3.8.0.x after v0.21.0, so all work and validation on maca3.7 (and lower) has stopped. However, maca3.8 is not officially released yet, so for now only internal developers can test it against the latest vllm.
Thanks for explaining the MACA 3.8 requirement for vLLM-metax after v0.21.0. Given that MACA 3.8 is not publicly available yet, could you recommend an issue or contribution area that external contributors can realistically validate and that maintainers would be willing to review/accept? I want to avoid spending more time on work that cannot be validated outside the internal MACA 3.8 environment. For PR #294, I can keep it scoped as a related compatibility fix and wait for internal validation if that is useful. Otherwise, I am happy to switch to a maintainer-recommended issue that satisfies these constraints:
If there is a label, branch, roadmap item, or specific issue that you prefer external contributors to work on under the current public environment constraints, please point me to it. That would help me avoid creating noise or duplicate/unmergeable work.
Summary
Fix MetaX plugin startup for vLLM 0.23 Qwen3.5/Qwen3 MTP speculative decoding by installing early compatibility hooks before plugin-specific patches and vLLM compilation/fusion modules are imported.
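The ordering problem can be modelled with a toy registry: a module that resolves operators at import time fails unless the compatibility hooks ran first. This is an illustrative sketch only. `ToyOpRegistry`, `install_compat_hooks`, and `import_fusion_module` are hypothetical names, the registry is not the real torch dispatcher, and the schema strings are placeholders; only the two operator names are taken from this PR.

```python
class ToyOpRegistry:
    """Toy stand-in for an operator schema table (not the real torch dispatcher)."""

    def __init__(self):
        self._schemas = {}

    def define_if_missing(self, name, schema):
        # Idempotent: never overwrite a schema that is already registered.
        if name in self._schemas:
            return False
        self._schemas[name] = schema
        return True

    def resolve(self, name):
        if name not in self._schemas:
            raise RuntimeError(f"operator {name} does not exist")
        return self._schemas[name]


# Operator names from this PR; the schema strings are placeholders only.
COMPAT_OPS = {
    "_C::scaled_fp4_quant.out": "<placeholder schema>",
    "_C::silu_and_mul_per_block_quant": "<placeholder schema>",
}


def install_compat_hooks(registry):
    """Register every missing op and return the names that were added."""
    return [
        name
        for name, schema in COMPAT_OPS.items()
        if registry.define_if_missing(name, schema)
    ]


def import_fusion_module(registry):
    """Models a module that resolves its operators as soon as it is imported."""
    return [registry.resolve(name) for name in COMPAT_OPS]
```

Making the hook idempotent matters because spawned worker processes may import the package again; a second call should be a no-op rather than an error.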
Root cause
The vLLM 0.23 MTP path imports quantization/fusion modules early, including vllm._custom_ops and vllm.compilation.passes.fusion.act_quant_fusion. On the current MetaX runtime those imports can reference _C::scaled_fp4_quant.out, _C::silu_and_mul_per_block_quant, and newer torch.accelerator helpers before the MetaX plugin has registered compatible shims. In spawned EngineCore workers this prevents Qwen3.5 MTP from reaching generation.
Changes
- vllm_metax.compat and import it at package import time.
- _C custom ops.
- torch.accelerator helpers from torch.cuda when that module exists.
- triton_support import path loading the same compatibility hooks.
- act_quant_fusion import and import safety when torch.cuda is absent.
Before this fix
On the same MetaX validation environment, MTP could fail before generation:
- 079_qwen35_9b_mtp261_smoke.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
- 084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt: AttributeError: module 'torch.accelerator' has no attribute 'empty_cache'
- 088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
Validation
Remote MetaX C500 environment:
Checks:
- python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 2 passed
- import vllm_metax.quant_config; import vllm.compilation.passes.fusion.act_quant_fusion -> ok
- python -m py_compile ... -> ok
- git diff --check -> ok
Validation matrix using the same prompt and max_tokens=96:
Both no-MTP and MTP paths generated normal output. The MTP cases also reported active spec-decode metrics, confirming the qwen3_next_mtp/mtp path was exercised rather than falling back to regular decoding.
Logs are archived on the validation machine under /data/vllm-metax-contribution/ops/logs/079_*, 084_*, 088_*, 092_*, and 093_*.
Notes:
- TokenizersBackend tokenizer config that this Transformers environment cannot instantiate directly; the 27B smoke/validation runs use the local Qwen3.5-9B tokenizer workaround.
Additional 35BA3B GPTQ validation
After the original import compatibility fix, I also downloaded and tested a ModelScope 35BA3B GPTQ checkpoint because the validation machine did not initially have a runnable 35BA3B MTP model:
- potter001/gptq-Qwen3.6-35B-A3B-4bit-group to /data/vllm-metax-contribution/models/gptq-Qwen3.6-35B-A3B-4bit-group
- qwen3_5_moe, mtp_num_hidden_layers=1, quant_method=gptq, 7 safetensors shards, 19.27 GiB after removing Git LFS cache
- fp8 quantization is currently not supported in maca
The GPTQ 35BA3B path exposed one additional vLLM 0.23 compatibility issue:
MoeWNA16Method.apply() read self.moe.disable_inplace, but the current FusedMoEConfig does not always define that field. This PR now falls back to False when the field is absent and adds a focused regression test.
35BA3B GPTQ + MTP validation after that fix:
- python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py: 3 passed
- /data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
- LOAD_S 130.03
- drafts=31, draft_tokens=62, accepted=0
- /data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt
- OUTPUT_TOKENS_PER_S 9.2379
- 0.0806, trigram 0.0492, 4-gram 0.0167
- drafts=127, draft_tokens=254, accepted=0, accepted_tokens_per_pos=[0, 0]
The 35BA3B GPTQ output did not show the repeated corrupted loop reported in #261. The zero accepted-token count is called out explicitly because this GPTQ checkpoint plus tokenizer workaround exercises the MTP drafter path but does not demonstrate speedup.
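The disable_inplace fallback described in this section comes down to reading an optional attribute with a default. A minimal sketch, assuming a config object that may or may not define the field: `resolve_disable_inplace` and the `SimpleNamespace` configs are hypothetical stand-ins, not the actual MoeWNA16Method or FusedMoEConfig code.

```python
from types import SimpleNamespace


def resolve_disable_inplace(moe_config):
    """Return moe_config.disable_inplace, or False when the field is not defined."""
    return bool(getattr(moe_config, "disable_inplace", False))


# Hypothetical config that predates the field.
legacy_config = SimpleNamespace(num_experts=8)

# Hypothetical config that defines the field explicitly.
newer_config = SimpleNamespace(num_experts=8, disable_inplace=True)
```

Defaulting to False keeps the existing in-place behaviour for configs without the field, so the fallback changes nothing for callers that never set it.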
Related to #261. This PR fixes MTP compatibility blockers found while reproducing #261. It does not claim to close the original repeated-output corruption without the exact original reproduction prompt and sampling parameters.
2026-07-02 dual-C500 FlagRelease 35BA3B validation
Validation environment:
- FlagRelease/Qwen3.6-35B-A3B-nomtp-metax-FlagOS
Before/after compatibility check:
- ce08bf57 worktree fails while importing the MTP/triton-support path with RuntimeError: operator _C::scaled_fp4_quant.out does not exist.
- vllm_metax.patch.bugfix.triton_support.custom_op_schemas and vllm.compilation.passes.fusion.act_quant_fusion import successfully; _C::scaled_fp4_quant.out and _C::silu_and_mul_per_block_quant schemas are present.
- python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py -> 3 passed.
35BA3B MTP/no-MTP comparison on the same prompt:
[46, 33]
[63]
The MTP runs initialized as SpeculativeConfig(method='mtp', num_spec_tokens=...), loaded the MTP drafter, emitted vllm:spec_decode_* metrics, and did not show the repeated corrupted-output loop in this validation prompt.