
Fix MetaX vLLM 0.23 MTP compatibility #294

Open
xzh25 wants to merge 4 commits into MetaX-MACA:master from xzh25:fix-qwen3-mtp261-compat

@xzh25

xzh25 commented Jun 22, 2026


Summary

Fix MetaX plugin startup for vLLM 0.23 Qwen3.5/Qwen3 MTP speculative decoding by installing early compatibility hooks before plugin-specific patches and vLLM compilation/fusion modules are imported.

Root cause

The vLLM 0.23 MTP path imports quantization/fusion modules early, including vllm._custom_ops and vllm.compilation.passes.fusion.act_quant_fusion. On the current MetaX runtime those imports can reference _C::scaled_fp4_quant.out, _C::silu_and_mul_per_block_quant, and newer torch.accelerator helpers before the MetaX plugin has registered compatible shims. In spawned EngineCore workers this prevents Qwen3.5 MTP from reaching generation.

Changes

  • Add vllm_metax.compat and import it at package import time.
  • Register fragment schemas for the missing import-time _C custom ops.
  • Backfill missing torch.accelerator helpers from torch.cuda when that module exists.
  • Keep the legacy triton_support import path loading the same compatibility hooks.
  • Add focused regression tests covering act_quant_fusion import and import safety when torch.cuda is absent.

Before this fix

On the same MetaX validation environment, MTP could fail before generation:

  • 079_qwen35_9b_mtp261_smoke.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
  • 084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt: AttributeError: module 'torch.accelerator' has no attribute 'empty_cache'
  • 088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist

Validation

Remote MetaX C500 environment:

  • Driver: 3.8.30
  • MACA SDK: 3.5.3.20
  • Python: 3.10.10
  • vLLM: 0.23.0
  • torch: 2.8.0+metax3.5.3.9

Checks:

  • python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 2 passed
  • import vllm_metax.quant_config; import vllm.compilation.passes.fusion.act_quant_fusion -> ok
  • python -m py_compile ... -> ok
  • git diff --check -> ok

Validation matrix using the same prompt and max_tokens=96:

| Case | MTP | Load (s) | Generate (s) | Output tok/s | Repeated bigram ratio | Spec decode |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-9B | no | 90.289 | 3.995 | 24.0311 | 0.0125 | n/a |
| Qwen3.5-9B | yes | 93.406 | 3.352 | 28.6370 | 0.0125 | drafts=45, draft_tokens=90, accepted=50, per_pos=[30, 20] |
| Qwen3.5-27B-W8A8 | no | 99.703 | 9.338 | 10.2801 | 0.0444 | n/a |
| Qwen3.5-27B-W8A8 | yes | 102.532 | 4.434 | 21.6529 | 0.0000 | drafts=35, draft_tokens=70, accepted=60, per_pos=[31, 29] |

Both no-MTP and MTP paths generated normal output. The MTP cases also reported active spec-decode metrics, confirming the qwen3_next_mtp/mtp path was exercised rather than falling back to regular decoding.
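The throughput gain and overall acceptance implied by the matrix can be recomputed from the reported numbers. The following is a minimal sketch, assuming the acceptance rate is defined as accepted / draft_tokens (the PR does not define it); `ROWS`, `speedup`, and `acceptance_rate` are hypothetical names used only for illustration.

```python
# Values copied from the validation matrix above, per model:
# (tok/s without MTP, tok/s with MTP, draft_tokens, accepted).
ROWS = {
    "Qwen3.5-9B": (24.0311, 28.6370, 90, 50),
    "Qwen3.5-27B-W8A8": (10.2801, 21.6529, 70, 60),
}


def speedup(base_tps, mtp_tps):
    """Ratio of MTP output throughput to the no-MTP baseline."""
    return mtp_tps / base_tps


def acceptance_rate(draft_tokens, accepted):
    """Assumed definition: share of drafted tokens the target model accepted."""
    return accepted / draft_tokens


for name, (base, mtp, drafted, accepted) in ROWS.items():
    print(f"{name}: speedup={speedup(base, mtp):.2f}x, "
          f"acceptance={acceptance_rate(drafted, accepted):.1%}")
```

Under that assumed definition, the 9B run comes out at roughly 1.19x with about 55.6% of draft tokens accepted, and the 27B-W8A8 run at roughly 2.11x with about 85.7% accepted. These are single-prompt figures and should not be read as general benchmarks.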

Logs are archived on the validation machine under /data/vllm-metax-contribution/ops/logs/079_*, 084_*, 088_*, 092_*, and 093_*.

Notes:

  • The available 30B-A3B model configs on the validation machine do not contain MTP metadata, so they were not used as Qwen3.5 MTP substitutes.
  • Qwen3.5-27B-W8A8 has a TokenizersBackend tokenizer config that this Transformers environment cannot instantiate directly; the 27B smoke/validation runs use the local Qwen3.5-9B tokenizer workaround.

Additional 35BA3B GPTQ validation

After the original import compatibility fix, I also downloaded and tested a ModelScope 35BA3B GPTQ checkpoint because the validation machine did not initially have a runnable 35BA3B MTP model:

  • Downloaded: potter001/gptq-Qwen3.6-35B-A3B-4bit-group to /data/vllm-metax-contribution/models/gptq-Qwen3.6-35B-A3B-4bit-group
  • Model metadata: qwen3_5_moe, mtp_num_hidden_layers=1, quant_method=gptq, 7 safetensors shards, 19.27 GiB after removing Git LFS cache
  • Also checked but did not use for final validation:
    • official FP8: MetaX rejects it with "fp8 quantization is currently not supported in maca"
    • Eco-Tech W8A8: Ascend/msmodelslim format lacks vLLM quantization config and OOMs as unquantized MoE
    • FlagRelease Metax BF16: ~71.9 GiB and documented for TP=2, not usable on the single C500 validation setup

The GPTQ 35BA3B path exposed one additional vLLM 0.23 compatibility issue: MoeWNA16Method.apply() read self.moe.disable_inplace, but the current FusedMoEConfig does not always define that field. This PR now falls back to False when the field is absent and adds a focused regression test.
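The fallback described here amounts to an attribute lookup with a default. The following is a minimal sketch of that pattern, not the actual patch: `resolve_disable_inplace` is a hypothetical helper, and `SimpleNamespace` objects stand in for `FusedMoEConfig`, which is not imported.

```python
from types import SimpleNamespace


def resolve_disable_inplace(moe_config):
    """Return moe_config.disable_inplace, or False when the field is absent."""
    return getattr(moe_config, "disable_inplace", False)


# Stand-ins for config objects (assumption: only the attribute lookup matters).
with_field = SimpleNamespace(disable_inplace=True)
without_field = SimpleNamespace()  # the field is not defined at all

print(resolve_disable_inplace(with_field))
print(resolve_disable_inplace(without_field))
```

Defaulting to `False` keeps the in-place behavior whenever the field is missing, so configs that do define it are unaffected.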

35BA3B GPTQ + MTP validation after that fix:

  • python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py: 3 passed
  • Smoke log: /data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
    • loaded target + MTP drafter, LOAD_S 130.03
    • generated 32 tokens, drafts=31, draft_tokens=62, accepted=0
  • Issue-like long check: /data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt
    • generated 128 tokens, OUTPUT_TOKENS_PER_S 9.2379
    • repetition ratios: bigram 0.0806, trigram 0.0492, 4-gram 0.0167
    • spec-decode metrics: drafts=127, draft_tokens=254, accepted=0, accepted_tokens_per_pos=[0, 0]

The 35BA3B GPTQ output did not show the repeated corrupted loop reported in #261. The zero accepted-token count is called out explicitly because this GPTQ checkpoint plus tokenizer workaround exercises the MTP drafter path but does not demonstrate speedup.

Related to #261. This PR fixes MTP compatibility blockers found while reproducing #261. It does not claim to close the original repeated-output corruption without the exact original reproduction prompt and sampling parameters.

2026-07-02 dual-C500 FlagRelease 35BA3B validation

Validation environment:

  • 2 x MetaX C500 64GB
  • MACA 3.5.3.20, driver 3.8.30
  • torch 2.8.0+metax3.5.3.9
  • vLLM 0.23.0
  • model: FlagRelease/Qwen3.6-35B-A3B-nomtp-metax-FlagOS
  • model weights: 21 safetensors, 71,903,776,768 bytes
  • tensor_parallel_size=2, max_model_len=2048, temperature=0.0, max_tokens=128

Before/after compatibility check:

  • Before this patch, the base ce08bf57 worktree fails while importing the MTP/triton-support path with RuntimeError: operator _C::scaled_fp4_quant.out does not exist.
  • After this patch, vllm_metax.patch.bugfix.triton_support.custom_op_schemas and vllm.compilation.passes.fusion.act_quant_fusion import successfully; _C::scaled_fp4_quant.out and _C::silu_and_mul_per_block_quant schemas are present.
  • Focused tests: python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py -> 3 passed.

35BA3B MTP/no-MTP comparison on the same prompt:

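| Case | MTP | num_speculative_tokens | Output tok/s | Repeat 2-gram | Repeat 3-gram | Repeat 4-gram | Accepted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| no-MTP | no | n/a | 7.3670 | 0.0267 | 0.0 | 0.0 | n/a |
| MTP | yes | 2 | 12.8181 | 0.0267 | 0.0 | 0.0 | 79/98, per pos [46, 33] |
| MTP | yes | 1 | 14.1913 | 0.0267 | 0.0 | 0.0 | 63/65, per pos [63] |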

The MTP runs initialized as SpeculativeConfig(method='mtp', num_spec_tokens=...), loaded the MTP drafter, emitted vllm:spec_decode_* metrics, and did not show the repeated corrupted-output loop in this validation prompt.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces early compatibility hooks in vllm_metax/compat.py to address runtime mismatches between vLLM and MetaX, including defining missing custom operator schemas and mapping torch.cuda attributes to torch.accelerator when applicable. It also adds a test to verify these compatibility imports. Feedback was provided to safely retrieve the torch.cuda module using getattr to prevent potential AttributeError exceptions in CPU-only or non-CUDA environments.


Comment thread vllm_metax/compat.py Outdated
Comment on lines +44 to +65
if hasattr(torch, "accelerator"):
    for _name in (
        "current_device",
        "device_count",
        "empty_cache",
        "is_available",
        "max_memory_allocated",
        "mem_get_info",
        "memory_allocated",
        "memory_reserved",
        "memory_stats",
        "reset_peak_memory_stats",
        "set_device",
        "synchronize",
    ):
        if not hasattr(torch.accelerator, _name) and hasattr(torch.cuda, _name):
            setattr(torch.accelerator, _name, getattr(torch.cuda, _name))
    if (
        not hasattr(torch.accelerator, "current_device_index")
        and hasattr(torch.cuda, "current_device")
    ):
        torch.accelerator.current_device_index = torch.cuda.current_device

Contributor


high

Directly accessing torch.cuda can raise an AttributeError in environments where CUDA is not available or not compiled (such as CPU-only environments or certain specialized builds). To prevent import-time failures, retrieve the cuda module safely using getattr(torch, "cuda", None) before checking or copying its attributes.

if hasattr(torch, "accelerator"):
    cuda_module = getattr(torch, "cuda", None)
    if cuda_module is not None:
        for _name in (
            "current_device",
            "device_count",
            "empty_cache",
            "is_available",
            "max_memory_allocated",
            "mem_get_info",
            "memory_allocated",
            "memory_reserved",
            "memory_stats",
            "reset_peak_memory_stats",
            "set_device",
            "synchronize",
        ):
            if not hasattr(torch.accelerator, _name) and hasattr(cuda_module, _name):
                setattr(torch.accelerator, _name, getattr(cuda_module, _name))
        if (
            not hasattr(torch.accelerator, "current_device_index")
            and hasattr(cuda_module, "current_device")
        ):
            torch.accelerator.current_device_index = cuda_module.current_device

Author


Addressed in ff67715.

I changed the compat hook to use cuda_module = getattr(torch, "cuda", None) before copying CUDA helpers onto torch.accelerator, so importing vllm_metax.compat will not assume a torch.cuda attribute exists.

Validated on the MetaX C500 environment:

  • python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 1 passed
  • python -m py_compile vllm_metax/compat.py -> ok
  • git diff --check -> ok

@xzh25 xzh25 marked this pull request as ready for review June 22, 2026 08:02
@xzh25

xzh25 commented Jun 22, 2026

Author

Additional MetaX validation completed after moving this PR to ready-for-review.

Environment captured in /data/vllm-metax-contribution/ops/logs/092_collect_env_mtp261.txt:

  • GPU: MetaX C500
  • Driver: 3.8.30
  • MACA SDK: 3.5.3.20
  • Python: 3.10.10
  • vLLM: 0.23.0
  • torch: 2.8.0+metax3.5.3.9
  • vLLM-metax package baseline: 0.17.0+gd10261.d20260409.maca3.5.3.20.torch2.8

Validation matrix saved in /data/vllm-metax-contribution/ops/logs/093_mtp261_validation_matrix.txt using the same prompt and max_tokens=96:

| Case | MTP | Load (s) | Generate (s) | Output tok/s | Repeated bigram ratio | Spec decode |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-9B | no | 90.289 | 3.995 | 24.0311 | 0.0125 | n/a |
| Qwen3.5-9B | yes | 93.406 | 3.352 | 28.6370 | 0.0125 | drafts=45, draft_tokens=90, accepted=50, per_pos=[30, 20] |
| Qwen3.5-27B-W8A8 | no | 99.703 | 9.338 | 10.2801 | 0.0444 | n/a |
| Qwen3.5-27B-W8A8 | yes | 102.532 | 4.434 | 21.6529 | 0.0000 | drafts=35, draft_tokens=70, accepted=60, per_pos=[31, 29] |

Both no-MTP and MTP paths generated normal output. The MTP cases also reported active spec-decode metrics, which confirms the qwen3_next_mtp/mtp path was exercised rather than falling back to regular decoding.

Notes:

  • The available 30B-A3B model configs on this validation machine do not contain MTP metadata, so they were not used as substitutes for the Qwen3.5 MTP models.
  • Qwen3.5-27B-W8A8 has a TokenizersBackend tokenizer config that this Transformers environment cannot instantiate directly; the 27B smoke/validation runs use the local Qwen3.5-9B tokenizer workaround.

@xzh25

xzh25 commented Jun 22, 2026

Author

Added one more regression test in 0466e6f for the review feedback path:

  • test_compat_import_tolerates_missing_torch_cuda temporarily removes torch.cuda and reloads vllm_metax.compat, covering CPU-only/specialized torch builds where the attribute may be absent.
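The idea behind this kind of test can be shown without torch at all. The following is a simplified stand-in, not the test from the PR: `backfill_accelerator` is a hypothetical helper mirroring the reviewed compat logic, `HELPER_NAMES` is a shortened list, and `SimpleNamespace` objects play the role of the `torch` module with and without a `cuda` attribute.

```python
from types import SimpleNamespace

# Shortened helper list, for illustration only.
HELPER_NAMES = ("empty_cache", "synchronize", "device_count")


def backfill_accelerator(torch_like):
    """Copy missing helpers from torch_like.cuda onto torch_like.accelerator."""
    accelerator = getattr(torch_like, "accelerator", None)
    cuda_module = getattr(torch_like, "cuda", None)
    if accelerator is None or cuda_module is None:
        return []  # nothing to copy, and no AttributeError is raised
    added = []
    for name in HELPER_NAMES:
        if not hasattr(accelerator, name) and hasattr(cuda_module, name):
            setattr(accelerator, name, getattr(cuda_module, name))
            added.append(name)
    return added


def test_tolerates_missing_cuda():
    fake_torch = SimpleNamespace(accelerator=SimpleNamespace())  # no .cuda
    assert backfill_accelerator(fake_torch) == []


def test_backfills_only_missing_helpers():
    cuda = SimpleNamespace(empty_cache=lambda: None, device_count=lambda: 1)
    accel = SimpleNamespace(device_count=lambda: 2)  # existing helper is kept
    fake_torch = SimpleNamespace(accelerator=accel, cuda=cuda)
    assert backfill_accelerator(fake_torch) == ["empty_cache"]
    assert accel.device_count() == 2


test_tolerates_missing_cuda()
test_backfills_only_missing_helpers()
```

The real test removes `torch.cuda` and reloads `vllm_metax.compat`; the stub version only exercises the same guard, which makes it runnable on machines without torch installed.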

Revalidated on the MetaX environment:

  • python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch -> 2 passed
  • python -m py_compile tests/patch/test_triton_custom_op_schemas.py vllm_metax/compat.py -> ok
  • git diff --check -> ok

I also updated the PR description with the pre-fix failure logs and the latest validation matrix.

@xzh25

xzh25 commented Jun 22, 2026

Author

I also ran a longer MTP output stress check to look specifically for the repeated/damaged-output symptom from #261.

Log: /data/vllm-metax-contribution/ops/logs/094_qwen35_27b_w8a8_mtp_long_output_check.txt

Configuration:

  • Model: Qwen3.5-27B-W8A8
  • Tokenizer workaround: local Qwen3.5-9B tokenizer
  • speculative_config={"method":"qwen3_next_mtp","num_speculative_tokens":2}
  • max_tokens=256

Result:

  • Generated 256 output tokens normally
  • Output throughput: 23.2837 tok/s
  • Spec decode metrics: drafts=99, draft_tokens=198, accepted=158, accepted_tokens_per_pos=[85, 73]
  • Repetition metrics: bigram repeat ratio 0.0853, trigram 0.0391, 4-gram 0.0157
  • Most repeated 4-gram was Genshin Impact Version 5.0 with count 3, which is attributable to the requested topic rather than a looped/damaged output pattern

No repeated-output loop was observed in this longer MTP run after the compatibility fix.
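The comment does not say how the repetition ratios were computed. One plausible definition, shown here purely as an assumption, is the share of n-grams that duplicate an earlier n-gram, that is 1 minus distinct n-grams over total n-grams. `repeat_ratio` and the two sample token lists are hypothetical.

```python
def repeat_ratio(tokens, n):
    """Assumed definition: 1 - (distinct n-grams / total n-grams)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Toy inputs: one sentence with no repeated n-grams, one degenerate loop.
healthy = "the quick brown fox jumps over the lazy dog".split()
looped = ("again and " * 8).split()

for label, toks in (("healthy", healthy), ("looped", looped)):
    print(label, [round(repeat_ratio(toks, n), 4) for n in (2, 3, 4)])
```

With a metric of this shape, a looped output pushes the ratio toward 1, while values below 0.1 like those reported above are consistent with ordinary text that revisits a topic phrase a few times.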

@xzh25 xzh25 changed the title from "Fix MetaX vLLM 0.23 MTP import compatibility" to "Fix MetaX vLLM 0.23 MTP compatibility" Jun 22, 2026
@xzh25

xzh25 commented Jun 22, 2026

Author

Added one more compatibility fix and validation pass for 35BA3B MTP.

New commit: 82fccb2 Handle missing MoE disable_inplace config

What changed:

  • MoeWNA16Method.apply() now tolerates vLLM 0.23 FusedMoEConfig objects that do not define disable_inplace, defaulting to the existing in-place behavior.
  • Added tests/patch/test_moe_wna16_compat.py to cover that compatibility case.

Validation on the MetaX C500 machine:

  • python -m pytest -q --confcutdir=tests/patch tests/patch/test_triton_custom_op_schemas.py tests/patch/test_moe_wna16_compat.py -> 3 passed
  • Downloaded and tested potter001/gptq-Qwen3.6-35B-A3B-4bit-group from ModelScope because the machine did not initially have a runnable 35BA3B MTP model.
  • 35BA3B GPTQ + MTP smoke after the fix: target + MTP drafter loaded, generated 32 tokens, spec metrics reported drafts=31, draft_tokens=62, accepted=0.
  • 35BA3B GPTQ + MTP issue-like 128-token check: generated normally at 9.2379 tok/s; repeat ratios were bigram 0.0806, trigram 0.0492, 4-gram 0.0167; spec metrics reported drafts=127, draft_tokens=254, accepted=0.

Logs:

  • /data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
  • /data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt

The 35BA3B GPTQ output did not reproduce the corrupted repeated-output loop. The accepted-token count is still zero for this GPTQ checkpoint plus tokenizer workaround, so I am treating this as functional MTP-path coverage rather than a speedup claim.

@xzh25

xzh25 commented Jun 22, 2026

Author

Additional validation update:

  1. OpenAI-compatible server API path is now covered on the MetaX C500 machine using Qwen3.5-9B + MTP:
    • vllm serve ... --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    • /v1/models returned the served model.
    • /v1/completions returned a completion successfully.
    • /metrics exposed spec-decode counters: drafts=18, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14].
    • Log: /data/vllm-metax-contribution/ops/logs/105_qwen35_9b_mtp_openai_server_api.txt
  2. I also checked whether the 35BA3B GPTQ accepted-token issue was just caused by num_speculative_tokens=2. With num_speculative_tokens=1, the model still generated normally but accepted tokens stayed at zero:
    • drafts=63, draft_tokens=63, accepted=0, accepted_tokens_per_pos=[0]
    • Log: /data/vllm-metax-contribution/ops/logs/106_qwen36_35ba3b_gptq_mtp_one_token_check.txt

So the server API validation gap is closed for a known-good MTP model. The remaining 35BA3B limitation appears checkpoint/tokenizer-specific rather than a simple spec-token-count parameter issue.
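For readers who want to repeat the /metrics check, the counters can be pulled out of the Prometheus text output with a small parser. The following is a sketch under stated assumptions: only the `vllm:spec_decode_` prefix comes from this PR, while the metric suffixes, the `model_name` label, and the `SAMPLE` text are hypothetical and not copied from any log.

```python
import re

# Hypothetical sample in Prometheus text format. The metric suffixes and the
# label are assumptions; only the vllm:spec_decode_ prefix is from the PR.
SAMPLE = """\
# HELP vllm:spec_decode_num_drafts_total Number of drafts.
vllm:spec_decode_num_drafts_total{model_name="demo"} 18.0
vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 36.0
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 31.0
vllm:other_metric{model_name="demo"} 7.0
"""

# metric name, optional {labels}, whitespace, value
LINE = re.compile(r'^(?P<name>[A-Za-z_:][\w:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)$')


def spec_decode_counters(text):
    """Sum every sample whose metric name starts with vllm:spec_decode_."""
    counters = {}
    for raw in text.splitlines():
        if raw.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = LINE.match(raw.strip())
        if match and match.group("name").startswith("vllm:spec_decode_"):
            name = match.group("name")
            counters[name] = counters.get(name, 0.0) + float(match.group("value"))
    return counters


counters = spec_decode_counters(SAMPLE)
for name, value in sorted(counters.items()):
    print(name, value)
```

Non-zero draft and accepted counters after a completion request are what distinguishes an exercised MTP path from a silent fallback to regular decoding. Note that this sketch sums samples across label sets, so a per-position metric would be collapsed into one total.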

@xzh25

xzh25 commented Jun 22, 2026

Author

Follow-up validation on the official/FlagRelease 35BA3B dual-card path is now complete.

Environment:

  • 2x MetaX C500 64GB, MACA 3.5.3.20
  • vLLM 0.23.0, tensor_parallel_size=2, CUDA_VISIBLE_DEVICES=0,1
  • model: FlagRelease/Qwen3.6-35B-A3B-nomtp-metax-FlagOS
  • model integrity: 21 safetensors shards, 71,903,776,768 safetensors bytes, 19 indexed MTP weights, no missing shards

MTP result, max_tokens=128:

  • target model + MTP drafter loaded successfully (Qwen3_5MoeForConditionalGeneration + Qwen3_5MoeMTP)
  • drafts=49, draft_tokens=98, accepted_tokens=79, accepted_tokens_per_pos=[46, 33]
  • GEN_S=6.293, OUTPUT_TOKENS_PER_S=20.3389

No-MTP comparison, same prompt/model/tokenizer/TP=2, max_tokens=128:

  • GEN_S=12.054, OUTPUT_TOKENS_PER_S=10.619

So the official/FlagRelease 35BA3B TP=2 path does show accepted-token MTP acceleration in this setup: about 1.915x output-token throughput for this prompt. This also explains the earlier single-card GPTQ 35BA3B result more narrowly: that checkpoint/tokenizer/quantization path ran functionally but had accepted_tokens=0, while the official dual-card checkpoint produces accepted tokens normally.
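The 1.915x figure and the per-position acceptance can be checked with a few lines of arithmetic. The following is a sketch, assuming each draft proposes one token per position and each verification step emits the accepted tokens plus one token from the target model; the variable names are hypothetical.

```python
# Numbers copied from the dual-C500 FlagRelease run reported above.
drafts = 49
draft_tokens = 98
accepted_per_pos = [46, 33]
mtp_tps = 20.3389
no_mtp_tps = 10.619

accepted = sum(accepted_per_pos)

# Share of drafts whose token at each speculative position was accepted.
per_pos_rate = [count / drafts for count in accepted_per_pos]

# Share of all drafted tokens that were accepted.
overall_rate = accepted / draft_tokens

# Assumed model: accepted tokens plus one target token per verification step.
mean_tokens_per_step = 1 + accepted / drafts

throughput_ratio = mtp_tps / no_mtp_tps

print(f"accepted={accepted}, overall acceptance={overall_rate:.1%}")
print("per-position acceptance:", [f"{rate:.1%}" for rate in per_pos_rate])
print(f"mean tokens per step={mean_tokens_per_step:.2f}")
print(f"throughput ratio={throughput_ratio:.3f}x")
```

The per-position split shows the usual pattern of the second speculative position being accepted less often than the first, which is also why the gain stays below the theoretical 3 tokens per step for num_speculative_tokens=2.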

Remote logs:

  • /data/vllm-metax-contribution/ops/logs/202_dual_c500_flagrelease_35ba3b_mtp_smoke.log
  • /data/vllm-metax-contribution/ops/logs/203_dual_c500_flagrelease_35ba3b_no_mtp_smoke.log
  • /data/vllm-metax-contribution/ops/logs/204_dual_c500_flagrelease_35ba3b_mtp_128.log
  • /data/vllm-metax-contribution/ops/logs/205_dual_c500_flagrelease_35ba3b_no_mtp_128.log

Contest material update pushed: https://github.com/xzh25/vllm-metax-qwen3-mtp261/commit/fd1d4a7

@ILikeIneine

Member

Sorry for the late response, and thanks for your effort! vllm-metax requires maca3.8.0.x after v0.21.0, so all work and validation on maca3.7 (and lower) has stopped. However, maca3.8 is not officially released yet, so for now only internal developers can test it against the latest vLLM.

@xzh25

xzh25 commented Jul 2, 2026

Author

Thanks for explaining the MACA 3.8 requirement for vLLM-metax after v0.21.0.

Given that MACA 3.8 is not publicly available yet, could you recommend an issue or contribution area that external contributors can realistically validate and that maintainers would be willing to review/accept?

I want to avoid spending more time on work that cannot be validated outside the internal MACA 3.8 environment. For PR #294, I can keep it scoped as a related compatibility fix and wait for internal validation if that is useful. Otherwise, I am happy to switch to a maintainer-recommended issue that satisfies these constraints:

  • reproducible on publicly accessible MetaX/Gitee.AI environments or released Docker images;
  • does not require unreleased MACA 3.8 internals;
  • has a clear expected behavior/test plan;
  • is still relevant to the current vLLM-metax roadmap and not already fixed internally.

If there is a label, branch, roadmap item, or specific issue that you prefer external contributors to work on under the current public environment constraints, please point me to it. That would help me avoid creating noise or duplicate/unmergeable work.



2 participants