Add MiniMax-M3 NVFP4 B300 single-node aggregated vLLM benchmark by Ankur-singh · Pull Request #1928 · SemiAnalysisAI/InferenceX

Ankur-singh · 2026-06-25T01:43:03Z

Adds the minimaxm3-fp4-b300-vllm config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B300, no spec decode.

Config: nvidia-master.yaml entry (fp4 / vllm / runner b300), sweeping tp 2/4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-1024.
Recipe: benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300.sh — overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) onto the perf image before serve; --block-size 128 (MSA), --language-model-only.
Weights: pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 — added MiniMax-M3-NVFP4 to the STAGED_MODELS allow-list in launch_b300-nv.sh so MODEL_PATH resolves to the staged path instead of downloading.
perf-changelog entry appended.

New minimaxm3-fp4-b300-vllm config (fp4 vLLM aggregated on b300). The benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve. Weights are pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 (added MiniMax-M3-NVFP4 to the launch_b300-nv.sh STAGED_MODELS allow-list).

github-actions · 2026-06-25T01:43:10Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

感谢你的贡献！对于 vLLM 与 SGLang，请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致，请先创建一个 PR，之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准，使整个 ML 社区都能从你的辛勤工作中受益！谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。如果选择重新运行失败的任务，PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档：https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言，PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准，然后再请求核心维护者审阅。

如需更多帮助，PR 作者可通过 Slack 联系核心维护者。

claude · 2026-06-25T01:48:54Z

+# recognise the NVFP4 quant config and falls back to an unsupported path.
+VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))")
+for f in \
+  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
+  model_executor/layers/quantization/modelopt.py \
+  model_executor/layers/quantization/utils/flashinfer_utils.py
+do
+  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
+done
+python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"


🔴 The vLLM PR #46380 overlay loop at lines 27-33 calls curl -fsSL three times without checking exit status, and the script does not enable set -e / set -o pipefail. A transient failure (rate limit, TLS hiccup, or a 404 if commit 6c08558 is force-pushed) on any of the three fetches will leave the corresponding file in the installed vLLM package empty/truncated, and the loop will proceed to start vllm serve against an inconsistent install — silently producing misleading benchmark numbers. Add set -euo pipefail at the top of the script (or append || exit 1 to each curl and to the python verification on line 34), matching the fail-fast pattern already used in the FP8 sibling at benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300.sh:31.

Extended reasoning...

What the bug is. Lines 25-34 of benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300.sh overlay three files from raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/... directly onto the installed vLLM package ($VLLM_DIR) using curl -fsSL ... -o .... None of these curl calls check their exit status, the script has no set -e / set -o errexit / set -o pipefail at the top (only set -x / set +x later around the vllm serve invocation), and the post-loop python sanity check on line 34 only imports one symbol (TrtLlmNvFp4ExpertsModular) from one of the three files — modelopt.py and flashinfer_utils.py are never validated. Even if that one python import raised, without set -e the script would continue to vllm serve anyway.\n\nWhy it manifests. With -f, curl exits non-zero on HTTP 4xx/5xx; with -o it has already opened/truncated the output file before the request completes. So a transient failure (GitHub Raw rate limit 429, TLS handshake, transient 5xx, DNS hiccup, or — for a long-lived recipe — a 404 if vllm-project ever rebases away from commit 6c08558) leaves the corresponding file in the installed package empty or truncated. Bash without set -e happily continues to the next loop iteration, and then to vllm serve.\n\nStep-by-step proof. Suppose GitHub Raw returns a transient 429 on the second iteration (modelopt.py):\n 1. Iteration 1 (trtllm_nvfp4_moe.py): 200 OK — file written correctly.\n 2. Iteration 2 (modelopt.py): 429 — curl truncates $VLLM_DIR/model_executor/layers/quantization/modelopt.py to 0 bytes and exits non-zero. Loop continues because no set -e / || exit 1.\n 3. Iteration 3 (flashinfer_utils.py): 200 OK — file written correctly.\n 4. Line 34 imports TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe.py. This succeeds because the symbol's transitive imports may not reach the truncated modelopt.py at import time.\n 5. vllm serve launches. Either it crashes much later with a confusing ImportError deep inside model loading, or — worse — vLLM falls back to a non-NVFP4 path and runs the benchmark, producing wrong throughput/latency numbers that get committed to the perf-changelog.\n\nWhy existing code doesn't prevent it. Reading the full file confirms there is no set -e / set -o errexit / set -o pipefail anywhere — only set -x around vllm serve. The only set -e in the project I found is inside a function in benchmark_lib.sh that is scoped between set +e and set -e calls and does not propagate to the caller. The python verification on line 34 covers only one of the three patched files and its exit code is also unchecked.\n\nImpact. A bench recipe that silently runs against a partially-patched vLLM produces misleading perf numbers. That is strictly worse than a clean abort, because nothing alerts the operator and the bad numbers can land in perf-changelog.yaml (PR #1928 already adds a minimaxm3-fp4-b300-vllm perf-changelog entry).\n\nHow to fix. Two trivial options, either is sufficient: (a) add set -euo pipefail at the top of the script — this also matches the standard bash safety idiom; or (b) append || exit 1 to each curl call and to the python3 -c "..." verification. Option (b) is the exact pattern already used in the FP8 sibling benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300.sh:31:\n\nbash\npython3 - <<'PYEOF' || { echo "MiniMax M3 MSA contiguity patch failed" >&2; exit 1; }\n...\nPYEOF\n\n\nUsing the same pattern here keeps the two recipes consistent.

claude · 2026-06-25T01:48:54Z

+    - "Image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-7a67223; benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve"
+    - "Weights pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 (added MiniMax-M3-NVFP4 to launch_b300-nv.sh STAGED_MODELS); --block-size 128 (MSA), --language-model-only"
+    - "Sweeps tp 2/4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-1024"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX


🔴 The new perf-changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — a literal placeholder template token that was never filled in. This is PR #1928; replace XXX with 1928 before merge so the changelog entry stays navigable back to the PR (every other entry in the file uses a real numeric PR id).

Extended reasoning...

What's wrong

In perf-changelog.yaml at line 4194, the new entry for minimaxm3-fp4-b300-vllm ends with:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

The XXX is a literal placeholder template token that was clearly meant to be replaced with the real PR number before opening/merging, but never was.

Why this is a real (if small) defect

perf-changelog.yaml is a permanent, append-only history of perf-config changes. Each entry's pr-link field is the canonical back-reference from the changelog row to the PR that introduced it. Every existing entry in the file follows this convention with a real numeric PR id — for example the entry immediately preceding this one (line 4185) uses .../pull/1927, and other recent ones use .../pull/1865, .../pull/1762, etc. Merging with the literal XXX leaves a permanently broken backreference: the URL resolves to a 404 (GitHub PR ids are integers only), and there is no automated job in this repo that rewrites the placeholder at merge time. Once merged, fixing it requires a follow-up PR that touches the changelog history.

Why existing review didn't catch it

There is no schema validation or CI check enforcing that pr-link matches ^https://.../pull/[0-9]+$ — the file is hand-edited, so the placeholder slips through. The PR-comment template reminder bot (the only timeline comment so far) is a generic "please mirror upstream recipes" reminder and does not look at YAML content.

Impact

Cosmetic-only — no benchmark behavior is affected. But this is the changelog entry that future readers will use to find the PR that added minimaxm3-fp4-b300-vllm, and a broken link defeats that purpose. The fix is a single-character edit.

Fix

Replace line 4194:

- pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1928

Step-by-step proof

The PR metadata above shows this is PR Add MiniMax-M3 NVFP4 B300 single-node aggregated vLLM benchmark #1928 ("Add MiniMax-M3 NVFP4 B300 single-node aggregated vLLM benchmark").

The diff for perf-changelog.yaml adds, as the very last line, pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.

Open that URL — https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — and GitHub returns 404 (PR ids are integers; XXX is not a valid PR number).

Open https://github.com/SemiAnalysisAI/InferenceX/pull/1928 and it resolves to this PR.

Inspect the immediately preceding changelog entry (line 4185): pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1927 — a real numeric id, confirming the established convention is "real PR number, not a placeholder".

So the entry will merge with a permanently un-navigable link unless XXX is replaced with 1928 before merge.

github-actions · 2026-06-25T06:41:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28145381323
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28145381323

Ankur-singh requested a review from a team June 25, 2026 01:43

Ankur-singh requested review from jgangani and kedarpotdar-nv as code owners June 25, 2026 01:43

github-project-automation Bot added this to InferenceMAX Board Jun 25, 2026

Update perf-changelog pr-link for #1928

5576f40

claude Bot reviewed Jun 25, 2026

View reviewed changes

xinli-sw added the full-sweep-enabled label Jun 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add MiniMax-M3 NVFP4 B300 single-node aggregated vLLM benchmark#1928

Add MiniMax-M3 NVFP4 B300 single-node aggregated vLLM benchmark#1928
Ankur-singh wants to merge 2 commits into
mainfrom
minimaxm3-fp4-b300-vllm

Ankur-singh commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Ankur-singh commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

claude Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 25, 2026

Choose a reason for hiding this comment

What's wrong

Why this is a real (if small) defect

Why existing review didn't catch it

Impact

Fix

Step-by-step proof

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants