Add MiniMax-M3 NVFP4 B300 single-node vLLM benchmark (EAGLE3 spec decode) #1929
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| # MiniMax-M3 NVFP4 B300 single-node vLLM recipe with EAGLE3 speculative | ||
| # decoding — same shape as minimaxm3_fp8_b300_mtp.sh but uses the | ||
| # nvidia/MiniMax-M3-NVFP4 checkpoint. Applies vllm-project/vllm PR #46380 | ||
| # (MiniMax-M3 modelopt NVFP4 support) from commit 6c08558 by overwriting the | ||
| # 3 changed files in the installed vLLM package before the server starts. | ||
|
|
||
| source "$(dirname "$0")/../../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| EP_SIZE \ | ||
| DP_ATTENTION \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| MAX_MODEL_LEN \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| # Apply vllm-project/vllm PR #46380 (Add MiniMax-M3 modelopt NVFP4 support, commit 6c08558). | ||
| # This patch is required for nvidia/MiniMax-M3-NVFP4: without it vLLM does not | ||
| # recognise the NVFP4 quant config and falls back to an unsupported path. | ||
| VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))") | ||
| for f in \ | ||
| model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \ | ||
| model_executor/layers/quantization/modelopt.py \ | ||
| model_executor/layers/quantization/utils/flashinfer_utils.py | ||
| do | ||
| curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" | ||
|
Check warning on line 32 in benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh
|
||
| done | ||
| python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')" | ||
|
|
||
| DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3" | ||
|
|
||
| # The target weights are launched from MODEL_PATH (the b300 launcher points it | ||
| # at the pre-staged read-only /scratch/models/MiniMax-M3-NVFP4). The EAGLE3 | ||
| # draft is not pre-staged and must be downloaded, so it cannot live next to the | ||
| # read-only target — fetch it into the writable models dir (/data/models) | ||
| # instead. When MODEL_PATH is unset (stand-alone runs) fall back to the HF cache. | ||
| if [[ -n "${MODEL_PATH:-}" ]]; then | ||
| if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then | ||
| hf download "$MODEL" --local-dir "$MODEL_PATH" | ||
| fi | ||
| DRAFT_MODEL_PATH="/data/models/${DRAFT_MODEL##*/}" | ||
| if [[ ! -d "$DRAFT_MODEL_PATH" || -z "$(ls -A "$DRAFT_MODEL_PATH" 2>/dev/null)" ]]; then | ||
| hf download "$DRAFT_MODEL" --local-dir "$DRAFT_MODEL_PATH" | ||
| fi | ||
| else | ||
| hf download "$MODEL" | ||
| export MODEL_PATH="$MODEL" | ||
| hf download "$DRAFT_MODEL" | ||
| DRAFT_MODEL_PATH="$DRAFT_MODEL" | ||
| fi | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| nvidia-smi | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
|
|
||
| export VLLM_ENGINE_READY_TIMEOUT_S=3600 | ||
| export VLLM_FLOAT32_MATMUL_PRECISION=high | ||
|
|
||
| if [ "${DP_ATTENTION}" = "true" ]; then | ||
| PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel" | ||
| elif [ "$EP_SIZE" -gt 1 ]; then | ||
| PARALLEL_ARGS="--tensor-parallel-size=$TP --enable-expert-parallel" | ||
| else | ||
| PARALLEL_ARGS="--tensor-parallel-size=$TP" | ||
| fi | ||
|
|
||
| # use 3 speculative tokens for all configs for now | ||
| NUM_SPEC_TOKENS=3 | ||
|
|
||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" | ||
| fi | ||
| start_gpu_monitor | ||
|
|
||
| set -x | ||
| vllm serve "$MODEL_PATH" --served-model-name "$MODEL" --host 0.0.0.0 --port $PORT \ | ||
| $PARALLEL_ARGS \ | ||
| --gpu-memory-utilization 0.90 \ | ||
| --max-model-len $MAX_MODEL_LEN \ | ||
| --block-size 128 \ | ||
| --language-model-only \ | ||
| --max-cudagraph-capture-size 2048 \ | ||
| --max-num-batched-tokens "$((ISL * 2 ))" \ | ||
| --speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL_PATH\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS, \"attention_backend\": \"FLASH_ATTN\"}" \ | ||
| --stream-interval 20 --no-enable-prefix-caching \ | ||
| --trust-remote-code > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --trust-remote-code \ | ||
| --use-chat-template | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4183,3 +4183,12 @@ | |
| - "server_atom.sh: fix _MAX_CONC assignment before cudagraph size check; gate ATOM_MOE_GU_ITLV/AITER_BF16_FP8_MOE_BOUND on DeepSeek-V4-Pro only" | ||
| - "Search space: ISL=8192 and ISL=1024, 1P1D TP4, conc 1-512" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1927 | ||
|
|
||
| - config-keys: | ||
| - minimaxm3-fp4-b300-vllm-mtp | ||
| description: | ||
| - "Add MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) B300 single-node aggregated vLLM benchmark with EAGLE3 speculative decoding (spec-decoding: mtp, 3 draft tokens via Inferact/MiniMax-M3-EAGLE3)" | ||
| - "Image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-7a67223; benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; prompts routed through the chat template" | ||
| - "Target weights pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 (added MiniMax-M3-NVFP4 to launch_b300-nv.sh STAGED_MODELS); EAGLE3 draft downloaded to the writable /data/models; --block-size 128 (MSA), --language-model-only" | ||
| - "Sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-512" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX | ||
|
Check warning on line 4194 in perf-changelog.yaml
|
||
|
Contributor
🟡 The perf-changelog entry for Extended reasoning...What the bug is. |
||
🟡 The 3-file vLLM patch overlay loop (lines 25-32) runs `curl -fsSL ... -o` without `set -e` and without `|| exit 1`, so a transient 5xx / rate-limit / commit-reachability failure on file #2 (`modelopt.py`) or file #3 (`flashinfer_utils.py`) returns non-zero and the loop continues silently. The post-patch `python3 -c` only imports `TrtLlmNvFp4ExpertsModular` from file #1, so a partial patch is undetected and `vllm serve` boots on the comment's "unsupported path". Match the sister recipe `minimaxm3_fp8_b300_mtp.sh` (which wraps its patch step with `|| { echo ...; exit 1; }`): either add `|| exit 1` to the curl, or assert all three modules import in the verification.

Extended reasoning...
The defect. Lines 25-32 fetch three vLLM source files from a pinned `vllm-project/vllm` commit and overwrite them in the installed package. The script has no `set -e` (`set -x` later is just command tracing) and no `|| exit 1` inside the loop. With `curl -f`, an HTTP 4xx/5xx exits curl with status 22 and leaves the target file untouched, so the original installed copy stays in place.

Why the validation doesn't catch it. The post-loop
`python3 -c` imports only `TrtLlmNvFp4ExpertsModular` from `trtllm_nvfp4_moe.py` (file #1). If file #2 (`modelopt.py`) or file #3 (`flashinfer_utils.py`) fails to download, that import still succeeds, the `[nvfp4-patch] OK` line prints, and the script proceeds to `vllm serve` with one or both of those modules unpatched. Per the script's own preamble: "without it vLLM does not recognise the NVFP4 quant config and falls back to an unsupported path", which is exactly the failure modeled by files #2/#3 being unpatched.

Step-by-step proof.
1. `curl -fsSL` for `trtllm_nvfp4_moe.py` succeeds → file #1 overwritten.
2. `curl -fsSL` for `modelopt.py` hits a transient 503 from `raw.githubusercontent.com` → curl exits 22, no write to `${VLLM_DIR}/.../modelopt.py`, original vLLM copy retained.
3. No `set -e`, no `|| exit 1` → continues to file #3.
4. `curl -fsSL` for `flashinfer_utils.py` succeeds → file #3 overwritten.
5. `python3 -c` imports from `trtllm_nvfp4_moe` (the file that WAS patched). Import succeeds, prints `[nvfp4-patch] OK`.
6. `vllm serve` boots with the new `trtllm_nvfp4_moe.py` calling into an unpatched `modelopt.py` → NVFP4 quant config not recognized, fallback path triggers, benchmark fails opaquely at serve/inference time instead of at the patch step.

Why this is real. Verified that the script contains no
`set -[eE]` or `set -o errexit`; the only `set -e` in `benchmark_lib.sh` is scoped to `run_agentic_replay_and_write_outputs` and doesn't propagate to callers. Verified curl 8.x behavior: `-f -o file` against a 404/5xx exits non-zero without writing the file. Pinned commit hashes on a stable CDN make this uncommon, but not zero: `raw.githubusercontent.com` does have transient 5xx windows.

Repo convention. The sister script
`benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300_mtp.sh:33` already gates its patch step with `python3 - <<PYEOF || { echo ...; exit 1; }`, an explicit fail-fast. This recipe should match.

Fix. Minimal: append `|| exit 1` to the `curl` call inside the loop.
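The fail-fast behaviour proposed here can be sketched offline as follows. This is an illustration under stated assumptions, not the PR's code: `overlay_files`, `fetch_cmd`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the recipe's `curl -fsSL ... -o ...` call so that the loop runs without network access. In the real script the guard would sit directly on the `curl` line and use `exit 1` rather than `return 1`.

```shell
#!/usr/bin/env bash
# Sketch only. fetch_cmd is a hypothetical stand-in for the recipe's
# `curl -fsSL <url> -o <dest>` call, injected so the loop runs offline.
overlay_files() {
    local fetch_cmd="$1" f
    shift
    for f in "$@"; do
        # Stop at the first failed fetch instead of continuing silently.
        "$fetch_cmd" "$f" || { echo "[nvfp4-patch] FAILED: $f" >&2; return 1; }
    done
    echo "[nvfp4-patch] fetched $# file(s)"
}

# Hypothetical fetcher that fails for modelopt.py, simulating a transient 503.
fake_fetch() {
    case "$1" in
        */modelopt.py) return 22 ;;
    esac
}

# Capture the status inside `if` so the sketch also runs under `set -e`.
if overlay_files fake_fetch \
    model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
    model_executor/layers/quantization/modelopt.py \
    model_executor/layers/quantization/utils/flashinfer_utils.py
then status=0; else status=$?; fi
echo "status=$status"
```

With the guard in place, the run stops at `modelopt.py` and reports status 1, so the third file is never fetched and the import check is never reached.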
Or assert the other two modules in the verification:

`python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; from vllm.model_executor.layers.quantization import modelopt; from vllm.model_executor.layers.quantization.utils import flashinfer_utils; print('[nvfp4-patch] OK')"`

(The second form only catches import-breaking download failures; the first is the strictly safer fix and matches the fp8 twin.)
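The silent-continue failure mode from the step-by-step proof can also be reproduced offline. The sketch below assumes a hypothetical `fake_curl` that returns 22 for `modelopt.py` (mimicking `curl -f` on an HTTP 503) and records the files it would have overwritten; `fake_curl`, `unguarded_loop`, and `patched` are illustrative names that do not appear in the PR.

```shell
#!/usr/bin/env bash
# Offline sketch. fake_curl is a hypothetical stand-in for
# `curl -fsSL <url> -o <dest>` that fails only for the second file.
patched=""
fake_curl() {
    case "$1" in
        modelopt.py) return 22 ;;   # simulate curl -f on an HTTP 503
    esac
    patched="$patched $1"           # record a successful overwrite
}

# Same shape as the recipe's loop: no `set -e`, no `|| exit 1`.
unguarded_loop() {
    local f
    for f in trtllm_nvfp4_moe.py modelopt.py flashinfer_utils.py; do
        fake_curl "$f"
    done
}

if unguarded_loop; then loop_status=0; else loop_status=$?; fi

echo "loop status: $loop_status"
echo "patched:$patched"
```

The loop reports status 0 even though one fetch failed, because a `for` loop's exit status is that of the last command it ran. Only two of the three files end up patched, which is the partial-overlay state that an import check on file #1 alone cannot see.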