Add MiniMax-M3 NVFP4 B300 single-node vLLM benchmark (EAGLE3 spec decode) #1929
Changes from 2 commits
e0d970e
ec78fc5
f170554
98b33d0
15fdb0f
7d39420
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| # MiniMax-M3 NVFP4 B300 single-node vLLM recipe with EAGLE3 speculative | ||
| # decoding — same shape as minimaxm3_fp8_b300_mtp.sh but uses the | ||
| # nvidia/MiniMax-M3-NVFP4 checkpoint. Applies vllm-project/vllm PR #46380 | ||
| # (MiniMax-M3 modelopt NVFP4 support) from commit 6c08558 by overwriting the | ||
| # 3 changed files in the installed vLLM package before the server starts. | ||
|
|
||
| source "$(dirname "$0")/../../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| EP_SIZE \ | ||
| DP_ATTENTION \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| MAX_MODEL_LEN \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| # Apply vllm-project/vllm PR #46380 (Add MiniMax-M3 modelopt NVFP4 support, commit 6c08558). | ||
| # This patch is required for nvidia/MiniMax-M3-NVFP4: without it vLLM does not | ||
| # recognise the NVFP4 quant config and falls back to an unsupported path. | ||
| VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))") | ||
| for f in \ | ||
| model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \ | ||
| model_executor/layers/quantization/modelopt.py \ | ||
| model_executor/layers/quantization/utils/flashinfer_utils.py | ||
| do | ||
| curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" | ||
|
Contributor

🟡 The 3-file vLLM patch overlay loop (lines 25-32) runs […]

Extended reasoning...

The defect. Lines 25-32 fetch three vLLM source files from a pinned […]

    for f in \
      model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
      model_executor/layers/quantization/modelopt.py \
      model_executor/layers/quantization/utils/flashinfer_utils.py
    do
      curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
    done
    python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

The script has no […]

Why the validation doesn't catch it. The post-loop […]

Step-by-step proof. […]

Why this is real. Verified that the script contains no […]

Repo convention. The sister script […]

Fix. Minimal:

    curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" || exit 1

Or assert the other two modules in the verification:

    python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; from vllm.model_executor.layers.quantization import modelopt; from vllm.model_executor.layers.quantization.utils import flashinfer_utils; print('[nvfp4-patch] OK')"

(The second form only catches import-breaking download failures; the first is the strictly safer fix and matches the fp8 twin.)
|
||
| done | ||
| python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')" | ||
|
|
||
| DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3" | ||
|
|
||
| # The target weights are launched from MODEL_PATH (the b300 launcher points it | ||
| # at the pre-staged read-only /scratch/models/MiniMax-M3-NVFP4). The EAGLE3 | ||
| # draft is not pre-staged and must be downloaded, so it cannot live next to the | ||
| # read-only target — fetch it into the writable models dir (/data/models) | ||
| # instead. When MODEL_PATH is unset (stand-alone runs) fall back to the HF cache. | ||
| if [[ -n "${MODEL_PATH:-}" ]]; then | ||
| if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then | ||
| hf download "$MODEL" --local-dir "$MODEL_PATH" | ||
| fi | ||
| DRAFT_MODEL_PATH="/data/models/${DRAFT_MODEL##*/}" | ||
| if [[ ! -d "$DRAFT_MODEL_PATH" || -z "$(ls -A "$DRAFT_MODEL_PATH" 2>/dev/null)" ]]; then | ||
| hf download "$DRAFT_MODEL" --local-dir "$DRAFT_MODEL_PATH" | ||
| fi | ||
| else | ||
| hf download "$MODEL" | ||
| export MODEL_PATH="$MODEL" | ||
| hf download "$DRAFT_MODEL" | ||
| DRAFT_MODEL_PATH="$DRAFT_MODEL" | ||
| fi | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| nvidia-smi | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
|
|
||
| export VLLM_ENGINE_READY_TIMEOUT_S=3600 | ||
| export VLLM_FLOAT32_MATMUL_PRECISION=high | ||
|
|
||
| if [ "${DP_ATTENTION}" = "true" ]; then | ||
| PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel" | ||
| elif [ "$EP_SIZE" -gt 1 ]; then | ||
| PARALLEL_ARGS="--tensor-parallel-size=$TP --enable-expert-parallel" | ||
| else | ||
| PARALLEL_ARGS="--tensor-parallel-size=$TP" | ||
| fi | ||
|
|
||
| # use 3 speculative tokens for all configs for now | ||
| NUM_SPEC_TOKENS=3 | ||
|
|
||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" | ||
| fi | ||
| start_gpu_monitor | ||
|
|
||
| set -x | ||
| vllm serve "$MODEL_PATH" --served-model-name "$MODEL" --host 0.0.0.0 --port $PORT \ | ||
| $PARALLEL_ARGS \ | ||
| --gpu-memory-utilization 0.90 \ | ||
| --max-model-len $MAX_MODEL_LEN \ | ||
| --block-size 128 \ | ||
| --language-model-only \ | ||
| --max-cudagraph-capture-size 2048 \ | ||
| --max-num-batched-tokens "$((ISL * 2 ))" \ | ||
| --speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL_PATH\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS, \"attention_backend\": \"FLASH_ATTN\"}" \ | ||
| --stream-interval 20 --no-enable-prefix-caching \ | ||
| --trust-remote-code > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --trust-remote-code \ | ||
| --use-chat-template | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
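The `--speculative-config` argument in the diff is a JSON object written with hand-escaped quotes. As a rough sketch only, the same string could be assembled with `printf`. The helper name `build_spec_config` is hypothetical and is not part of `benchmark_lib.sh`; the keys and values are copied from the script. It assumes the draft model path contains no quotes or backslashes, since nothing is escaped.

```shell
# Hypothetical helper (not in benchmark_lib.sh): build the JSON passed to
# --speculative-config without backslash-escaping every quote by hand.
# Assumption: the draft path contains no '"' or '\' characters.
build_spec_config() {
    local draft_path="$1" num_tokens="$2"
    # %s takes the draft model path, %d the number of speculative tokens.
    printf '{"method": "eagle3", "model": "%s", "num_speculative_tokens": %d, "attention_backend": "FLASH_ATTN"}' \
        "$draft_path" "$num_tokens"
}
```

Under those assumptions, the server launch could pass `--speculative-config "$(build_spec_config "$DRAFT_MODEL_PATH" "$NUM_SPEC_TOKENS")"` in place of the escaped literal.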
Hmm, I get that this is a new model and we're trying to hit SOL, but can we at least wait until this is merged and makes it into the nightly?
Or at least some other official image.
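If the overlay stays in the script until an official image carries the change, the download failure mode raised in the first review comment could be handled by staging all files before installing any of them. The sketch below is one possible shape, not the PR author's code. `overlay_files` is a hypothetical helper, and the base URL, destination directory and file list are passed in by the caller.

```shell
# Hypothetical helper: download every overlay file into a staging directory
# and copy into the package only if all downloads succeeded, so a failed
# fetch never leaves the installed package half-patched.
overlay_files() {
    local base_url="$1" dest_dir="$2"
    shift 2
    local stage f
    stage="$(mktemp -d)" || return 1
    for f in "$@"; do
        mkdir -p "${stage}/$(dirname "$f")"
        # -f makes curl exit non-zero on HTTP errors; abort before touching dest_dir.
        if ! curl -fsSL "${base_url}/${f}" -o "${stage}/${f}"; then
            echo "[nvfp4-patch] download failed: ${f}" >&2
            rm -rf "$stage"
            return 1
        fi
    done
    # All downloads succeeded: install them over the existing files.
    for f in "$@"; do
        mkdir -p "${dest_dir}/$(dirname "$f")"
        cp "${stage}/${f}" "${dest_dir}/${f}" || { rm -rf "$stage"; return 1; }
    done
    rm -rf "$stage"
}
```

In the script this would be called as `overlay_files "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm" "$VLLM_DIR" <the three paths> || exit 1`. It guards only against failed downloads, not against files that download cleanly but are incompatible with the installed vLLM.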