Skip to content

[AMD] Add MiniMax-M3-FP4 MI355X ATOMESH update 0623#1930

Open
seungrokj wants to merge 4 commits into
mainfrom
amd/m3_atom_pd_fp8_0623
Open

[AMD] Add MiniMax-M3-FP4 MI355X ATOMESH update 0623#1930
seungrokj wants to merge 4 commits into
mainfrom
amd/m3_atom_pd_fp8_0623

Conversation

@seungrokj

@seungrokj seungrokj commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Eliminate all hardcoded MODEL_NAME == "DeepSeek-V4-Pro" / per-model checks from server_atom.sh
  • All model-specific configuration (env vars, parallel flags, MTP flags, KV cache flags, HF overrides) now driven from models_atom.yaml using the same python3 yaml.safe_load pattern as server_vllm.sh
  • Add MiniMax-M3-MXFP4 and MiniMax-M3-MXFP8 entries to models_atom.yaml with EAGLE3 MTP flags
  • Image bump for minimaxm3-fp8-mi355x-atom-disagg: rocm/atom-dev:MiniMax-M3-20260622rocm/atom-dev:MiniMax-M3-20260623

Fields added to models_atom.yaml

Field Purpose
env Space-separated KEY=VALUE pairs exported unconditionally
tp_dp_flags Parallel flags for TP+DPA mode
tp_dp_env Env vars exported only in TP+DPA mode
ep_dp_flags Parallel flags for EP+DPA mode
ep_dp_env Env vars exported only in EP+DPA mode
mtp_flags Flags prepended to SPEC_ARGS before $DECODE_MTP_SIZE
kv_cache_flags Full --kv_cache_dtype flag string
hf_overrides JSON string passed to --hf-overrides

PR Review Checklist

  • Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this.
  • Verified that this PR passes evals. Please link to GitHub Action workflow that shows this.
  • Verified that speculative decoding PRs uses chat templates to align the AL distribution to real world
  • If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

🤖 Generated with Claude Code

seungrokj and others added 2 commits June 25, 2026 14:39
…els_atom.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…om.yaml-driven)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@seungrokj seungrokj changed the title [AMD] refactor server_atom.sh: drive model-specific config from models_atom.yaml [AMD] Add MiniMax-M3-FP4 MI355X ATOMESH update 0623 Jun 25, 2026
@seungrokj seungrokj added AMD all-evals Expand eval selection to every fixed-sequence config evals-only Suppress throughput and run only eval jobs; combine with all-evals to expand selection full-sweep-enabled labels Jun 25, 2026
Comment thread benchmarks/multi_node/amd_utils/models_atom.yaml
Comment on lines +80 to +94
# =============================================================================
# Model-Specific Configuration from YAML
# =============================================================================
# Load model-specific config from YAML (single parse for all fields)
eval "$(python3 -c "
import yaml
with open('${ATOM_WS_PATH}/models_atom.yaml') as f:
m = yaml.safe_load(f).get('${MODEL_NAME}', {})
print(f'MODEL_ENVS=\"{m.get(\"env\", \"\")}\"')
print(f'MODEL_TP_DP_FLAGS=\"{m.get(\"tp_dp_flags\", \"\")}\"')
print(f'MODEL_EP_DP_FLAGS=\"{m.get(\"ep_dp_flags\", \"\")}\"')
print(f'MODEL_TP_DP_ENV=\"{m.get(\"tp_dp_env\", \"\")}\"')
print(f'MODEL_EP_DP_ENV=\"{m.get(\"ep_dp_env\", \"\")}\"')
print(f'MODEL_MTP_FLAGS=\"{m.get(\"mtp_flags\", \"\")}\"')
print(f'MODEL_KV_ARG=\"{m.get(\"kv_cache_flags\", \"\")}\"')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 hf_overrides JSON quotes stripped by bash eval — breaks DeepSeek-V4-Pro startup. The Python f-string print(f'_HF_OVERRIDES="{m.get("hf_overrides", "")}"') interpolates the raw YAML value ({"use_index_cache":true,"index_topk_freq":4}) into an outer double-quoted bash assignment, so the inner " characters terminate the outer string. After eval, _HF_OVERRIDES holds {use_index_cache:true,index_topk_freq:4} — unquoted keys — and --hf-overrides '{use_index_cache:true,index_topk_freq:4}' is rejected as invalid JSON, blocking the documented Verify server_atom.sh launches correctly for DeepSeek-V4-Pro test plan item. Fix: emit the value with shlex.quote() (or json.dumps) in the Python so the JSON survives the bash eval intact.

Extended reasoning...

What the bug is

The new YAML-driven loader at server_atom.sh:80-94 runs a Python snippet whose output is captured and eval'd as bash. The hf_overrides field for DeepSeek-V4-Pro in models_atom.yaml is the JSON string {"use_index_cache":true,"index_topk_freq":4}. The emitter line is:

print(f'_HF_OVERRIDES="{m.get("hf_overrides", "")}"')

For DeepSeek-V4-Pro this prints literally:

_HF_OVERRIDES="{"use_index_cache":true,"index_topk_freq":4}"

The inner unescaped " characters terminate the outer double-quoted region. Bash treats the line as a concatenation of quoted and unquoted word segments — "{" + use_index_cache + ":true," + index_topk_freq + ":4}" — and assembles them into a single word with all the quotes stripped.

Step-by-step proof

Reproduced locally with the exact Python emitter and bash eval:

$ python3 -c 'm={"hf_overrides": "{\"use_index_cache\":true,\"index_topk_freq\":4}"}; print(f"_HF_OVERRIDES=\"{m.get(\"hf_overrides\", \"\")}\"")'
_HF_OVERRIDES="{"use_index_cache":true,"index_topk_freq":4}"

$ eval '_HF_OVERRIDES="{"use_index_cache":true,"index_topk_freq":4}"' && echo "[$_HF_OVERRIDES]"
[{use_index_cache:true,index_topk_freq:4}]

$ python3 -c 'import json; json.loads("{use_index_cache:true,index_topk_freq:4}")'
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

After the eval, the downstream line

HF_OVERRIDES_ARG="--hf-overrides '${_HF_OVERRIDES}'"

produces --hf-overrides '{use_index_cache:true,index_topk_freq:4}' — an invalid JSON literal with unquoted keys. The atom server's argparse / json.loads on --hf-overrides will reject this at startup.

Why existing code doesn't prevent it

The pre-PR code hard-coded the value as a bash string with backslash-escaped inner quotes:
HF_OVERRIDES_ARG="--hf-overrides '{\"use_index_cache\":true,\"index_topk_freq\":4}'"
That escaping is exactly what survives bash parsing, and it is what the new YAML-driven path loses. DeepSeek-V4-Pro is the only model in models_atom.yaml with a non-empty hf_overrides (the other models' YAML fields contain no " characters, so they are unaffected); the other emitted assignments (env, tp_dp_flags, etc.) are safe.

Impact

This regresses the dsv4-fp4-mi355x-atom-disagg recipe in amd-master.yaml (which sets MODEL_NAME=DeepSeek-V4-Pro and routes through server_atom.sh). The server will fail at startup when atom's argparse calls json.loads on the --hf-overrides argument — and this is precisely the path the PR's own test plan flags (Verify server_atom.sh launches correctly for DeepSeek-V4-Pro).

Fix

Quote the value in the Python emitter so the bash eval sees a properly-escaped literal. Either:

import shlex
print(f'_HF_OVERRIDES={shlex.quote(m.get("hf_overrides", ""))}')

(produces _HF_OVERRIDES='{"use_index_cache":true,"index_topk_freq":4}', which bash parses correctly), or write each value to a NUL-delimited side channel that bash reads with read -d '' instead of evaling arbitrary Python output.

Comment on lines 193 to +195
MTP : method=mtp num_speculative_tokens=${DECODE_MTP_SIZE}
xP/yD : ${xP} / ${yD}
KV cache : dtype=${KV_CACHE_DTYPE:-auto} block_size=${BLOCK_SIZE} mem_frac=${MEM_FRAC_STATIC}
KV cache : ${KV_CACHE_ARG:-none} block_size=${BLOCK_SIZE} mem_frac=${MEM_FRAC_STATIC}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: line 193's INFO banner prints the literal string MTP : method=mtp num_speculative_tokens=${DECODE_MTP_SIZE}, but mtp_flags is now YAML-driven and MiniMax-M3-MXFP4/MXFP8 use --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3. When those models run with SPEC_DECODING=mtp, the banner will misleadingly claim method=mtp. Pure log/cosmetic — the Spec args : ${SPEC_ARGS[*]} line immediately below prints the actual flags. Suggest dropping the hardcoded method=mtp (the Spec args line already covers it) or replacing with ${MODEL_MTP_FLAGS}.

Extended reasoning...

What\n\nIn benchmarks/multi_node/amd_utils/server_atom.sh the === Configuration === heredoc still contains a hardcoded line:\n\n\nMTP : method=mtp num_speculative_tokens=${DECODE_MTP_SIZE}\n\n\nThis line predates the YAML-driven refactor in this PR. With the new mtp_flags field, MODEL_MTP_FLAGS can be anything — for the two new MiniMax-M3 entries it is --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3 --num-speculative-tokens. So the banner can advertise method=mtp even when the server is actually being launched with EAGLE3 flags.\n\n### Step-by-step proof\n\n1. Set MODEL_NAME=MiniMax-M3-MXFP8, SPEC_DECODING=mtp, DECODE_MTP_SIZE=2.\n2. The YAML block (models_atom.yaml:34-39) supplies mtp_flags: --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3 --num-speculative-tokens, so MODEL_MTP_FLAGS is the EAGLE3 string.\n3. server_atom.sh:174-176 builds SPEC_ARGS=(${MODEL_MTP_FLAGS} "$DECODE_MTP_SIZE") → the actual server is launched with --method eagle3 --draft-model … --num-speculative-tokens 2.\n4. server_atom.sh:193 still prints MTP : method=mtp num_speculative_tokens=2 — factually wrong about method.\n5. server_atom.sh:199 immediately below prints Spec args : --method eagle3 --draft-model Inferact/MiniMax-M3-EAGLE3 --num-speculative-tokens 2, which is correct.\n\n### Why this doesn't break anything\n\nThe SPEC_ARGS array (the actual flags passed to python3 -m atom.entrypoints.openai_server) is built correctly from MODEL_MTP_FLAGS; the bug is exclusively in the banner. The Spec args line two lines below shows the truth, so an operator inspecting the log can see the real flags. Also note: the new minimaxm3-fp8-mi355x-atom-disagg recipe sets SPEC_DECODING=none and DECODE_MTP_SIZE=0, so the misleading line is not exercised by anything this PR enables today — it's a latent issue for whenever a MiniMax-M3 atom-disagg recipe is added with MTP on.\n\n### How to fix\n\nReplace line 193 with one of:\n\n\nMTP : ${MODEL_MTP_FLAGS} ${DECODE_MTP_SIZE}\n\n\nor simply drop the line — Spec args : ${SPEC_ARGS[*]} directly below already covers the same information accurately. Severity is nit because it is purely cosmetic and the correct values are visible one line lower.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

all-evals Expand eval selection to every fixed-sequence config AMD evals-only Suppress throughput and run only eval jobs; combine with all-evals to expand selection full-sweep-enabled

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant