minimaxm3-fp8-mi355x-vllm-disagg by functionstackx · Pull Request #1762 · SemiAnalysisAI/InferenceX

functionstackx · 2026-06-14T23:12:31Z

What

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) benchmark on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3):

Sweeps conc 1,2,4,8,16,32,64,128,256,512,1024 at both 1k1k and 8k1k, across four prefill/decode TP layouts per scenario:
- 1P TP8 + 1D TP8 — conc 1..1024
- 1P TP4 + 1D TP8 (asymmetric: smaller prefill, full-node decode) — conc 1..256
- 1P TP4 + 1D TP4 (balanced half-node) — conc 64..1024
- 2P TP4 + 1D TP8 (two half-node TP4 prefill workers → one full-node TP8 decode; num-worker 2, PREFILL_NODES=2, 3 nodes total) — conc 256,512,768,1024
Validates the MoRI-IO KV-transfer disagg pipeline end-to-end for M3
The 8k1k run marks one lm-eval (multi-node eval policy: 8k1k + conc ≥ 16) on the highest-max-conc layout (TP8+TP8, eval-conc = median = 128) to validate correctness
Per-worker TP is driven by the master-config prefill/decode.tp: server_vllm.sh sed-rewrites the --tensor-parallel-size 8 placeholder in models_vllm.yaml to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE (TP4 uses half an 8-GPU node; node counts set via PREFILL_NODES/DECODE_NODES — the 1P layouts use 2 nodes, the 2P TP4 layout uses 3)
New config key: minimaxm3-fp8-mi355x-vllm-disagg
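The placeholder rewrite described above is done with sed inside server_vllm.sh. The following is only a minimal Python sketch of the same idea, not the script's actual code. The function name `rewrite_tp` and the sample flag string are assumptions made for illustration.

```python
import re


def rewrite_tp(serve_flags: str, tp_size: int) -> str:
    """Replace the single --tensor-parallel-size flag with the per-worker TP size."""
    pattern = r"--tensor-parallel-size\s+\d+"
    new_flags, count = re.subn(
        pattern, f"--tensor-parallel-size {tp_size}", serve_flags
    )
    # Fail loudly if the placeholder is missing or duplicated, rather than
    # launching a worker with an unintended TP size.
    if count != 1:
        raise ValueError(f"expected exactly one TP flag, found {count}")
    return new_flags


# Hypothetical recipe flags; only the TP placeholder matters here.
flags = "--block-size 128 --kv-cache-dtype fp8 --tensor-parallel-size 8"
prefill_flags = rewrite_tp(flags, 4)  # e.g. PREFILL_TP_SIZE=4
decode_flags = rewrite_tp(flags, 8)   # e.g. DECODE_TP_SIZE=8
```

Raising on a missing or duplicated flag is a design choice of this sketch. A plain sed substitution would silently leave the line unchanged if the placeholder were absent.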

Upstream MoRI-IO fixes: all three vLLM PRs merged

This PR runs inter-node disaggregation — prefill node(s) + a decode node, KV transferred across nodes over MoRI-IO. Its correctness (the 8k1k gsm8k eval) depends on MoRIIO fixes that were originally carried here as a runtime overlay against the day-zero minimax-m3 image. Per the upstream plan (tanpinsiang, 2026-06-20), the work was split into three staged vLLM PRs, all staged from this PR (#1762). All three required upstream PRs are now merged:

1. READ-mode mixed KV layouts — MERGED: vLLM #46039 "[ROCm][P/D] Support MiniMax-M3 mixed KV layouts in MoRIIO READ mode" (junkang1991, AMD; tracks vLLM issue #45885). The connector reused the first layer's offsets and assumed a single KV layout, but M3 registers three per-layer formats — separated [2, num_blocks, …], ROCm-interleaved [num_blocks, 2, …], and the rank-3 key-only indexer [num_blocks, block_size, head_dim] — so transfers read the wrong region (invisible to throughput; gsm8k 0.0008 token salad). The fix makes READ offsets per-layer / layout-aware via KVCacheSpec. Merged 2026-06-21; validated intra-node 1P1D TP4+TP4 GSM8K ≈ 0.955.

2. WRITE per-geometry offset caching — MERGED: vLLM #46290 "[ROCm][P/D] Fix MoRIIO WRITE mode for mixed KV layouts" (tanpinsiang). Scope: MoRIIOWriter._prepare_transfer_plan caches WRITE offsets per KV-cache geometry instead of one request-wide offset tuple — the WRITE half this PR's moriio_engine.py overlay already carries. Merged 2026-06-23.

3. Heterogeneous-TP rank mapping + ACK fan-in — MERGED: vLLM #46332 "[ROCm][P/D] Support MoRIIO heterogeneous TP fan-in" (tanpinsiang). Scope: remote TP rank mapping, READ notification target, plain ACK parsing, fan-in ACK counting, duplicate-ACK handling — what makes prefill-TP ≠ decode-TP across nodes work (the het-TP / dup-ACK fixes this PR's overlay carries). Merged 2026-06-23.
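To make the mixed-layout problem in item 1 concrete, here is a minimal sketch of per-layout block offsets. It assumes contiguous row-major buffers. The function `block_offsets` and the layout labels are hypothetical names, not the MoRIIO connector's API.

```python
from math import prod


def block_offsets(layout: str, shape: tuple, block: int, itemsize: int) -> dict:
    """Return byte offsets of the K (and V, if present) region of one block.

    Assumes a contiguous row-major buffer; layout names are illustrative.
    """
    if layout == "separated":      # [2, num_blocks, ...]: K/V axis outer
        _, num_blocks, *rest = shape
        block_bytes = prod(rest) * itemsize
        return {"k": block * block_bytes, "v": (num_blocks + block) * block_bytes}
    if layout == "interleaved":    # [num_blocks, 2, ...]: K/V axis inner
        _, _, *rest = shape
        block_bytes = prod(rest) * itemsize
        return {"k": 2 * block * block_bytes, "v": (2 * block + 1) * block_bytes}
    if layout == "key_only":       # [num_blocks, block_size, head_dim]: no V
        _, *rest = shape
        return {"k": block * prod(rest) * itemsize}
    raise ValueError(f"unknown layout: {layout}")


# Same block index, three different byte offsets.
sep = block_offsets("separated", (2, 4, 2, 1, 8), block=1, itemsize=1)
inter = block_offsets("interleaved", (4, 2, 2, 1, 8), block=1, itemsize=1)
key = block_offsets("key_only", (4, 2, 8), block=1, itemsize=2)
```

With these toy shapes, block 1 starts at a different byte offset in each layout, which is why reusing the first layer's offsets for every layer reads the wrong region.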

Our stop-gap overlay bundles all three fixes so we can reuse the stock minimax-m3 image today: benchmarks/multi_node/amd_utils/patches/moriio/ (moriio_connector.py READ + moriio_engine.py WRITE + moriio_common.py per-geometry cache) + patches/moriio_heterogeneous_kv.py, auto-mounted by job.slurm when DOCKER_IMAGE_NAME contains minimax-m3 (MORIIO_KV_PATCH=skip to disable). Inter-node disagg gsm8k = strict-match 0.9583 / flexible-extract 0.9575, matching single-node. See patches/README.md.

Next unblock step: pick up a published minimax-m3 image that contains #46039, #46290, and #46332; once that image is available and validated, the patches/moriio/ overlay + job.slurm auto-mount can be dropped.

Layered on #1585 (remove vLLM-disagg MoRI patches)

This PR brings in #1585's MoRI-patch-removal infra (that PR is very stale vs main, so the changes are applied selectively rather than by merge):

amd_utils/{setup_deps.sh, server_vllm.sh, submit.sh, models_vllm.yaml} — taken from [Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks #1585 (main is untouched here since the merge-base, so these equal main + the mori removal). Includes --all2all-backend mori → mori_low_latency for the existing M2.5/Kimi entries.
amd_utils/job.slurm — [Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks #1585's two vLLM-disagg hunks applied onto current main (keeping main's atom-disagg support): vllm-router image nightly-20260511-e667ebb → nightly-20260603-e667ebb, and drop the VLLM_MORIIO_CONNECTOR_READ_MODE env from the vllm-disagg container block.

M3 recipe

benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh — model-agnostic disagg boilerplate (byte-identical to the M2.5 disagg script; the launcher resolves the per-SKU script by name).
models_vllm.yaml MiniMax-M3-MXFP8 — per-worker serve flags: --block-size 128 (MSA sparse/index cache), --language-model-only (text-only benchmark), --kv-cache-dtype fp8 (gfx950), --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (MoE experts TP-sharded as in the single-node M3 recipe). The --tensor-parallel-size 8 is a placeholder rewritten per-worker at launch. Env: VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_BREAKABLE_CUDAGRAPH=0 VLLM_ENGINE_READY_TIMEOUT_S=3600.

Scope guard

perf-changelog.yaml and .github/configs/amd-master.yaml contain only M3 changes vs main.

Validation

YAML parses (models_vllm / amd-master / perf-changelog) ✓
validate_perf_changelog.py append-only gate → 1 appended entry, 0 pr-link corrections ✓
generate_sweep_configs test-config → 6 disagg configs (3 layouts × {1k1k, 8k1k}); exactly 1 run-eval=true, on 8k1k TP8+TP8 with eval-conc 128; all 1k1k entries run-eval=false ✓
launcher routes minimaxm3 / fp8 / vllm-disagg → benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh ✓
process_changelog.py selects minimaxm3-fp8-mi355x-vllm-disagg ✓

🤖 Generated with Claude Code

Note

Medium Risk
Touches disaggregated KV transfer and runtime patching of vLLM inside containers—incorrect offsets or ACK handling would corrupt accuracy or crash engines; benchmark-only scope limits production blast radius.

Overview
Adds minimaxm3-fp8-mi355x-vllm-disagg to amd-master.yaml: multi-node vLLM prefill/decode on vllm/vllm-openai-rocm:minimax-m3, sweeping 1k1k and 8k1k concurrency across four P/D layouts (1P TP8 + 1D TP8, 1P TP4 + 1D TP8, 1P TP4 + 1D TP4, 2P TP4 + 1D TP8), with 8k1k wired for one gsm8k eval on the TP8+TP8 layout.

MoRIIO correctness on the stock image: ships patches/moriio/moriio-minimax-m3-disagg.diff (KV layout, heterogeneous-TP rank mapping, dup-ack fan-in) and job.slurm auto-applies it inside the container for minimax-m3 images before server.sh runs; failed patch aborts the job. Documents the overlay in patches/README.md.

Serving / infra: new models_vllm.yaml MiniMax-M3-MXFP8 recipe and launcher script minimaxm3_fp8_mi355x_vllm-disagg.sh (cluster HF cache path for the ~414GB checkpoint). server_vllm.sh sets MoRIIO read_mode: true in kv_connector_extra_config instead of VLLM_MORIIO_CONNECTOR_READ_MODE. setup_deps.sh drops large in-container Python MoRIIO/scheduler patches (relies on image + unified diff). Kimi/M2.5 disagg flags use mori_low_latency; vllm-router default tag bumped to nightly-20260617-e667ebb. perf-changelog.yaml entry added.

Reviewed by Cursor Bugbot for commit 33b3fd2. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-14T23:12:40Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-14T23:18:26Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515117946
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27515117946

github-actions · 2026-06-15T01:37:24Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515119215
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27515119215

functionstackx · 2026-06-15T01:42:49Z

First sweep failure — diagnosed & fixed

The first disagg sweep (run 27515119215) failed — not a recipe bug. The day-zero MiniMax-M3-MXFP8 checkpoint isn't staged on the MI355X disagg cluster, and the disagg path only searches pre-staged shared-storage paths (no in-container hf download like the single-node recipes):

FATAL: Model 'MiniMax-M3-MXFP8' not found. Searched:
  - /it-share/data/models--MiniMaxAI--MiniMax-M3-MXFP8
  - /it-share/data/MiniMax-M3-MXFP8
  - /nfsdata/hf_hub_cache-0/models--MiniMaxAI--MiniMax-M3-MXFP8
  - /nfsdata/hf_hub_cache-0/MiniMax-M3-MXFP8

server.sh exited immediately; the step then polled the (queued-then-dead) slurm job ~2h before failing.

Fix: amd_utils/job.slurm now auto-downloads the checkpoint when it isn't pre-staged, instead of a hard FATAL:

- derives the HF repo id from hf_dir (models--org--name → org/name)
- downloads into MODEL_DIR in HF cache layout (keeps MODEL_PATH under the -v ${MODEL_DIR}:/models mount / DOCKER_MODEL_PATH remap)
- runs in a one-shot container of the serving image (host has no hf CLI), flock-serialized across prefill/decode nodes, idempotent re-check, 3 retries, huggingface-cli fallback, HF_TOKEN passthrough
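The repo-id derivation and the pre-staged lookup can be sketched as follows. The real logic lives in job.slurm as shell, so this Python version is only an illustration. The function names and the search order (taken from the FATAL log above) are assumptions.

```python
from pathlib import Path
from typing import Iterable, Optional


def repo_id_from_hf_dir(hf_dir: str) -> str:
    """Map an HF cache directory name 'models--org--name' to 'org/name'."""
    prefix = "models--"
    if not hf_dir.startswith(prefix):
        raise ValueError(f"not an HF cache dir name: {hf_dir!r}")
    org, sep, name = hf_dir[len(prefix):].partition("--")
    if not sep or not org or not name:
        raise ValueError(f"cannot split org/name from: {hf_dir!r}")
    return f"{org}/{name}"


def find_staged_model(
    roots: Iterable[str], hf_dir: str, model_name: str
) -> Optional[Path]:
    """Return the first pre-staged model directory, or None if not staged."""
    for root in roots:
        # Per root, try the HF cache layout first, then the bare model name,
        # mirroring the order printed in the FATAL message.
        for candidate in (Path(root) / hf_dir, Path(root) / model_name):
            if candidate.is_dir():
                return candidate
    return None


repo_id = repo_id_from_hf_dir("models--MiniMaxAI--MiniMax-M3-MXFP8")
```

A `None` result from `find_staged_model` is the point at which the described fix would trigger the download instead of aborting.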

Scoped to the vllm-disagg branch; pre-staged models (M2.5/Kimi) never reach this path. Re-running the sweep.

github-actions · 2026-06-15T02:34:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27519206250
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27519206250

github-actions · 2026-06-15T02:54:45Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27520697241
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27520697241

github-actions · 2026-06-15T04:50:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27521167091
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27521167091

…tness) The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test. Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
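The eval-marking rule described in these commit messages can be sketched roughly as below. This is not the repository's mark_eval_entries implementation. The function `pick_eval_conc` is hypothetical, and using the lower median for even-length lists is an assumption.

```python
from statistics import median_low
from typing import List, Optional

MIN_EVAL_CONC = 16  # threshold named in the commit message


def pick_eval_conc(seq_len: str, concs: List[int]) -> Optional[int]:
    """Pick the concurrency to mark run-eval=true, or None if nothing qualifies."""
    if seq_len != "8k1k":          # only 8k1k entries are eligible
        return None
    eligible = sorted(c for c in concs if c >= MIN_EVAL_CONC)
    if not eligible:
        return None
    # Assumption: the median of the eligible points, lower one on ties.
    return median_low(eligible)


smoke = pick_eval_conc("1k1k", [1])
narrow = pick_eval_conc("8k1k", [1, 2, 4, 8, 16])
wide = pick_eval_conc("8k1k", [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
```

Under these assumptions the conc 1..16 sweep yields eval-conc 16 and the full 1..1024 sweep yields 128, which agrees with the values quoted in this PR.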

github-actions · 2026-06-15T05:31:12Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27525928087
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27525928087

vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older dated tags are garbage-collected (manifest unknown), which makes `docker run` fail with exit 125 on any node that has not already cached the image.

MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model: sparse layers register a separate lightning-indexer cache (MLAAttentionSpec, rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5, fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives block geometry from the first cache and reuses first_layer's offsets for every layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is transferred with fp8 K+V sizing and gets corrupted on the decode worker, producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct). This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.

- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry per layer (own shape/stride/dtype/rank) instead of from the first cache. Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.

NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO connector cannot map, so M3 disagg accuracy is still broken pending a larger multi-group / index-state transfer change. (Disabling sparse attention is not a viable workaround: M3's fused QKV carries index_k weights, so dropping the indexer breaks weight load.)

Refs #1762

Co-authored-by: Cursor <cursoragent@cursor.com>

…max-m3 image

The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every disagg block transfer reads the wrong region. Invisible to throughput, but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).

Instead of baking a fix into a rebuilt image (-hetkv) or carrying full vendored copies of the patched files in-tree, carry just the 218-line unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with `patch -p1` against the vLLM package dir inside the container at startup, ahead of the server launch. The repo is already bind-mounted into the container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3" (skippable with MORIIO_KV_PATCH=skip), mirroring the existing mori_conn.py sglang hook. A failed apply aborts the container instead of silently running unpatched.

Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode) using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract 0.9560 (matches the baked image within noise), decode probe healthy.

- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… 8k1k Widen the disagg sweep from conc 1,2,4,8,16 to 1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…and 8k1k

Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep, for both seq-len scenarios:
- 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
- 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024

Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is needed (comment updated to say so). The multinode eval policy still marks exactly one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d MoRIIO diff

Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which bundles three layered fixes for the stock minimax-m3 vLLM image:

1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix, required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and raises NotImplementedError for unsupported cases (prefill-TP > decode-TP, KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs per transfer_id and only frees KV blocks once all expected consumers ACK, preventing both the late-ACK EngineCore crash and KV reuse before slower decode ranks finish reading.

job.slurm and patches/README.md updated to reference the new diff name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
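The dup-ack fan-in behaviour in item 3 can be illustrated with a small tracker. This is a sketch under assumptions, not the patched MoRIIO code: the class `AckTracker`, its method names, and tracking ACKs by consumer rank are all hypothetical.

```python
from typing import Dict, Set


class AckTracker:
    """Frees a transfer only once every expected consumer has ACKed it."""

    def __init__(self) -> None:
        self._expected: Dict[str, int] = {}
        self._seen: Dict[str, Set[int]] = {}

    def register(self, transfer_id: str, expected_consumers: int) -> None:
        self._expected[transfer_id] = expected_consumers
        self._seen[transfer_id] = set()

    def on_ack(self, transfer_id: str, consumer_rank: int) -> bool:
        """Return True exactly once, when the last expected consumer ACKs."""
        if transfer_id not in self._expected:
            # Late or unknown ACK: ignore it instead of raising.
            return False
        seen = self._seen[transfer_id]
        seen.add(consumer_rank)  # a duplicate ACK from the same rank is a no-op
        if len(seen) < self._expected[transfer_id]:
            return False
        # All consumers are done; the KV blocks may now be released.
        del self._expected[transfer_id]
        del self._seen[transfer_id]
        return True


tracker = AckTracker()
tracker.register("t1", expected_consumers=2)  # e.g. two decode ranks read one prefill rank
results = [
    tracker.on_ack("t1", 0),  # first consumer
    tracker.on_ack("t1", 0),  # duplicate from the same rank
    tracker.on_ack("t1", 1),  # last consumer: safe to free
    tracker.on_ack("t1", 1),  # late ACK after free
]
```

Keying the seen-set by rank rather than keeping a bare counter is one way to make duplicate ACKs harmless. A counter-based variant would need separate duplicate detection.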

With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc. The previous patch used `return self.tp_rank` for the P>D branch, which made decode rank 1 connect to prefill rank 1 (holds head0) instead of prefill rank 2 (holds head1) — corrupting KV for all decode ranks except 0. Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size), the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps each decode rank to the *first* prefill rank of its head group, which holds the correct KV content via vLLM's replication scheme. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
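The two mapping rules quoted in this commit message (`tp_rank // ratio` for D>P, `tp_rank * ratio` for P>D) can be written as one function. This is a standalone sketch, not the patched `_remote_tp_rank` method. The function signature and the divisibility checks are assumptions.

```python
def remote_tp_rank(tp_rank: int, local_tp_size: int, remote_tp_size: int) -> int:
    """Map a local (decode) TP rank to the remote (prefill) rank to read from."""
    if local_tp_size == remote_tp_size:
        return tp_rank
    if local_tp_size > remote_tp_size:
        # D > P: several decode ranks share one prefill rank.
        if local_tp_size % remote_tp_size:
            raise ValueError("TP sizes must divide evenly")
        return tp_rank // (local_tp_size // remote_tp_size)
    # P > D: pick the first prefill rank of this decode rank's head group.
    if remote_tp_size % local_tp_size:
        raise ValueError("TP sizes must divide evenly")
    return tp_rank * (remote_tp_size // local_tp_size)


p8d4 = [remote_tp_rank(r, local_tp_size=4, remote_tp_size=8) for r in range(4)]
p4d8 = [remote_tp_rank(r, local_tp_size=8, remote_tp_size=4) for r in range(8)]
```

For P8/D4 this sends decode rank 1 to prefill rank 2, the case the commit message says was previously mapped to prefill rank 1.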

…ansion The P>D fix added 4 lines to _remote_tp_rank but the hunk header still said +1100,40; patch aborted with "malformed patch at line 79". Update to +1100,44 to match the actual 6 context + 38 added lines. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
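A stale hunk header like the one described here can be caught before `patch` runs by recounting the hunk body. The checker below is a hypothetical helper, not part of this PR. It relies only on the unified-diff convention that context lines start with a space, removals with `-`, and additions with `+`.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def hunk_counts_match(hunk: str) -> bool:
    """Check that a unified-diff hunk header agrees with its body line counts."""
    header, *body = hunk.splitlines()
    m = HUNK_RE.match(header)
    if m is None:
        raise ValueError(f"not a hunk header: {header!r}")
    # An omitted length in a unified-diff header means 1.
    old_len = int(m.group(2)) if m.group(2) is not None else 1
    new_len = int(m.group(4)) if m.group(4) is not None else 1
    # Context lines count on both sides; '-' only old, '+' only new.
    old_seen = sum(1 for line in body if line[:1] in (" ", "-"))
    new_seen = sum(1 for line in body if line[:1] in (" ", "+"))
    return (old_seen, new_seen) == (old_len, new_len)


good = hunk_counts_match("@@ -1,2 +1,3 @@\n a\n+b\n c")
stale = hunk_counts_match("@@ -1,2 +1,2 @@\n a\n+b\n c")
```

The second call mimics the situation in the commit message: a line was added to the body but the new-side length in the header was not bumped.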

The MoRIIO KV-layout patch was injected into the per-node container launch via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer srun bash -c "..." double-quoted string. Because the patch command value contains spaces and the shell operators '<' and '||', the unquoted expansion word-split the generated container script, truncating it right after the word `patch` and silently dropping the patch arguments AND the server.sh launch. The container then exited 0:0 within seconds, producing no benchmark/eval output -> collect_latest_results found "No logs directory" -> the launch step failed with exit 1 (all minimax-m3 disagg jobs affected).

Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc single quotes (no quote toggling), so the patch command stays intact and its operators are parsed by the container shell.

Validated end-to-end: gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.

Co-authored-by: Cursor <cursoragent@cursor.com>

…1k & 8k1k) Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an 8-GPU node) feeding one TP8 decode (DECODE_NODES=1) — 3 nodes total. Added to both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still one lm-eval on the 8k1k TP8+TP8 layout). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-22T05:26:14Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27916728634
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27916728634

The per-layer READ-offset fix this Python patcher applied to moriio_connector.py is fully subsumed by the unified overlay patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff rewrites the exact lines the patcher searches for (the `first_layer` single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing), with a stronger geometry-memoized + heterogeneous-TP-aware version, so the patcher's OLD1/OLD2 patterns no longer match and it already no-ops ("pattern not found; skipping") in the real flow. It's also the same fix now upstreamed in vLLM #46039 (READ mixed KV layouts). Drop the dead patcher and its setup_deps.sh hook so the diff is the single source of truth. patches/README.md only documents the diff (no reference to this patcher), so no README change is needed. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-06-23T00:00:25Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27968834654
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27968834654

chunfangamd · 2026-06-23T12:13:25Z

@functionstackx All three related PRs have been merged.

PR 1: vllm-project/vllm#46039
PR 2: vllm-project/vllm#46290
PR 3: vllm-project/vllm#46332

- Co-work with Gupta, Ravi

All three MoRIIO fixes the in-tree overlay carried have merged upstream and now ship in the ROCm nightly image:
- vLLM #46039 READ-mode mixed KV-layout (axis-aware per-layer offsets)
- vLLM #46290 WRITE-mode per-geometry offset caching
- vLLM #46332 heterogeneous-TP rank mapping + ACK fan-in

Point minimaxm3-fp8-mi355x-vllm-disagg at vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15 (vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove the stop-gap overlay:
- delete patches/moriio/moriio-minimax-m3-disagg.diff
- drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
- trim the moriio/ section from patches/README.md

Verified on the nightly image with NO patch across all four P/D layouts x conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4, 2P4+1D8) -- matching the previously-patched results.

Refs #1762.

…-disagg

github-actions · 2026-06-24T15:51:17Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28101174324
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28101174324

github-actions · 2026-06-24T22:11:24Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28101174324
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28101174324

functionstackx · 2026-06-24T22:30:45Z

/reuse-sweep-run

The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after the #1862 entry), which violates the append-only changelog gate ("entry 511 changed; existing entries are immutable"). Move it to the end of perf-changelog.yaml so existing entries stay byte-identical to main and the new entry is a clean append. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
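The append-only rule that rejected the mid-file insert can be expressed as a prefix check. This is a sketch of the idea only, not validate_perf_changelog.py. The function `first_changed_entry` is hypothetical and treats each changelog entry as an opaque comparable value.

```python
from typing import List, Optional


def first_changed_entry(old: List[str], new: List[str]) -> Optional[int]:
    """Return the index of the first existing entry that changed, else None.

    The new changelog passes the gate only if the old one is a prefix of it.
    """
    for i, (before, after) in enumerate(zip(old, new)):
        if before != after:
            return i
    if len(new) < len(old):
        return len(new)  # entries were removed from the end
    return None


main_entries = ["e1", "e2", "e3"]
appended = first_changed_entry(main_entries, ["e1", "e2", "e3", "m3-disagg"])
inserted = first_changed_entry(main_entries, ["e1", "m3-disagg", "e2", "e3"])
```

A mid-file insert shifts every later entry, so the check reports the insertion point as a changed entry even though no existing text was edited. That matches the "existing entries are immutable" failure described above.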

functionstackx · 2026-06-24T22:39:32Z

/reuse-sweep-run

functionstackx requested a review from a team June 14, 2026 23:12

functionstackx requested review from billishyahao and chunfangamd as code owners June 14, 2026 23:12

github-project-automation Bot added this to InferenceMAX Board Jun 14, 2026

functionstackx requested review from 1am9trash, seungrokj and yctseng0211 as code owners June 14, 2026 23:12

functionstackx added the sweep-enabled label Jun 14, 2026

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 8118fa3 to a4f66bd Compare June 15, 2026 01:45

functionstackx changed the title ~~[Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1)~~ [Klaud Cold][Experimental][DNM] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1) Jun 15, 2026

cursor Bot reviewed Jun 15, 2026

Comment thread benchmarks/multi_node/amd_utils/job.slurm Outdated

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from a4f66bd to 409561f Compare June 15, 2026 02:37

cursor Bot mentioned this pull request Jun 15, 2026

[AMD] dsv4-fp4-mi355x-sglang: switch fixed-seq-len search space to TP4 #1768

Merged

2 tasks

functionstackx added non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) and removed sweep-enabled labels Jun 15, 2026

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 7b33cf1 to 01ed5b8 Compare June 15, 2026 05:25

cursor Bot reviewed Jun 15, 2026

Comment thread benchmarks/multi_node/amd_utils/setup_deps.sh

functionstackx and others added 13 commits June 21, 2026 16:35

disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)

84c8d8e

Update the vLLM external router container

299c401

functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 1677806 to aad872a Compare June 21, 2026 20:36

chunfangamd removed the evals-only Suppress throughput and run only eval jobs; combine with all-evals to expand selection label Jun 22, 2026

Merge branch 'main' into feat/minimax-m3-mi355-disagg

33b3fd2

chunfangamd mentioned this pull request Jun 23, 2026

[AMD] Add MiniMax-M3-FP8 MI355X ATOMESH #1865

Merged

3 tasks

chunfangamd added 2 commits June 24, 2026 12:55

Merge remote-tracking branch 'origin/main' into feat/minimax-m3-mi355…

c9e2d56

…-disagg

Merge remote-tracking branch 'origin/main' into pr-1762-reuse-27706

5a68bc9

billishyahao mentioned this pull request Jun 25, 2026

[AMD] Add MiniMax-M3-FP8 MI355X ATOM EAGLE3 / non-EAGLE3 update 0623 #1916

Open

8 tasks

3 participants
