minimaxm3-fp8-mi355x-vllm-disagg #1762

Merged: functionstackx merged 23 commits into main from feat/minimax-m3-mi355-disagg on Jun 24, 2026

@functionstackx functionstackx commented Jun 14, 2026

Collaborator

What

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) benchmark on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3):

  • Sweeps conc 1,2,4,8,16,32,64,128,256,512,1024 at both 1k1k and 8k1k, across four prefill/decode TP layouts per scenario:
    • 1P TP8 + 1D TP8 — conc 1..1024
    • 1P TP4 + 1D TP8 (asymmetric: smaller prefill, full-node decode) — conc 1..256
    • 1P TP4 + 1D TP4 (balanced half-node) — conc 64..1024
    • 2P TP4 + 1D TP8 (two half-node TP4 prefill workers → one full-node TP8 decode; num-worker 2, PREFILL_NODES=2, 3 nodes total) — conc 256,512,768,1024
  • Validates the MoRI-IO KV-transfer disagg pipeline end-to-end for M3
  • The 8k1k run marks one lm-eval (multi-node eval policy: 8k1k + conc ≥ 16) on the highest-max-conc layout (TP8+TP8, eval-conc = median = 128) to validate correctness
  • Per-worker TP is driven by the master-config prefill/decode.tp: server_vllm.sh sed-rewrites the --tensor-parallel-size 8 placeholder in models_vllm.yaml to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE (TP4 uses half an 8-GPU node; node counts set via PREFILL_NODES/DECODE_NODES — the 1P layouts use 2 nodes, the 2P TP4 layout uses 3)
  • New config key: minimaxm3-fp8-mi355x-vllm-disagg
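The per-worker TP rewrite described in the list above can be pictured roughly as follows. This is a minimal Python sketch of the substitution only: the real logic lives in server_vllm.sh and uses sed, and the function name `rewrite_tp` and the sample flag string are hypothetical.

```python
import re

# Placeholder that the recipe carries in models_vllm.yaml (per the PR text).
PLACEHOLDER = re.compile(r"--tensor-parallel-size\s+8\b")

def rewrite_tp(serve_flags: str, tp_size: int) -> str:
    """Swap the TP8 placeholder for the worker's computed TP size."""
    new_flags, count = PLACEHOLDER.subn(
        f"--tensor-parallel-size {tp_size}", serve_flags
    )
    if count != 1:
        # Refuse to launch with an ambiguous or missing placeholder.
        raise ValueError(f"expected exactly one TP placeholder, found {count}")
    return new_flags

# Hypothetical flag string; only the placeholder itself comes from the PR.
flags = "--block-size 128 --kv-cache-dtype fp8 --tensor-parallel-size 8"
prefill_flags = rewrite_tp(flags, 4)  # e.g. PREFILL_TP_SIZE=4
decode_flags = rewrite_tp(flags, 8)   # e.g. DECODE_TP_SIZE=8
```

Failing when the placeholder is absent is a design choice of this sketch, not something the PR states about the sed rewrite.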

Upstream MoRI-IO fixes: all three vLLM PRs merged

This PR runs inter-node disaggregation — prefill node(s) + a decode node, KV transferred across nodes over MoRI-IO. Its correctness (the 8k1k gsm8k eval) depends on MoRIIO fixes that were originally carried here as a runtime overlay against the day-zero minimax-m3 image. Per the upstream plan (tanpinsiang, 2026-06-20), the work was split into three staged vLLM PRs, all staged from this PR (#1762). All three required upstream PRs are now merged:

1. READ-mode mixed KV layouts — MERGED: vLLM #46039 "[ROCm][P/D] Support MiniMax-M3 mixed KV layouts in MoRIIO READ mode" (junkang1991, AMD; tracks vLLM issue #45885). The connector reused the first layer's offsets and assumed a single KV layout, but M3 registers three per-layer formats — separated [2, num_blocks, …], ROCm-interleaved [num_blocks, 2, …], and the rank-3 key-only indexer [num_blocks, block_size, head_dim] — so transfers read the wrong region (invisible to throughput; gsm8k 0.0008 token salad). The fix makes READ offsets per-layer / layout-aware via KVCacheSpec. Merged 2026-06-21; validated intra-node 1P1D TP4+TP4 GSM8K ≈ 0.955.

2. WRITE per-geometry offset caching — MERGED: vLLM #46290 "[ROCm][P/D] Fix MoRIIO WRITE mode for mixed KV layouts" (tanpinsiang). Scope: MoRIIOWriter._prepare_transfer_plan caches WRITE offsets per KV-cache geometry instead of one request-wide offset tuple — the WRITE half this PR's moriio_engine.py overlay already carries. Merged 2026-06-23.

3. Heterogeneous-TP rank mapping + ACK fan-in — MERGED: vLLM #46332 "[ROCm][P/D] Support MoRIIO heterogeneous TP fan-in" (tanpinsiang). Scope: remote TP rank mapping, READ notification target, plain ACK parsing, fan-in ACK counting, duplicate-ACK handling — what makes prefill-TP ≠ decode-TP across nodes work (the het-TP / dup-ACK fixes this PR's overlay carries). Merged 2026-06-23.
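To make the layout problem in item 1 concrete, here is a minimal sketch of how the start offset of one block differs across the three layouts named above. It assumes contiguous row-major tensors and counts offsets in elements; the function name and the layout labels are hypothetical, and the upstream fix derives this from KVCacheSpec rather than from string labels.

```python
from math import prod

def block_offsets(shape, block_id, layout):
    """Element offsets (from the tensor base) of one block in a contiguous cache."""
    if layout == "separated":      # [2, num_blocks, ...]: K/V axis outermost
        _, num_blocks, *rest = shape
        per_block = prod(rest)
        return {"k": block_id * per_block,
                "v": (num_blocks + block_id) * per_block}
    if layout == "interleaved":    # [num_blocks, 2, ...]: K/V axis inside each block
        _, _, *rest = shape
        per_block = prod(rest)
        return {"k": 2 * block_id * per_block,
                "v": (2 * block_id + 1) * per_block}
    if layout == "key_only":       # [num_blocks, block_size, head_dim]: no V
        _, *rest = shape
        return {"k": block_id * prod(rest)}
    raise ValueError(f"unknown layout: {layout}")

separated = block_offsets((2, 10, 4, 8), 3, "separated")
interleaved = block_offsets((10, 2, 4, 8), 3, "interleaved")
key_only = block_offsets((10, 4, 8), 3, "key_only")
```

With the same block id, `separated["k"]` and `interleaved["k"]` differ, which is why reusing one layer's offsets for every layer reads the wrong region without affecting throughput.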

Our stop-gap overlay bundles all three fixes so we can reuse the stock minimax-m3 image today: benchmarks/multi_node/amd_utils/patches/moriio/ (moriio_connector.py READ + moriio_engine.py WRITE + moriio_common.py per-geometry cache) + patches/moriio_heterogeneous_kv.py, auto-mounted by job.slurm when DOCKER_IMAGE_NAME contains minimax-m3 (MORIIO_KV_PATCH=skip to disable). Inter-node disagg gsm8k = strict-match 0.9583 / flexible-extract 0.9575, matching single-node. See patches/README.md.

Next unblock step: pick up a published minimax-m3 image that contains #46039, #46290, and #46332; once that image is available and validated, the patches/moriio/ overlay + job.slurm auto-mount can be dropped.

Layered on #1585 (remove vLLM-disagg MoRI patches)

This PR brings in #1585's MoRI-patch-removal infra (that PR is very stale vs main, so the changes are applied selectively rather than by merge):

  • amd_utils/{setup_deps.sh, server_vllm.sh, submit.sh, models_vllm.yaml}: taken from #1585 ([Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks). main is untouched here since the merge-base, so these equal main + the mori removal. Includes --all2all-backend mori → mori_low_latency for the existing M2.5/Kimi entries.
  • amd_utils/job.slurm: #1585's two vLLM-disagg hunks applied onto current main (keeping main's atom-disagg support): vllm-router image nightly-20260511-e667ebb → nightly-20260603-e667ebb, and drop the VLLM_MORIIO_CONNECTOR_READ_MODE env from the vllm-disagg container block.

M3 recipe

  • benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh — model-agnostic disagg boilerplate (byte-identical to the M2.5 disagg script; the launcher resolves the per-SKU script by name).
  • models_vllm.yaml MiniMax-M3-MXFP8 — per-worker serve flags: --block-size 128 (MSA sparse/index cache), --language-model-only (text-only benchmark), --kv-cache-dtype fp8 (gfx950), --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (MoE experts TP-sharded as in the single-node M3 recipe). The --tensor-parallel-size 8 is a placeholder rewritten per-worker at launch. Env: VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_BREAKABLE_CUDAGRAPH=0 VLLM_ENGINE_READY_TIMEOUT_S=3600.

Scope guard

perf-changelog.yaml and .github/configs/amd-master.yaml contain only M3 changes vs main.

Validation

  • YAML parses (models_vllm / amd-master / perf-changelog) ✓
  • validate_perf_changelog.py append-only gate → 1 appended entry, 0 pr-link corrections
  • generate_sweep_configs test-config → 6 disagg configs (3 layouts × {1k1k, 8k1k}); exactly 1 run-eval=true, on 8k1k TP8+TP8 with eval-conc 128; all 1k1k entries run-eval=false
  • launcher routes minimaxm3 / fp8 / vllm-disagg → benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh
  • process_changelog.py selects minimaxm3-fp8-mi355x-vllm-disagg
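The eval-conc value of 128 follows from the policy stated in this PR (8k1k only, conc ≥ 16, median of the eligible points). A minimal sketch of that arithmetic is below; `pick_eval_conc` is a hypothetical helper, not the repo's mark_eval_entries, and the use of `median_low` to break ties on even-length lists is an assumption.

```python
from statistics import median_low

MIN_EVAL_CONC = 16  # threshold named in the PR's commit messages

def pick_eval_conc(scenario, concs):
    """Return the concurrency to evaluate at, or None if nothing is eligible."""
    if scenario != "8k1k":
        return None  # 1k1k entries are never marked for eval
    eligible = sorted(c for c in concs if c >= MIN_EVAL_CONC)
    # median_low always returns a value that is actually in the sweep (assumption).
    return median_low(eligible) if eligible else None

full_sweep = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
eval_conc = pick_eval_conc("8k1k", full_sweep)
```

For the earlier conc 1,2,4,8,16 sweep the same rule yields 16, which matches the eval-conc=16 mentioned in the earlier commits.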

🤖 Generated with Claude Code


Note

Medium Risk
Touches disaggregated KV transfer and runtime patching of vLLM inside containers—incorrect offsets or ACK handling would corrupt accuracy or crash engines; benchmark-only scope limits production blast radius.

Overview
Adds minimaxm3-fp8-mi355x-vllm-disagg to amd-master.yaml: multi-node vLLM prefill/decode on vllm/vllm-openai-rocm:minimax-m3, sweeping 1k1k and 8k1k concurrency across four P/D layouts (1P TP8 + 1D TP8, 1P TP4 + 1D TP8, 1P TP4 + 1D TP4, 2P TP4 + 1D TP8), with 8k1k wired for one gsm8k eval on the TP8+TP8 layout.

MoRIIO correctness on the stock image: ships patches/moriio/moriio-minimax-m3-disagg.diff (KV layout, heterogeneous-TP rank mapping, dup-ack fan-in) and job.slurm auto-applies it inside the container for minimax-m3 images before server.sh runs; failed patch aborts the job. Documents the overlay in patches/README.md.

Serving / infra: new models_vllm.yaml MiniMax-M3-MXFP8 recipe and launcher script minimaxm3_fp8_mi355x_vllm-disagg.sh (cluster HF cache path for the ~414GB checkpoint). server_vllm.sh sets MoRIIO read_mode: true in kv_connector_extra_config instead of VLLM_MORIIO_CONNECTOR_READ_MODE. setup_deps.sh drops large in-container Python MoRIIO/scheduler patches (relies on image + unified diff). Kimi/M2.5 disagg flags use mori_low_latency; vllm-router default tag bumped to nightly-20260617-e667ebb. perf-changelog.yaml entry added.

Reviewed by Cursor Bugbot for commit 33b3fd2.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@functionstackx

Collaborator Author

First sweep failure — diagnosed & fixed

The first disagg sweep (run 27515119215) failed — not a recipe bug. The day-zero MiniMax-M3-MXFP8 checkpoint isn't staged on the MI355X disagg cluster, and the disagg path only searches pre-staged shared-storage paths (no in-container hf download like the single-node recipes):

FATAL: Model 'MiniMax-M3-MXFP8' not found. Searched:
  - /it-share/data/models--MiniMaxAI--MiniMax-M3-MXFP8
  - /it-share/data/MiniMax-M3-MXFP8
  - /nfsdata/hf_hub_cache-0/models--MiniMaxAI--MiniMax-M3-MXFP8
  - /nfsdata/hf_hub_cache-0/MiniMax-M3-MXFP8

server.sh exited immediately; the step then polled the (queued-then-dead) slurm job ~2h before failing.

Fix: amd_utils/job.slurm now auto-downloads the checkpoint when it isn't pre-staged, instead of a hard FATAL:

  • derives the HF repo id from hf_dir (models--org--name → org/name)
  • downloads into MODEL_DIR in HF cache layout (keeps MODEL_PATH under the -v ${MODEL_DIR}:/models mount / DOCKER_MODEL_PATH remap)
  • runs in a one-shot container of the serving image (host has no hf CLI), flock-serialized across prefill/decode nodes, idempotent re-check, 3 retries, huggingface-cli fallback, HF_TOKEN passthrough
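The repo-id derivation in the first bullet can be sketched as below. This is an illustrative Python version under the assumption that hf_dir follows the standard HF cache naming (models--org--name); the actual implementation is shell inside job.slurm, and the function name is hypothetical.

```python
def repo_id_from_cache_dir(hf_dir: str) -> str:
    """Turn an HF cache directory name into an 'org/name' repo id."""
    prefix = "models--"
    if not hf_dir.startswith(prefix):
        raise ValueError(f"not an HF cache directory name: {hf_dir!r}")
    # Split on the first '--' only, so any later '--' stays in the model name.
    org, sep, name = hf_dir[len(prefix):].partition("--")
    if not sep or not org or not name:
        raise ValueError(f"cannot derive org/name from {hf_dir!r}")
    return f"{org}/{name}"

repo_id = repo_id_from_cache_dir("models--MiniMaxAI--MiniMax-M3-MXFP8")
```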

Scoped to the vllm-disagg branch; pre-staged models (M2.5/Kimi) never reach this path. Re-running the sweep.

@functionstackx functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 8118fa3 to a4f66bd Compare June 15, 2026 01:45
@functionstackx functionstackx changed the title from "[Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1)" to "[Klaud Cold][Experimental][DNM] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test (1P TP8 + 1D TP8, conc 1)" on Jun 15, 2026
Comment thread benchmarks/multi_node/amd_utils/job.slurm Outdated

@functionstackx functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from a4f66bd to 409561f Compare June 15, 2026 02:37

@functionstackx functionstackx added the non-canary-full-sweep-enabled label (run the full sweep without the canary gate: full search space, no trim) and removed the sweep-enabled label on Jun 15, 2026
functionstackx added a commit that referenced this pull request Jun 15, 2026
…tness)

The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy
only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row
(same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true
(eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate
correctness. The conc-1 1k1k row stays the latency smoke test.

Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 7b33cf1 to 01ed5b8 Compare June 15, 2026 05:25
Comment thread benchmarks/multi_node/amd_utils/setup_deps.sh
functionstackx added a commit that referenced this pull request Jun 15, 2026
Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16
(1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
functionstackx added a commit that referenced this pull request Jun 15, 2026
Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios
(1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked
(eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

functionstackx and others added 13 commits June 21, 2026 16:35
vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older
dated tags are garbage-collected (manifest unknown), which makes `docker run`
fail with exit 125 on any node that has not already cached the image.

MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model:
sparse layers register a separate lightning-indexer cache (MLAAttentionSpec,
rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5,
fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives
block geometry from the first cache and reuses first_layer's offsets for every
layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is
transferred with fp8 K+V sizing and gets corrupted on the decode worker,
producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct).
This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.

- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry
  per layer (own shape/stride/dtype/rank) instead of from the first cache.
  Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.

NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a
separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO
connector cannot map, so M3 disagg accuracy is still broken pending a larger
multi-group / index-state transfer change. (Disabling sparse attention is not a
viable workaround: M3's fused QKV carries index_k weights, so dropping the
indexer breaks weight load.)

Refs #1762

Co-authored-by: Cursor <cursoragent@cursor.com>
…max-m3 image

The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the
FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this
vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every
disagg block transfer reads the wrong region. Invisible to throughput,
but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).

Instead of baking a fix into a rebuilt image (-hetkv) or carrying full
vendored copies of the patched files in-tree, carry just the 218-line
unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with
`patch -p1` against the vLLM package dir inside the container at startup,
ahead of the server launch. The repo is already bind-mounted into the
container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm
auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3"
(skippable with MORIIO_KV_PATCH=skip), mirroring the existing
mori_conn.py sglang hook. A failed apply aborts the container instead of
silently running unpatched.

Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode)
using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract
0.9560 (matches the baked image within noise), decode probe healthy.

- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… 8k1k

Widen the disagg sweep from conc 1,2,4,8,16 to
1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D
TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16)
so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…and 8k1k

Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep,
for both seq-len scenarios:
  - 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
  - 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024

Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh
sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the
computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is
needed (comment updated to say so). The multinode eval policy still marks exactly
one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d MoRIIO diff

Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which
bundles three layered fixes for the stock minimax-m3 vLLM image:
1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix,
   required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct
   prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and
   raises NotImplementedError for unsupported cases (prefill-TP > decode-TP,
   KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs
   per transfer_id and only frees KV blocks once all expected consumers ACK,
   preventing both the late-ACK EngineCore crash and KV reuse before slower
   decode ranks finish reading.
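Fix 3 above can be illustrated with a small bookkeeping class. This is a sketch of the counting idea only, with hypothetical names (`AckFanIn`, `register`, `on_ack`); it is not the code in the diff or in vLLM #46332, and it assumes each ACK carries a transfer id and the rank of the consumer that sent it.

```python
from collections import defaultdict

class AckFanIn:
    """Free KV blocks only after every expected consumer rank has ACKed."""

    def __init__(self):
        self._expected = {}             # transfer_id -> number of consumers
        self._seen = defaultdict(set)   # transfer_id -> ranks that ACKed

    def register(self, transfer_id, expected_consumers):
        self._expected[transfer_id] = expected_consumers

    def on_ack(self, transfer_id, consumer_rank) -> bool:
        """Return True exactly once, when the last expected consumer ACKs."""
        if transfer_id not in self._expected:
            return False  # late or duplicate ACK for a finished transfer: ignore
        self._seen[transfer_id].add(consumer_rank)  # set de-duplicates repeats
        if len(self._seen[transfer_id]) < self._expected[transfer_id]:
            return False  # slower decode ranks are still reading
        del self._expected[transfer_id]
        del self._seen[transfer_id]
        return True       # caller may now free the KV blocks

tracker = AckFanIn()
tracker.register("t1", expected_consumers=2)
results = [tracker.on_ack("t1", 0), tracker.on_ack("t1", 0),
           tracker.on_ack("t1", 1), tracker.on_ack("t1", 1)]
```

Tracking ranks in a set rather than a plain counter is what keeps a duplicate ACK from releasing blocks early, and ignoring unknown transfer ids is what avoids crashing on a late ACK.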

job.slurm and patches/README.md updated to reference the new diff name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks
in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc.
The previous patch used `return self.tp_rank` for the P>D branch, which
made decode rank 1 connect to prefill rank 1 (holds head0) instead of
prefill rank 2 (holds head1) — corrupting KV for all decode ranks except 0.

Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size),
the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps
each decode rank to the *first* prefill rank of its head group, which holds
the correct KV content via vLLM's replication scheme.
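The two branches described here can be put side by side in a short sketch. It assumes "local" means the decode worker and "remote" means the prefill worker, and that one TP size divides the other; the function name is hypothetical and this is not the patched `_remote_tp_rank` itself.

```python
def remote_tp_rank(local_rank: int, local_tp: int, remote_tp: int) -> int:
    """Map a decode (local) TP rank to the prefill (remote) rank it reads from."""
    if local_tp == remote_tp:
        return local_rank
    if local_tp > remote_tp:
        # D > P: several decode ranks share one prefill rank.
        if local_tp % remote_tp:
            raise ValueError("decode TP must be a multiple of prefill TP")
        return local_rank // (local_tp // remote_tp)
    # P > D: pick the first prefill rank of the matching head group.
    if remote_tp % local_tp:
        raise ValueError("prefill TP must be a multiple of decode TP")
    return local_rank * (remote_tp // local_tp)

p8_d4 = [remote_tp_rank(r, local_tp=4, remote_tp=8) for r in range(4)]
p4_d8 = [remote_tp_rank(r, local_tp=8, remote_tp=4) for r in range(8)]
```

For P8/D4, decode rank 1 maps to prefill rank 2, the case the previous `return self.tp_rank` got wrong.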

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ansion

The P>D fix added 4 lines to _remote_tp_rank but the hunk header still
said +1100,40; patch aborted with "malformed patch at line 79". Update
to +1100,44 to match the actual 6 context + 38 added lines.
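The hunk-header arithmetic is easy to check mechanically. The sketch below recomputes the old and new line counts of a unified-diff hunk from its body; `hunk_lengths` is a hypothetical helper and the sample body is synthetic, built only to match the 6 context + 38 added lines quoted above.

```python
def hunk_lengths(body_lines):
    """Return (old_len, new_len) for the body of one unified-diff hunk."""
    # Context lines (' ') count on both sides, '-' only old, '+' only new.
    # Markers such as '\ No newline at end of file' count on neither side.
    old_len = sum(1 for line in body_lines if line[:1] in (" ", "-"))
    new_len = sum(1 for line in body_lines if line[:1] in (" ", "+"))
    return old_len, new_len

# Synthetic hunk body: 6 context lines plus 38 added lines.
body = [" context"] * 6 + ["+added"] * 38
old_len, new_len = hunk_lengths(body)
```

A header whose second count disagrees with `new_len` is what makes patch reject the file as malformed.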

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The MoRIIO KV-layout patch was injected into the per-node container launch
via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer
srun bash -c "..." double-quoted string. Because the patch command value
contains spaces and the shell operators '<' and '||', the unquoted
expansion word-split the generated container script, truncating it right
after the word `patch` and silently dropping the patch arguments AND the
server.sh launch. The container then exited 0:0 within seconds, producing
no benchmark/eval output -> collect_latest_results found "No logs
directory" -> the launch step failed with exit 1 (all minimax-m3 disagg
jobs affected).

Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc
single quotes (no quote toggling), so the patch command stays intact and
its operators are parsed by the container shell. Validated end-to-end:
gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.

Co-authored-by: Cursor <cursoragent@cursor.com>
…1k & 8k1k)

Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an
8-GPU node) feeding one TP8 decode (DECODE_NODES=1) — 3 nodes total. Added to
both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still
one lm-eval on the 8k1k TP8+TP8 layout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the feat/minimax-m3-mi355-disagg branch from 1677806 to aad872a Compare June 21, 2026 20:36

The per-layer READ-offset fix this Python patcher applied to
moriio_connector.py is fully subsumed by the unified overlay
patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies
with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff
rewrites the exact lines the patcher searches for (the `first_layer`
single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing),
with a stronger geometry-memoized + heterogeneous-TP-aware version, so
the patcher's OLD1/OLD2 patterns no longer match and it already no-ops
("pattern not found; skipping") in the real flow. It's also the same
fix now upstreamed in vLLM #46039 (READ mixed KV layouts).

Drop the dead patcher and its setup_deps.sh hook so the diff is the
single source of truth. patches/README.md only documents the diff (no
reference to this patcher), so no README change is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
@chunfangamd chunfangamd removed the evals-only label (suppress throughput and run only eval jobs; combine with all-evals to expand selection) on Jun 22, 2026

@chunfangamd

Collaborator

@functionstackx All three related PRs have been merged.

PR 1: vllm-project/vllm#46039
PR 2: vllm-project/vllm#46290
PR 3: vllm-project/vllm#46332

- Co-work with Gupta, Ravi

All three MoRIIO fixes the in-tree overlay carried have merged upstream and now
ship in the ROCm nightly image:
  - vLLM #46039  READ-mode mixed KV-layout (axis-aware per-layer offsets)
  - vLLM #46290  WRITE-mode per-geometry offset caching
  - vLLM #46332  heterogeneous-TP rank mapping + ACK fan-in

Point minimaxm3-fp8-mi355x-vllm-disagg at
vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15
(vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove
the stop-gap overlay:
  - delete patches/moriio/moriio-minimax-m3-disagg.diff
  - drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
  - trim the moriio/ section from patches/README.md

Verified on the nightly image with NO patch across all four P/D layouts x
conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4,
2P4+1D8) -- matching the previously-patched results.

Refs #1762.

@functionstackx

Collaborator Author

/reuse-sweep-run

The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after
the #1862 entry), which violates the append-only changelog gate
("entry 511 changed; existing entries are immutable"). Move it to the
end of perf-changelog.yaml so existing entries stay byte-identical to
main and the new entry is a clean append.
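The append-only rule that rejected the mid-file insert amounts to a prefix check. A minimal sketch follows, assuming the changelog is compared as two lists of parsed entries; `check_append_only` is a hypothetical name and this is not the repo's validate_perf_changelog.py.

```python
def check_append_only(base_entries, head_entries):
    """Return the appended entries, or raise if any existing entry changed."""
    if len(head_entries) < len(base_entries):
        raise ValueError("entries were removed")
    for index, (old, new) in enumerate(zip(base_entries, head_entries)):
        if old != new:
            raise ValueError(
                f"entry {index} changed; existing entries are immutable"
            )
    # Everything past the shared prefix is a clean append.
    return head_entries[len(base_entries):]

base = ["entry-a", "entry-b"]
appended = check_append_only(base, ["entry-a", "entry-b", "entry-new"])
```

Inserting the new entry anywhere but the end shifts a later entry out of position, so the first shifted index fails the comparison even though no existing entry was edited.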

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@functionstackx

Collaborator Author

/reuse-sweep-run

Labels: all-evals (expand eval selection to every fixed-sequence config), non-canary-full-sweep-enabled (run the full sweep without the canary gate: full search space, no trim)
