Commit 9cea9a0

[NV] llm-d: exact upstream serve-line parity for mid-curve-megamoe
Match the DSV4-Pro FP4 GB200 1P+1D mid-curve run to the upstream wide-ep-lws guide's exact vLLM config, taken from the upstream vLLM 'non-default args' dumps (prefill + decode) shared 2026-06-23.

Recipe (dsv4-fp4-gb200-mid-curve-megamoe.yaml):
- Re-add --no-enable-prefix-caching on both roles. Upstream runs enable_prefix_caching=False; an earlier strip removed it on the since-disproven 'cacheable prompts' hypothesis, which (vLLM V1 defaults prefix caching ON) silently turned caching ON - the opposite of upstream and of this no-prefix-cache comparison.

server.sh:
- kv_role is now role-aware (prefill=kv_producer, decode=kv_consumer) to match upstream's clean P/D split instead of hardcoded kv_both. Overridable via KV_ROLE_OVERRIDE=kv_both.

nvidia-master.yaml:
- mid-curve-megamoe matrix entry RANDOM_RANGE_RATIO 0.8 -> 1.0 to match upstream --random-range-ratio 1.0 (fixed 8192 prefill).

Net: prefill and decode serve-line args are now a literal match to the upstream logs (enforce_eager, cumem allocator, gpu-mem 0.9/0.85, max-num-seqs/batched-tokens/model-len unset, flashinfer autotune on, prefix caching off, producer/consumer kv_role).

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
1 parent 3b28e61 commit 9cea9a0

3 files changed

Lines changed: 49 additions & 46 deletions

.github/configs/nvidia-master.yaml

Lines changed: 1 addition & 1 deletion
@@ -8613,7 +8613,7 @@ dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe:
  additional-settings:
  - "PREFILL_NODES=2"
  - "GPUS_PER_NODE=4"
- - "RANDOM_RANGE_RATIO=0.8"
+ - "RANDOM_RANGE_RATIO=1.0"
  - "CONFIG_FILE=dsv4-fp4-gb200-mid-curve-megamoe.yaml"
  - "BENCH_NUM_PROMPTS_MULTIPLIER=10"
  - "MAX_FAILURE_RATE=0.5"
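
The effect of the ratio change can be sketched with a little arithmetic. This is only an illustration under a stated assumption: that the benchmark samples input lengths from the interval [ratio * ISL, ISL], which is what the commit message's "fixed 8192 prefill" remark implies for a ratio of 1.0. Other benchmark tools define the ratio differently, so check the actual sampler before relying on these numbers. `isl_range` is a hypothetical helper name, not part of this repository.

```shell
#!/usr/bin/env bash
# ASSUMPTION: input lengths are drawn from [ratio * ISL, ISL].
# isl_range is a hypothetical helper used only for this illustration.
isl_range() {
  # $1 = nominal input sequence length, $2 = range ratio
  # %d truncates the lower bound toward zero.
  awk -v isl="$1" -v ratio="$2" 'BEGIN { printf "%d..%d\n", isl * ratio, isl }'
}

isl_range 8192 0.8   # old matrix entry: prompts vary in length
isl_range 8192 1.0   # new matrix entry: every prompt is 8192 tokens
```

Under that assumption, the old setting let prompts shrink to roughly 80% of 8192 tokens, while the new one pins every prompt at the full length, so the prefill cost per request no longer varies between runs.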

benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml

Lines changed: 33 additions & 44 deletions
@@ -91,48 +91,36 @@ dataLayer:
  prefill:
  tp: 1
  enable-expert-parallel: true
- # Tier 1+2 upstream-wide-ep-lws-style tuning (2026-06-16):
- # - dropped --enable-sleep-mode (carryover from earlier OOM fix; not needed at max-model-len=9280)
- # - dropped --no-enable-prefix-caching, --no-enable-flashinfer-autotune
- # - lowered --gpu-memory-utilization 0.95 -> 0.9
- # Re-added --enable-cumem-allocator (2026-06-19): on v0.23+ this puts the
- # KV cache in cumem memory, which is REQUIRED to transfer KV over MNNVL
- # (GB200 cross-node NVLink). Without it, cross-node KV falls off the
- # MNNVL path; combined with UCX_TLS lacking rc that meant TCP transfer
- # (per upstream wide-ep-lws owner) - the likely cause of the prefill
- # KV-handoff backup. (Older releases got cumem via --enable-sleep-mode.)
- #
- # Prefill now matches the dynamo-vllm reference recipe's prefill config in
- # full (3148/5336/6036 tok/s/GPU): see
- # benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve-megamoe.yaml.
- # Goal: test whether the prefill batch budget closes the throughput gap.
- # At ISL 8192 the default max-num-batched-tokens=8192 batched ONE prefill
- # per step (EPP showed RunningRequestsSize=1, WaitingQueueSize=177, KV
- # 6.7%), serializing admission at ~4 req/s and flattening throughput vs
- # concurrency (~2000 tok/s/GPU). 32768 lets prefill batch ~4 per step.
- # The batch budget could NOT be isolated: 32768 at gpu-mem 0.9 OOM'd on the
- # KV-block check, and at 0.95 (alone) hit a hard CUDA OOM during warmup
- # (32768-token activation peak + NCCL/NIXL/DeepGEMM runtime overran 184 GiB).
- # 32768 only fits with the reference's companion memory-reducers, so they
- # are restored together as a set:
- # - --max-num-batched-tokens 32768 (the prefill batch lever)
- # - --gpu-memory-utilization 0.95 (headroom for the larger batch)
- # - --max-num-seqs 16 (caps per-seq buffers/block tables)
- # - --no-enable-prefix-caching (random prompts -> ~0 hit; frees KV)
- # - --no-enable-flashinfer-autotune (avoids autotune workspace)
- # --enable-cumem-allocator is the v0.23 equivalent of the reference's
- # --enable-sleep-mode (KV in cumem for MNNVL transfer).
+ # Upstream-parity strip (2026-06-23): on the upstream wide-ep-lws owner's
+ # advice, dropped the InferenceX-specific prefill customizations so the
+ # engine config matches the guide (which beats us ~3x on the same
+ # workload). Dropped:
+ # - --max-model-len 9280 (use model default, as upstream)
+ # - --max-num-seqs 16 (leave uncapped, as upstream)
+ # - --max-num-batched-tokens 32768 (revert to default 8192, as upstream)
+ # - --no-enable-flashinfer-autotune (leave autotune ON, as upstream)
+ # KEPT --no-enable-prefix-caching: the upstream vLLM `non-default args` dump
+ # (2026-06-23) shows enable_prefix_caching=False on BOTH prefill and decode,
+ # so upstream runs prefix caching OFF. An earlier note here wrongly claimed
+ # upstream left it ON; that was based on the since-disproven "cacheable
+ # prompts" hypothesis (HANDOFF_upstream_workload_confirmed.md). vLLM V1
+ # defaults prefix caching ON, so we must pass --no-enable-prefix-caching to
+ # match upstream and to keep this a true no-prefix-cache comparison.
+ # --enable-sleep-mode was already replaced by --enable-cumem-allocator,
+ # which is the v0.23 equivalent (KV in cumem, REQUIRED for cross-node MNNVL
+ # KV transfer on GB200) and which upstream ALSO sets - so it stays.
+ # --gpu-memory-utilization reverted 0.95 -> 0.9 to match upstream exactly:
+ # the 0.95 was headroom for the now-removed 32768 batch, so it is no longer
+ # needed. With this, the prefill engine config is at full upstream parity
+ # (only kv_role kv_both vs producer/consumer remains, forced by the no-kube
+ # sidecar topology in server.sh).
  extra-args: >-
  --kv-cache-dtype fp8
- --max-model-len 9280
- --max-num-seqs 16
- --max-num-batched-tokens 32768
  --enforce-eager
- --gpu-memory-utilization 0.95
+ --gpu-memory-utilization 0.9
  --enable-cumem-allocator
- --no-async-scheduling
  --no-enable-prefix-caching
- --no-enable-flashinfer-autotune
+ --no-async-scheduling
  --block-size 256
  --tokenizer-mode deepseek_v4
  --moe-backend deep_gemm_mega_moe
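
The removed comment's reasoning about the prefill batch budget ("ONE prefill per step" at 8192, "~4 per step" at 32768) comes down to one integer division. The sketch below reproduces only that arithmetic; it deliberately ignores chunked prefill, scheduler overheads, and memory limits, so treat the result as an upper bound rather than a description of what the vLLM scheduler actually does. `prefills_per_step` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Upper bound on how many full-length prompts fit into one prefill step:
# the per-step token budget divided by the prompt length (integer division).
# prefills_per_step is a hypothetical helper, not part of the recipe.
prefills_per_step() {
  local budget="$1" isl="$2"
  echo $(( budget / isl ))
}

prefills_per_step 8192 8192    # default budget, restored by this commit
prefills_per_step 32768 8192   # the removed --max-num-batched-tokens 32768
```

With ISL 8192 the default budget admits a single prompt per step, which is the serialization the removed comment described; the commit accepts that in exchange for matching upstream's configuration exactly.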
@@ -167,20 +155,21 @@ prefill:
  decode:
  tp: 1
  enable-expert-parallel: true
- # Tier 1+2 upstream-wide-ep-lws-style tuning (2026-06-16):
- # - dropped --enable-sleep-mode
- # - dropped --no-enable-prefix-caching, --no-enable-flashinfer-autotune
- # - lowered --gpu-memory-utilization 0.9 -> 0.85
- # Re-added --enable-cumem-allocator (2026-06-19): KV cache in cumem is
- # required for MNNVL KV transfer on v0.23+ (see prefill note).
+ # Upstream-parity strip (2026-06-23): dropped --max-model-len 9280 to use
+ # the model default, as upstream does. (--no-enable-flashinfer-autotune and
+ # --enable-sleep-mode were already removed in earlier tuning.)
+ # --enable-cumem-allocator stays: required for MNNVL KV transfer on v0.23+
+ # and upstream sets it too. --gpu-memory-utilization 0.85 already matches
+ # upstream. --no-enable-prefix-caching matches upstream decode's
+ # enable_prefix_caching=False (see prefill note).
  extra-args: >-
  --kv-cache-dtype fp8
- --max-model-len 9280
  --max-num-seqs 512
  --max-num-batched-tokens 512
  --max-cudagraph-capture-size 512
  --gpu-memory-utilization 0.85
  --enable-cumem-allocator
+ --no-enable-prefix-caching
  --block-size 256
  --compilation-config {"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}
  --stream-interval 50
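
Because the comments in both hunks stress that a silently missing flag (prefix caching defaulting to ON) caused an earlier divergence, a small guard over the `extra-args` string may be useful. The following is a minimal sketch, not part of the repository: `has_flag` and `check_parity` are hypothetical helpers, and the two flag lists are simply the ones the hunk comments say are shared by prefill and decode (kept: `--no-enable-prefix-caching`, `--enable-cumem-allocator`; dropped: `--max-model-len`, `--no-enable-flashinfer-autotune`, `--enable-sleep-mode`).

```shell
#!/usr/bin/env bash
# has_flag ARGS FLAG: succeed if FLAG appears as a whole word in ARGS.
# The function body is a subshell so that `set -f` (no globbing while
# word-splitting ARGS) does not leak into the caller.
has_flag() (
  set -f
  for word in $1; do
    [[ "$word" == "$2" ]] && exit 0
  done
  exit 1
)

# check_parity ARGS: print one line per problem and return non-zero if
# any flag both roles must carry is missing, or any dropped flag is back.
check_parity() {
  local args="$1" flag status=0
  for flag in --no-enable-prefix-caching --enable-cumem-allocator; do
    has_flag "$args" "$flag" || { echo "missing: $flag"; status=1; }
  done
  for flag in --max-model-len --no-enable-flashinfer-autotune --enable-sleep-mode; do
    has_flag "$args" "$flag" && { echo "unexpected: $flag"; status=1; }
  done
  return "$status"
}

# Usage with an abbreviated copy of the decode extra-args from the hunk above.
decode_args='--kv-cache-dtype fp8 --max-num-seqs 512 --gpu-memory-utilization 0.85 --enable-cumem-allocator --no-enable-prefix-caching --block-size 256'
check_parity "$decode_args" && echo "decode: parity flags ok"
```

The check matches whole words only, so a value such as `9280` is never mistaken for a flag. It does not compare flag values (for example `--gpu-memory-utilization`), which would need the upstream dump itself.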

benchmarks/multi_node/llm-d/server.sh

Lines changed: 15 additions & 1 deletion
@@ -211,7 +211,21 @@ fi
  # flags are wrong for the single-node-per-instance case where DP is
  # contained inside one engine process.
  # ----------------------------------------------------------------
- KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}'
+ # kv_role: the upstream wide-ep-lws guide runs a clean producer/consumer
+ # split (prefill=kv_producer, decode=kv_consumer), confirmed from its vLLM
+ # `non-default args` dump (2026-06-23). We previously hardcoded kv_both on
+ # every rank, which allocates both send+recv KV buffers and is the one
+ # remaining serve-line difference vs upstream. Default now mirrors upstream
+ # by $ROLE; set KV_ROLE_OVERRIDE=kv_both to fall back to the old behavior
+ # if the no-kube pd-sidecar / NIXL handshake turns out to need it.
+ if [[ -n "${KV_ROLE_OVERRIDE:-}" ]]; then
+ KV_ROLE="$KV_ROLE_OVERRIDE"
+ elif [[ "$ROLE" == "prefill" ]]; then
+ KV_ROLE="kv_producer"
+ else
+ KV_ROLE="kv_consumer"
+ fi
+ KV_TRANSFER_CONFIG="{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"$KV_ROLE\",\"kv_load_failure_policy\":\"fail\"}"

  COMMON_ARGS=(
  --port "$VLLM_PORT"
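
To see what the new selection logic produces for each role, the branch can be exercised in isolation. The sketch below restates the hunk's logic as two standalone functions; `resolve_kv_role` and `kv_transfer_config` are hypothetical names introduced for illustration only, and server.sh itself sets `KV_ROLE` and `KV_TRANSFER_CONFIG` inline as shown in the diff.

```shell
#!/usr/bin/env bash
# Standalone restatement of the role selection added to server.sh.
# resolve_kv_role and kv_transfer_config are hypothetical helper names.
resolve_kv_role() {
  local role="$1"
  if [[ -n "${KV_ROLE_OVERRIDE:-}" ]]; then
    printf '%s\n' "$KV_ROLE_OVERRIDE"   # explicit escape hatch wins
  elif [[ "$role" == "prefill" ]]; then
    printf 'kv_producer\n'
  else
    printf 'kv_consumer\n'              # every non-prefill role lands here
  fi
}

kv_transfer_config() {
  printf '{"kv_connector":"NixlConnector","kv_role":"%s","kv_load_failure_policy":"fail"}\n' \
    "$(resolve_kv_role "$1")"
}

kv_transfer_config prefill
kv_transfer_config decode
KV_ROLE_OVERRIDE=kv_both kv_transfer_config prefill
```

One property worth noting: as in the diff, any `ROLE` value other than `prefill` (including a typo) resolves to `kv_consumer`, and `KV_ROLE_OVERRIDE` is passed through unvalidated, so a misspelled override only surfaces when vLLM parses the transfer config.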
