[NV] llm-d: exact upstream serve-line parity for mid-curve-megamoe
Match the DSV4-Pro FP4 GB200 1P+1D mid-curve run to the upstream
wide-ep-lws guide's exact vLLM config, taken from the upstream vLLM
'non-default args' dumps (prefill + decode) shared 2026-06-23.
Recipe (dsv4-fp4-gb200-mid-curve-megamoe.yaml):
- Re-add --no-enable-prefix-caching on both roles. Upstream runs
enable_prefix_caching=False. An earlier change stripped the flag on
the since-disproven 'cacheable prompts' hypothesis; because vLLM V1
enables prefix caching by default, that silently turned caching ON,
the opposite of upstream and of this no-prefix-cache comparison.
server.sh:
- kv_role is now role-aware (prefill=kv_producer, decode=kv_consumer)
to match upstream's clean P/D split instead of hardcoded kv_both.
Overridable via KV_ROLE_OVERRIDE=kv_both.
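For illustration only, a minimal sketch of how the role-aware selection
could look. The function name resolve_kv_role and the way the role
string is passed in are assumptions, not the actual server.sh code;
only the prefill/decode mapping and KV_ROLE_OVERRIDE come from this
commit.

```shell
# Hypothetical helper: map a serving role to a kv_role value.
# Assumes the role is passed as "prefill" or "decode".
resolve_kv_role() {
  local role="$1"
  # An explicit override wins, e.g. KV_ROLE_OVERRIDE=kv_both.
  if [ -n "${KV_ROLE_OVERRIDE:-}" ]; then
    printf '%s\n' "$KV_ROLE_OVERRIDE"
    return 0
  fi
  case "$role" in
    prefill) printf 'kv_producer\n' ;;
    decode)  printf 'kv_consumer\n' ;;
    *)
      # Fail loudly rather than fall back to kv_both.
      printf 'unknown role: %s\n' "$role" >&2
      return 1
      ;;
  esac
}
```

Failing on an unknown role, instead of defaulting to kv_both, keeps a
typo in the role name from silently restoring the old behaviour.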
nvidia-master.yaml:
- mid-curve-megamoe matrix entry RANDOM_RANGE_RATIO 0.8 -> 1.0 to
match upstream --random-range-ratio 1.0 (fixed 8192 prefill).
Net: prefill and decode serve-line args are now a literal match to
the upstream logs (enforce_eager, cumem allocator, gpu-mem 0.9/0.85,
max-num-seqs/batched-tokens/model-len unset, flashinfer autotune on,
prefix caching off, producer/consumer kv_role).
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>