Commit 3b28e61

[NV] llm-d: prefix-cache experiment (shared prefix + caching ON)
Tests whether our ~4 req/s ceiling is relieved when effective prefill is cheap - the leading explanation for the upstream gap now that the load generator (nyann Go client also capped ~3 req/s) and serving stack are ruled out. Prefill-only probe confirmed prefill compute is the wall at 8k ISL; this checks if a cached shared prefix shrinks that work.

- benchmark_lib.sh: run_benchmark_serving gains an optional --random-prefix-len (only appended when > 0; default 0 leaves all existing paths byte-identical). benchmark_serving.py already supports it - one fixed random prefix prepended to every request.
- server.sh: pass --random-prefix-len when BENCH_RANDOM_PREFIX_LEN is set (env-gated; normal sweep unchanged).
- job.slurm: forward BENCH_RANDOM_PREFIX_LEN into the container env.
- dsv4-fp4-gb200-mid-curve-megamoe-prefixcache.yaml: new recipe = the pre-reference-match baseline config WITH prefix caching ON (no --no-enable-prefix-caching), so the only delta vs our known ~4 req/s baseline is the shared-prefix workload - clean attribution.
- nvidia-master.yaml: dsv4-fp4-gb200-llm-d-vllm-prefixcache config-key with isl=512 unique suffix + BENCH_RANDOM_PREFIX_LEN=7680 shared prefix (8192 total, matches baseline) + osl=1024.

If req/s jumps well above ~4 with the 7680-token prefix cached, prefill compute is the wall and prefix caching is the lever - supporting the upstream-workload explanation. Default benchmark_serving client + ignore_eos forces a clean OSL=1024 (unlike nyann).

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
1 parent 640441f commit 3b28e61

5 files changed

Lines changed: 223 additions & 2 deletions

.github/configs/nvidia-master.yaml

Lines changed: 47 additions & 0 deletions
@@ -8716,6 +8716,53 @@ dsv4-fp4-gb200-llm-d-vllm-nyann:
           - "DECODE_NODES=2"
           - "GPUS_PER_NODE=4"

+# Prefix-cache experiment variant of dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe.
+# Uses the prefix-cache recipe (server-side prefix caching ON) and a
+# shared-prefix workload: BENCH_RANDOM_PREFIX_LEN=7680 prepends a fixed 7680-
+# token prefix to every request, with isl=512 the unique suffix (7680+512=8192
+# total, same as baseline; osl=1024). With caching ON the shared span is a hit
+# after warmup, so effective prefill drops ~16x to the suffix. If req/s jumps
+# vs the ~4 req/s baseline, prefill-compute is the wall and prefix caching is
+# the lever (default benchmark_serving client, ignore_eos forces OSL). Dispatch
+# with --no-evals.
+dsv4-fp4-gb200-llm-d-vllm-prefixcache:
+  image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23
+  model: deepseek-ai/DeepSeek-V4-Pro
+  model-prefix: dsv4
+  runner: gb200
+  precision: fp4
+  framework: llm-d-vllm
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 512
+      osl: 1024
+      search-space:
+      - spec-decoding: "none"
+        conc-list: [256, 512, 1024]
+        prefill:
+          num-worker: 1
+          tp: 1
+          ep: 8
+          dp-attn: true
+          additional-settings:
+          - "PREFILL_NODES=2"
+          - "GPUS_PER_NODE=4"
+          - "RANDOM_RANGE_RATIO=0.8"
+          - "CONFIG_FILE=dsv4-fp4-gb200-mid-curve-megamoe-prefixcache.yaml"
+          - "BENCH_NUM_PROMPTS_MULTIPLIER=10"
+          - "MAX_FAILURE_RATE=0.5"
+          - "BENCH_RANDOM_PREFIX_LEN=7680"
+        decode:
+          num-worker: 1
+          tp: 1
+          ep: 8
+          dp-attn: true
+          additional-settings:
+          - "DECODE_NODES=2"
+          - "GPUS_PER_NODE=4"
+
 # MTP2 variant of dsv4-fp4-gb200-dynamo-vllm. Uses the vLLM 0.20.1 image
 # and hand-picked 8k/1k Pareto points mirrored from NVIDIA/srt-slurm.
 dsv4-fp4-gb200-dynamo-vllm-mtp2:
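
The sizing quoted in the config-key comment (7680+512=8192 total, ~16x less prefill) can be checked with a few lines of shell arithmetic. This is a minimal sketch and not part of the commit: the function names are hypothetical, and it uses nominal lengths only, ignoring RANDOM_RANGE_RATIO and assuming the whole shared prefix is a cache hit.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in the commit): nominal sizing of the
# shared-prefix workload described in the config-key comment.

# Total prompt length = shared prefix + unique suffix.
total_input_len() {
    echo $(( $1 + $2 ))
}

# Nominal prefill reduction if the entire prefix is served from cache:
# tokens prefilled without caching divided by tokens prefilled with it.
# Assumes suffix_len > 0.
prefill_reduction() {
    local prefix_len=$1 suffix_len=$2
    echo $(( (prefix_len + suffix_len) / suffix_len ))
}

prefix_len=7680   # BENCH_RANDOM_PREFIX_LEN
suffix_len=512    # isl

echo "total input tokens: $(total_input_len "$prefix_len" "$suffix_len")"
echo "nominal prefill reduction: $(prefill_reduction "$prefix_len" "$suffix_len")x"
```

With the values from the config-key this prints 8192 total input tokens and a 16x nominal reduction, matching the comment.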

benchmarks/benchmark_lib.sh

Lines changed: 13 additions & 0 deletions
@@ -219,6 +219,7 @@ run_benchmark_serving() {
     local server_pid=""
     local tokenizer=""
     local tokenizer_mode=""
+    local random_prefix_len=0

     while [[ $# -gt 0 ]]; do
         case $1 in
@@ -250,6 +251,10 @@ run_benchmark_serving() {
                 random_range_ratio="$2"
                 shift 2
                 ;;
+            --random-prefix-len)
+                random_prefix_len="$2"
+                shift 2
+                ;;
             --num-prompts)
                 num_prompts="$2"
                 shift 2
@@ -382,6 +387,14 @@ run_benchmark_serving() {
         --result-filename "$result_filename.json"
     )

+    # Optional shared prefix: prepend a fixed random prefix of N tokens to
+    # every request. With server-side prefix caching enabled this makes the
+    # shared portion a cache hit after warmup, so effective prefill shrinks to
+    # the unique suffix. Only added when > 0; default 0 leaves behavior intact.
+    if [[ "${random_prefix_len:-0}" -gt 0 ]]; then
+        benchmark_cmd+=(--random-prefix-len "$random_prefix_len")
+    fi
+
     if [[ -n "$endpoint" ]]; then
         benchmark_cmd+=(--endpoint "$endpoint")
     fi
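
The three hunks above follow one pattern: declare a default of 0, parse the optional flag, and append it to the command array only when it is positive. A reduced sketch of that pattern is below. `build_benchmark_cmd` is a hypothetical stand-in for `run_benchmark_serving` with most flags omitted, and it prints the command rather than running it.

```bash
#!/usr/bin/env bash
# Simplified stand-in for run_benchmark_serving (hypothetical name and a
# reduced flag set); prints the assembled command instead of executing it.
build_benchmark_cmd() {
    local random_prefix_len=0
    local num_prompts=""

    while [[ $# -gt 0 ]]; do
        case $1 in
            --random-prefix-len)
                random_prefix_len="$2"
                shift 2
                ;;
            --num-prompts)
                num_prompts="$2"
                shift 2
                ;;
            *)
                echo "unknown argument: $1" >&2
                return 1
                ;;
        esac
    done

    local benchmark_cmd=(python3 benchmark_serving.py --num-prompts "$num_prompts")

    # Appended only when > 0, so the default path builds the same command
    # as before the flag existed.
    if [[ "${random_prefix_len:-0}" -gt 0 ]]; then
        benchmark_cmd+=(--random-prefix-len "$random_prefix_len")
    fi

    echo "${benchmark_cmd[*]}"
}

build_benchmark_cmd --num-prompts 100
build_benchmark_cmd --num-prompts 100 --random-prefix-len 7680
```

The first call prints a command without `--random-prefix-len`; the second appends the flag and its value. Building the command as an array keeps each argument a separate word without relying on string splitting.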
dsv4-fp4-gb200-mid-curve-megamoe-prefixcache.yaml

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
+# DeepSeek-V4-Pro (FP4) on GB200, MegaMOE mid-curve 1P+1D - PREFIX-CACHE
+# experiment variant of dsv4-fp4-gb200-mid-curve-megamoe.yaml.
+#
+# Hypothesis under test: our throughput is pinned at ~4 req/s because every
+# request does a full 8k-token prefill, and prefill compute is the wall
+# (confirmed: prefill-only direct probe also caps ~3.84 req/s). The upstream
+# wide-ep-lws guide may report far higher throughput because its effective
+# prefill is cheap - i.e. a large shared prefix served from the prefix cache.
+#
+# This recipe enables server-side prefix caching (prefix caching is ON here -
+# the --no-enable-prefix-caching flag from the reference-match variant is
+# removed). Paired with BENCH_RANDOM_PREFIX_LEN in the config-key, the bench
+# prepends a fixed shared prefix to every request; after warmup that span is a
+# cache hit, so effective prefill shrinks to the unique suffix. If req/s jumps
+# vs the ~4 req/s baseline, prefill-compute is the wall AND prefix caching is
+# the lever (supporting the upstream-workload explanation of the gap).
+#
+# Engine config is the pre-reference-match baseline (prefix caching ON,
+# default max-num-batched-tokens, gpu-mem 0.9) so the ONLY difference vs our
+# known ~4 req/s baseline run is the shared-prefix workload - clean attribution.
+# Topology unchanged: 1 prefill + 1 decode, TP=1 DP=8 EP=8 DEP8, 16 GPUs.
+#
+# Selected via additional-settings:
+# CONFIG_FILE=dsv4-fp4-gb200-mid-curve-megamoe-prefixcache.yaml
+# PREFILL_NODES=2 DECODE_NODES=2 GPUS_PER_NODE=4 BENCH_RANDOM_PREFIX_LEN=7680
+
+# ---- EPP scheduling config ----
+apiVersion: llm-d.ai/v1alpha1
+kind: EndpointPickerConfig
+
+plugins:
+- name: file-disc
+  type: file-discovery
+  parameters:
+    path: /tmp/endpoints.yaml
+    watchFile: false
+
+- type: disagg-headers-handler
+- type: always-disagg-pd-decider
+- type: disagg-profile-handler
+  parameters:
+    deciderPluginName: always-disagg-pd-decider
+- type: prefill-filter
+- type: decode-filter
+- type: prefix-cache-scorer
+- type: queue-scorer
+- type: kv-cache-utilization-scorer
+- type: active-request-scorer
+- type: max-score-picker
+
+schedulingProfiles:
+- name: prefill
+  plugins:
+  - pluginRef: prefill-filter
+  - pluginRef: prefix-cache-scorer
+    weight: 3
+  - pluginRef: queue-scorer
+    weight: 2
+  - pluginRef: kv-cache-utilization-scorer
+    weight: 2
+  - pluginRef: max-score-picker
+- name: decode
+  plugins:
+  - pluginRef: decode-filter
+  - pluginRef: active-request-scorer
+    weight: 2
+  - pluginRef: prefix-cache-scorer
+    weight: 3
+  - pluginRef: max-score-picker
+
+dataLayer:
+  discovery:
+    pluginRef: file-disc
+
+# ---- Per-role vLLM flags ----
+# Baseline engine config WITH prefix caching ON (no --no-enable-prefix-caching).
+# --enable-cumem-allocator kept (KV in cumem for MNNVL transfer on v0.23+);
+# UCX_TLS includes rc so cross-node KV uses MNNVL/IB, never TCP.
+prefill:
+  tp: 1
+  enable-expert-parallel: true
+  extra-args: >-
+    --kv-cache-dtype fp8
+    --max-model-len 9280
+    --enforce-eager
+    --gpu-memory-utilization 0.9
+    --enable-cumem-allocator
+    --no-async-scheduling
+    --block-size 256
+    --tokenizer-mode deepseek_v4
+    --moe-backend deep_gemm_mega_moe
+    --enable-ep-weight-filter
+    --no-disable-hybrid-kv-cache-manager
+    --numa-bind
+  env:
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    NCCL_P2P_LEVEL: "NVL"
+    UCX_MEMTYPE_CACHE: "n"
+    UCX_MEMTYPE_REG_WHOLE: "n"
+    UCX_TLS: "cuda_copy,cuda_ipc,rc,tcp"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "y"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "1024"
+    VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE: "2048"
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    TORCH_SYMMMEM: "NVSHMEM"
+    VLLM_SKIP_P2P_CHECK: "1"
+    VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
+    VLLM_USE_DEEP_GEMM: "1"
+    NVIDIA_GDRCOPY: "enabled"
+    VLLM_HTTP_TIMEOUT_KEEP_ALIVE: "120"
+
+decode:
+  tp: 1
+  enable-expert-parallel: true
+  extra-args: >-
+    --kv-cache-dtype fp8
+    --max-model-len 9280
+    --max-num-seqs 512
+    --max-num-batched-tokens 512
+    --max-cudagraph-capture-size 512
+    --gpu-memory-utilization 0.85
+    --enable-cumem-allocator
+    --block-size 256
+    --compilation-config {"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}
+    --stream-interval 50
+    --tokenizer-mode deepseek_v4
+    --moe-backend deep_gemm_mega_moe
+    --enable-ep-weight-filter
+    --no-disable-hybrid-kv-cache-manager
+  env:
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    NCCL_P2P_LEVEL: "NVL"
+    UCX_MEMTYPE_CACHE: "n"
+    UCX_MEMTYPE_REG_WHOLE: "n"
+    UCX_TLS: "cuda_copy,cuda_ipc,rc,tcp"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "y"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    TORCH_SYMMMEM: "NVSHMEM"
+    VLLM_SKIP_P2P_CHECK: "1"
+    VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
+    VLLM_USE_DEEP_GEMM: "1"
+    NVIDIA_GDRCOPY: "enabled"
+
+# ---- SLURM resource directives ----
+slurm:
+  time_limit: "08:00:00"
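
Two numeric settings in this recipe interact with the shared-prefix workload: `--block-size 256` and `--max-model-len 9280`. The sketch below checks them against the workload sizes from the config-key. The helper names are hypothetical and nothing here is part of the commit. It assumes prefix-cache reuse happens at whole-block granularity, which is how vLLM's automatic prefix caching works in general; whether the hybrid KV cache manager changes that for this model is not established by the commit.

```bash
#!/usr/bin/env bash
# Hypothetical sanity checks (not in the commit) on values from the recipe.

# Number of whole KV blocks covered by the shared prefix.
full_prefix_blocks() {
    local prefix_len=$1 block_size=$2
    echo $(( prefix_len / block_size ))
}

# Prefix tokens that would spill into a partial trailing block.
prefix_remainder() {
    local prefix_len=$1 block_size=$2
    echo $(( prefix_len % block_size ))
}

# Succeeds when prompt + output tokens fit within --max-model-len.
fits_model_len() {
    local input_len=$1 output_len=$2 max_model_len=$3
    [ $(( input_len + output_len )) -le "$max_model_len" ]
}

echo "full blocks: $(full_prefix_blocks 7680 256)"
echo "remainder:   $(prefix_remainder 7680 256)"
if fits_model_len 8192 1024 9280; then
    echo "8192 + 1024 tokens fit within max-model-len 9280"
fi
```

Under those assumptions the 7680-token prefix spans exactly 30 blocks with no remainder, and the nominal 8192-token prompt plus 1024 output tokens (9216) stays within the 9280-token limit.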

benchmarks/multi_node/llm-d/job.slurm

Lines changed: 2 additions & 2 deletions
@@ -184,10 +184,10 @@ elif [[ "$LLMD_CONTAINER_ENGINE" == "pyxis" ]]; then
     # Optional load-generator selector + prefill-probe flag (from additional-settings).
     # Explicitly forwarded so the toggle can't silently fall through to the
     # default benchmark_serving sweep if srun's --export=ALL is ever restricted.
-    export BENCH_TOOL NYANN_DURATION NYANN_WARMUP PREFILL_ONLY_PROBE
+    export BENCH_TOOL NYANN_DURATION NYANN_WARMUP PREFILL_ONLY_PROBE BENCH_RANDOM_PREFIX_LEN

     PYXIS_ENV_LIST="NUM_NODES,PREFILL_NODES,DECODE_NODES,ALL_IPS,PREFILL_LEADER_IP,DECODE_LEADER_IP"
-    PYXIS_ENV_LIST+=",BENCH_TOOL,NYANN_DURATION,NYANN_WARMUP,PREFILL_ONLY_PROBE"
+    PYXIS_ENV_LIST+=",BENCH_TOOL,NYANN_DURATION,NYANN_WARMUP,PREFILL_ONLY_PROBE,BENCH_RANDOM_PREFIX_LEN"
     PYXIS_ENV_LIST+=",PREFILL_DP_ADDR,DECODE_DP_ADDR,MODEL_NAME,GPUS_PER_NODE"
     PYXIS_ENV_LIST+=",PREFILL_DP_SIZE,DECODE_DP_SIZE"
     PYXIS_ENV_LIST+=",BENCH_INPUT_LEN,BENCH_OUTPUT_LEN,BENCH_MAX_CONCURRENCY"
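
The variable has to appear in two places to reach the container: the `export` line and the comma-separated `PYXIS_ENV_LIST`. A small sketch of a membership check for such a list follows. `env_list_contains` is a hypothetical helper, the list is shortened for illustration, and how job.slurm later consumes `PYXIS_ENV_LIST` is not shown in this hunk.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): succeed if NAME is an exact
# entry in a comma-separated list such as PYXIS_ENV_LIST.
env_list_contains() {
    local list=$1 name=$2
    # Wrap both sides in commas so only whole entries match.
    case ",$list," in
        *",$name,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Shortened example list, built up the same way as in job.slurm.
PYXIS_ENV_LIST="NUM_NODES,PREFILL_NODES,DECODE_NODES"
PYXIS_ENV_LIST+=",BENCH_TOOL,PREFILL_ONLY_PROBE,BENCH_RANDOM_PREFIX_LEN"

for name in BENCH_RANDOM_PREFIX_LEN BENCH_RANDOM; do
    if env_list_contains "$PYXIS_ENV_LIST" "$name"; then
        echo "$name: forwarded"
    else
        echo "$name: missing"
    fi
done
```

The comma wrapping is the design point: a plain substring test would report `BENCH_RANDOM` as present because it is a prefix of `BENCH_RANDOM_PREFIX_LEN`.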

benchmarks/multi_node/llm-d/server.sh

Lines changed: 9 additions & 0 deletions
@@ -582,6 +582,15 @@ PY
         )
     fi

+    # Optional shared-prefix workload (prefix-cache experiment): prepend a
+    # fixed random prefix of BENCH_RANDOM_PREFIX_LEN tokens to every
+    # request. With server-side prefix caching ON, that shared span is a
+    # cache hit after warmup, so effective prefill drops to the unique
+    # suffix (--input-len). Default unset -> normal full-prefill sweep.
+    if [[ "${BENCH_RANDOM_PREFIX_LEN:-0}" -gt 0 ]]; then
+        bench_extra_args+=(--random-prefix-len "$BENCH_RANDOM_PREFIX_LEN")
+    fi
+
     run_benchmark_serving \
         --bench-serving-dir /workspace \
         --tokenizer /models \
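
The gate uses `${BENCH_RANDOM_PREFIX_LEN:-0}` so that an unset or empty variable behaves like 0. A value that is not a plain integer would still be handed to bash arithmetic by `-gt`, which may error out or evaluate unexpectedly. A minimal sketch of a stricter gate is below; `prefix_len_args` is a hypothetical helper and not something the commit adds.

```bash
#!/usr/bin/env bash
# Hypothetical stricter gate (not in the commit): print the extra benchmark
# arguments, or nothing when the prefix length is unset, empty, or 0.
prefix_len_args() {
    local value="${1:-0}"
    # Reject anything containing a non-digit before it reaches -gt.
    case "$value" in
        *[!0-9]*)
            echo "BENCH_RANDOM_PREFIX_LEN must be a non-negative integer: $value" >&2
            return 1
            ;;
    esac
    if [ "$value" -gt 0 ]; then
        echo "--random-prefix-len $value"
    fi
}

prefix_len_args ""      # empty, as when the variable is unset: prints nothing
prefix_len_args 7680    # prints the flag and its value
```

A caller could capture the output into `bench_extra_args` and abort the run on a non-zero return, instead of silently falling back to the full-prefill sweep.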
