@@ -770,7 +770,7 @@ the high-reuse steady state.
 Added a separate 1P/1D DEP8-prefill experiment using 16 inference GPUs. It
 keeps Dynamo KV routing, prefix caching, the 32K retention interval, KV event
 publication, and the validated TP8 decode path. Its prefill follows the repo's
-existing vLLM DEP pattern (`TP1 x DP8`, EP8, `deep_gemm_mega_moe`) and raises
+existing vLLM DEP pattern (`TP1 x DP8`, EP8) and raises
 the prefill batch-token ceiling from 16K to 32K. This tests whether eight-way
 attention parallelism can improve raw prefill throughput enough to offset the
 expected per-rank load-balance and cache-affinity penalty.
@@ -788,3 +788,11 @@ every discovered worker's vLLM running/waiting gauges between points. It
 requires three consecutive idle polls before continuing, waits up to 30
 minutes by default, and fails rather than contaminating the next result if the
 system cannot drain. It does not clear KV state.
+
+The first DEP8 bring-up (`27925198626`, Slurm `19547`) failed during model
+load with `KeyError: layers.0.ffn.experts.w13_input_scale`. EP filtering had
+already selected the expected 48/384 experts per rank. The failure came from
+the explicitly inherited `deep_gemm_mega_moe` loader, whose expected scale
+layout does not match the current v0.23 NVFP4 checkpoint. The override was
+removed so DEP8 uses the same checkpoint-compatible default MoE backend as the
+successful TEP8 recipes; no topology or cache setting changed.
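
Note on the drain gate described in the context lines of the second hunk: its control flow can be sketched roughly as below. This is a minimal illustration, not the repo's actual implementation. The names `wait_for_drain` and `poll_busy` are hypothetical, `poll_busy` stands in for whatever sums the vLLM running/waiting gauges across discovered workers, and the 10-second poll interval is an assumption (the text only states the three-idle-poll rule and the 30-minute default timeout).

```python
import time
from typing import Callable


def wait_for_drain(
    poll_busy: Callable[[], int],
    required_idle_polls: int = 3,
    timeout_s: float = 30 * 60,
    interval_s: float = 10.0,  # assumed; the source does not state the interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Block until `required_idle_polls` consecutive polls report zero
    running+waiting requests. Returns the number of polls taken.
    Raises TimeoutError instead of letting the next point start dirty."""
    deadline = clock() + timeout_s
    idle_streak = 0
    polls = 0
    while True:
        polls += 1
        busy = poll_busy()  # hypothetical: total running + waiting over all workers
        # Any non-idle reading resets the streak, so idleness must be consecutive.
        idle_streak = idle_streak + 1 if busy == 0 else 0
        if idle_streak >= required_idle_polls:
            return polls
        if clock() >= deadline:
            raise TimeoutError(
                f"system did not drain within {timeout_s:.0f}s "
                f"(last busy count: {busy})"
            )
        sleep(interval_s)
```

The streak reset is the point of the "three consecutive idle polls" rule: a single idle reading between bursts does not release the gate. The sketch only observes gauges and, like the gate described above, does not touch KV state.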