[NV] dsr1-fp4-b200-sglang: add DPA PDL lane #1792
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first, before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 4e4e9d3.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27579102530 |
|
@claude add PR description |
|
Claude finished @hshrivastava-droid's task in 1m 24s
Adding PR Description
PR description has been added with a structured summary covering the three changed files: config matrix updates, script DPA support, and perf-changelog entry. |
|
@functionstackx could you please review this |
|
@functionstackx could you please review this? |
d9e9701 to 2651117
As a PR reviewer and CODEOWNER, I have reviewed this and have:
Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
Verified that this PR has passed PR validation.
Verified that this PR passes evals.
If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
- If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
@Ankur-singh are you sure? --disable-piecewise-cuda-graph and lots of other flags here aren't in the SGLang cookbook, and I don't see an SGLang PR link in the additional detail section of your review
|
@Ankur-singh Cannot merge: the linked recipe omits reproduction-critical launch config from this PR (Check 3b). The sweep is now green (Checks 0/1/2 pass), so this supersedes the earlier skipped-sweep block.
|
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28048542583 |
|
@claude can you make sure the linked recipe PR aligns with the recipe in this PR? |
|
Claude finished @cquil11's task in 2m 46s
Comparing linked SGLang recipe PR to this PR's recipe
I pulled the actual base config feeding the recipe (
✅ The DP-attention args this PR adds match the recipe exactly
This is the core of this PR's new lane, and it's fully aligned:
✅ Shared base args also match
| Flag | This PR (.sh) | Recipe PR #28954 |
|---|---|---|
| --attention-backend | trtllm_mla | (unset → default) ❌ |
| --moe-runner-backend | flashinfer_trtllm | (unset → default) ❌ |
| --enable-flashinfer-allreduce-fusion | ✅ | (unset) ❌ |
| --quantization | modelopt_fp4 | (unset → auto-detected from model) |
--attention-backend and --moe-runner-backend materially change the kernels and the perf profile, so I'd count those as "major." If you want the recipe to actually reproduce the B200 FP4 numbers, those two should be added to the cookbook config. --quantization modelopt_fp4 is functionally equivalent to the recipe's auto-detection (the FP4 model carries its quant config), so that one is benign.
ℹ️ Remaining diffs are InferenceX/DPA benchmark tuning (expected to differ)
These are the kind of "specific to InferenceX" items you said don't need to match — flagging for completeness:
- --scheduler-recv-interval 1 (PR, DPA-tuned) vs 10/30 (recipe)
- --chunked-prefill-size 32768 (PR, DPA-tuned) vs 16384 (recipe 8-GPU)
- --disable-piecewise-cuda-graph (PR only)
- SGLANG_RADIX_FORCE_MISS=1 env (PR only)
Verdict
The DP-attention major server args fully align — the recipe PR faithfully documents the DPA flag set this PR introduces. The only genuinely major gap is the kernel backends (--attention-backend trtllm_mla, --moe-runner-backend flashinfer_trtllm), which the cookbook leaves at default. Worth adding those two to #28954 so the documented B200 FP4 config matches the kernels actually benchmarked; everything else either matches or is reasonable InferenceX-specific tuning.
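The flag comparison above can be sketched as a set difference. This is a minimal illustration, not how the comparison in this thread was actually produced: `missing_flags` is a hypothetical helper, and both flag lists are typed in by hand from the table above (the recipe list is an illustrative subset, not the full contents of #28954).

```shell
#!/bin/sh
# Sketch only: missing_flags is a hypothetical helper, and both flag lists are
# copied by hand from the comparison in this thread, not parsed from either repo.

# Print flags that appear in the first list but not in the second.
# Each argument is a newline-separated list of flag names.
missing_flags() {
  pr_file="$(mktemp)"
  recipe_file="$(mktemp)"
  # comm requires sorted input; sort -u also removes duplicates.
  printf '%s\n' "$1" | sort -u > "$pr_file"
  printf '%s\n' "$2" | sort -u > "$recipe_file"
  # -23 suppresses lines unique to the second file and lines common to both.
  comm -23 "$pr_file" "$recipe_file"
  rm -f "$pr_file" "$recipe_file"
}

# Flags set by this PR's script (subset taken from the table above).
pr_flags="--attention-backend
--moe-runner-backend
--enable-flashinfer-allreduce-fusion
--quantization
--enable-dp-attention"

# Illustrative subset of what the recipe set at the time of the comparison.
recipe_flags="--enable-dp-attention"

missing_flags "$pr_flags" "$recipe_flags"
```

A name-only diff like this cannot tell a benign gap (such as --quantization being auto-detected) from a material one, so the output still needs the manual triage done in the verdict above.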
…G_RADIX_FORCE_MISS and --disable-piecewise-cuda-graph - Motivation: Add `SGLANG_RADIX_FORCE_MISS=1` and replace `--disable-radix-cache` with `--disable-piecewise-cuda-graph` for all B200+FP4 serving configs; update DP-Attention tip with full DPA tuning flags per upstream recipe. - Reference: SemiAnalysisAI/InferenceX#1792
|
Per the above comment, I've added the flags for ModelOpt FP4, TRT-LLM MLA attention, FlashInfer TRT-LLM MoE, and FlashInfer allreduce fusion: https://github.com/sgl-project/sglang/pull/28954/changes |
cquil11 left a comment
Ok. I think we're good now. Thanks everyone.
functionstackx left a comment
lgtm. thank you @Ankur-singh for quickly adapting to the new PR review process

Summary
Adds a data-parallel attention (DPA) benchmark lane for the DeepSeek-R1 FP4 B200 SGLang (dsr1-fp4-b200-sglang) fixed-sequence-length recipe and retunes the concurrency sweep matrix.
Changes
Config (.github/configs/nvidia-master.yaml)
- Image: lmsysorg/sglang:v0.5.12-cu130 → lmsysorg/sglang:v0.5.12.post1
- Adds a lane with dp-attn: true, conc 64–256
Script (benchmarks/single_node/fixed_seq_len/dsr1_fp4_b200.sh)
- Adds a DP_ATTENTION env var (default false) with input validation
- When DP_ATTENTION=true, launches SGLang with:
  - TP-sized data parallelism (--data-parallel-size=$TP)
  - --enable-dp-attention, --enable-dp-attention-local-control-broadcast, --enable-dp-lm-head
  - --enable-prefill-delayer, --schedule-conservativeness 3.33
- Sets the SGLANG_RADIX_FORCE_MISS=1 env var
- Replaces --disable-radix-cache with --disable-piecewise-cuda-graph
Perf Changelog (perf-changelog.yaml)
- Adds an entry for dsr1-fp4-b200-sglang
Note
Low Risk
Benchmark-only changes (CI config, launch script, changelog) with no production auth, data, or serving-path impact.
Overview
Adds a data-parallel attention (DPA) benchmark path for DeepSeek-R1 FP4 on B200 SGLang and retunes the fixed-seq-len concurrency matrix.
Config (dsr1-fp4-b200-sglang): bumps the image to lmsysorg/sglang:v0.5.12.post1. For 1k/1k, TP4/EP4 sweeps 1–256 (was 1–128); TP8/EP8 is a single point at conc 1 (was 1–128). For 8k/1k, adds TP4/EP4 with dp-attn: true at conc 64–256; TP8/EP8 is conc 1 only (was 1–16).
dsr1_fp4_b200.sh: introduces DP_ATTENTION (default false). When enabled, the server uses TP-sized DP, DP-attention flags, prefill delayer, tighter scheduler recv (1), larger chunked prefill (32768), and extra scheduling conservativeness. All runs set SGLANG_RADIX_FORCE_MISS=1, drop --disable-radix-cache, and add --disable-piecewise-cuda-graph.
perf-changelog.yaml documents the above under dsr1-fp4-b200-sglang.
Reviewed by Cursor Bugbot for commit 496cefc. Bugbot is set up for automated code reviews on this repo.
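The DP_ATTENTION toggle described above could look roughly like the sketch below. It is not the contents of dsr1_fp4_b200.sh: `build_server_args` is a hypothetical helper, the default TP of 4 is an assumption taken from the TP4/EP4 DPA lane, and the function only prints flags instead of launching a server. The flag names and values themselves are the ones listed in this PR's description and review thread.

```shell
#!/bin/sh
# Sketch only: build_server_args is a hypothetical helper, not the PR's script.
# It prints one flag per line instead of launching SGLang.

# Set for all runs, per the PR description.
export SGLANG_RADIX_FORCE_MISS=1

build_server_args() {
  dp_attention="${1:-false}"
  tp="${2:-4}"   # assumed default, matching the TP4/EP4 DPA lane

  # Input validation: accept only the two literal values.
  case "$dp_attention" in
    true|false) ;;
    *)
      echo "DP_ATTENTION must be 'true' or 'false', got '$dp_attention'" >&2
      return 1
      ;;
  esac

  # Applied to every run (replaces --disable-radix-cache).
  printf '%s\n' "--disable-piecewise-cuda-graph"

  # DPA-only flags: TP-sized data parallelism plus the tuning listed in the PR.
  if [ "$dp_attention" = "true" ]; then
    printf '%s\n' \
      "--data-parallel-size=$tp" \
      "--enable-dp-attention" \
      "--enable-dp-attention-local-control-broadcast" \
      "--enable-dp-lm-head" \
      "--enable-prefill-delayer" \
      "--schedule-conservativeness 3.33" \
      "--scheduler-recv-interval 1" \
      "--chunked-prefill-size 32768"
  fi
}

build_server_args "${DP_ATTENTION:-false}" "${TP:-4}"
```

Running it with DP_ATTENTION=true TP=4 prints the DPA flag set; any value other than true or false is rejected with a non-zero exit status, which keeps a typo in the config matrix from silently falling back to the non-DPA path.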