
[NV]dsr1-fp4-b200-sglang: add DPA PDL lane #1792

Merged: cquil11 merged 8 commits into main from nv/dsr1-fp4-v2 on Jun 23, 2026


@hshrivastava-droid hshrivastava-droid commented Jun 15, 2026

Collaborator

Summary

Adds a data-parallel attention (DPA) benchmark lane for the DeepSeek-R1 FP4 B200 SGLang (dsr1-fp4-b200-sglang) fixed-sequence-length recipe and retunes the concurrency sweep matrix.

Changes

Config (.github/configs/nvidia-master.yaml)

  • Image bump: lmsysorg/sglang:v0.5.12-cu130 → lmsysorg/sglang:v0.5.12.post1
  • 1k/1k search space:
    • TP4/EP4 concurrency expanded from 1–128 → 1–256
    • TP8/EP8 changed from conc 1–128 sweep → conc-list: [1] (single-point)
  • 8k/1k search space:
    • TP4/EP4 conc 1–128 retained
    • New: TP4/EP4 with dp-attn: true, conc 64–256
    • TP8/EP8 changed from conc 1–16 sweep → conc-list: [1] (single-point)
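
For illustration, the range notation above can be expanded into individual sweep points. This is a minimal sketch that assumes the sweep doubles concurrency at each step; the PR does not state the step rule, and `expand_conc` is a hypothetical helper, not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper: expand a "start to end" concurrency range into sweep points.
# ASSUMPTION: concurrency doubles at each step (not confirmed by this PR).
expand_conc() {
  local start=$1 end=$2
  # Guard against a non-positive start, which would never advance when doubled.
  [ "$start" -ge 1 ] || return 1
  local c=$start points=""
  while [ "$c" -le "$end" ]; do
    points="${points:+$points }$c"
    c=$((c * 2))
  done
  echo "$points"
}

expand_conc 1 256    # 1k/1k TP4/EP4 range after this PR
expand_conc 64 256   # 8k/1k TP4/EP4 range with dp-attn: true
```

Under that doubling assumption, widening 1–128 to 1–256 adds a single extra point at 256.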

Script (benchmarks/single_node/fixed_seq_len/dsr1_fp4_b200.sh)

  • Adds DP_ATTENTION env var (default false) with input validation
  • When DP_ATTENTION=true, launches SGLang with:
    • TP-sized data parallelism (--data-parallel-size=$TP)
    • --enable-dp-attention, --enable-dp-attention-local-control-broadcast, --enable-dp-lm-head
    • --enable-prefill-delayer, --schedule-conservativeness 3.33
    • Tighter scheduler recv interval (1 instead of 10/30)
    • Larger chunked prefill size (32768 instead of 16384)
  • All runs now set SGLANG_RADIX_FORCE_MISS=1 env var
  • Replaces --disable-radix-cache with --disable-piecewise-cuda-graph
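
As a rough sketch of the DP_ATTENTION handling described above (not the actual contents of dsr1_fp4_b200.sh): the function name `build_dpa_args` is hypothetical, the flags and values are the ones listed in this summary and in the review below, and the non-DPA receive interval of 10 stands in for the "10/30" noted above, since how the script picks between the two is not shown here.

```bash
#!/usr/bin/env bash
# Sketch only, not the real dsr1_fp4_b200.sh. build_dpa_args is a hypothetical name.

# Per the summary, every run sets this env var.
export SGLANG_RADIX_FORCE_MISS=1

# Print the DPA-dependent server flags, one token per line.
build_dpa_args() {
  local dp_attention=$1 tp=$2
  case "$dp_attention" in
    true)
      printf '%s\n' \
        "--data-parallel-size=$tp" \
        --enable-dp-attention \
        --enable-dp-attention-local-control-broadcast \
        --enable-dp-lm-head \
        --enable-prefill-delayer \
        --schedule-conservativeness 3.33 \
        --scheduler-recv-interval 1 \
        --chunked-prefill-size 32768
      ;;
    false)
      # ASSUMPTION: 10 is used here; the PR mentions 10/30 without saying when each applies.
      printf '%s\n' \
        --scheduler-recv-interval 10 \
        --chunked-prefill-size 16384
      ;;
    *)
      # Input validation: reject anything other than true/false.
      echo "DP_ATTENTION must be 'true' or 'false', got '$dp_attention'" >&2
      return 1
      ;;
  esac
}

# DP_ATTENTION defaults to false; TP=4 is an illustrative default.
build_dpa_args "${DP_ATTENTION:-false}" "${TP:-4}"
```

Keeping the DPA-only flags in one branch makes it easy to see that the default (non-DPA) lane keeps its previous scheduler and prefill settings.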

Perf Changelog (perf-changelog.yaml)

  • Documents all config and script changes under dsr1-fp4-b200-sglang

Note

Low Risk
Benchmark-only changes (CI config, launch script, changelog) with no production auth, data, or serving-path impact.

Overview
Adds a data-parallel attention (DPA) benchmark path for DeepSeek-R1 FP4 on B200 SGLang and retunes the fixed-seq-len concurrency matrix.

Config (dsr1-fp4-b200-sglang): bumps the image to lmsysorg/sglang:v0.5.12.post1. For 1k/1k, TP4/EP4 sweeps 1–256 (was 1–128); TP8/EP8 is a single point at conc 1 (was 1–128). For 8k/1k, adds TP4/EP4 with dp-attn: true at conc 64–256; TP8/EP8 is conc 1 only (was 1–16).

dsr1_fp4_b200.sh: introduces DP_ATTENTION (default false). When enabled, the server uses TP-sized DP, DP-attention flags, prefill delayer, tighter scheduler recv (1), larger chunked prefill (32768), and extra scheduling conservativeness. All runs set SGLANG_RADIX_FORCE_MISS=1, drop --disable-radix-cache, and add --disable-piecewise-cuda-graph.

perf-changelog.yaml documents the above under dsr1-fp4-b200-sglang.

Reviewed by Cursor Bugbot for commit 496cefc. Bugbot is set up for automated code reviews on this repo.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4e4e9d3.

Comment thread .github/configs/nvidia-master.yaml

@hshrivastava-droid hshrivastava-droid changed the title from "[WIP][NV]dsr1-fp4-b200-sglang: add DPA PDL lane" to "[NV]dsr1-fp4-b200-sglang: add DPA PDL lane" on Jun 16, 2026
@hshrivastava-droid

Collaborator Author

@claude add PR description

@Klaud-Cold

Klaud-Cold commented Jun 16, 2026

Collaborator

Claude finished @hshrivastava-droid's task in 1m 24s.


Adding PR Description

  • Gather context from changed files
  • Write and update PR description

PR description has been added with a structured summary covering the three changed files: config matrix updates, script DPA support, and perf-changelog entry.

@hshrivastava-droid

Collaborator Author

@functionstackx could you please review this

@jgangani jgangani left a comment

Collaborator

LGTM

@hshrivastava-droid

Collaborator Author

@functionstackx could you please review this?

@cquil11 cquil11 requested a review from Ankur-singh as a code owner June 22, 2026 20:41
@hshrivastava-droid hshrivastava-droid deleted the nv/dsr1-fp4-v2 branch June 22, 2026 20:48
@cquil11 cquil11 restored the nv/dsr1-fp4-v2 branch June 22, 2026 20:50
@cquil11 cquil11 reopened this Jun 22, 2026

@cquil11 cquil11 left a comment

Collaborator

lgtm

Comment thread benchmarks/single_node/fixed_seq_len/dsr1_fp4_b200.sh

@functionstackx functionstackx left a comment

Collaborator

As a PR reviewer and CODEOWNER, I have reviewed this and have:

  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.

  • Verified that this PR has passed PR validation.

  • Verified that this PR passes evals.

  • If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.

  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:

    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:

@Ankur-singh r u sure? --disable-piecewise-cuda-graph etc and lots of other flags here arent in sglang cookbook and i dont see an sglang pr link in ur additional section part of review

@SemiAnalysisAI SemiAnalysisAI deleted a comment from Klaud-Cold Jun 23, 2026
@Klaud-Cold

Klaud-Cold commented Jun 23, 2026

Collaborator

@Ankur-singh Cannot merge: the linked recipe omits reproduction-critical launch config from this PR (Check 3b). The sweep is now green (Checks 0/1/2 pass), so this supersedes the earlier skipped-sweep block.

  • Check 0 (CODEOWNER): PASS — signer owns .github/configs/nvidia-master.yaml (@ankur-singh); dsr1_fp4_b200.sh + perf-changelog.yaml fall to catch-all * @InferenceX/core, covered.
  • Check 1 (sweep+evals on in-PR commit): PASS — head 426b637 has green, executed single-node 1k1k /, single-node 8k1k /, and eval / check-runs: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28048542583
  • Check 2 (eval accuracy): PASS — gsm8k em_strict 0.954–0.957 (dpa true+false, conc 64–256) on image lmsysorg/sglang:v0.5.12.post1, matching this PR's config.
  • Check 3 (recipe): FAIL — link present (Add dpa for dsr1 fp4 sgl-project/sglang#28954, documents the DPA flags), but the recipe is missing these reproduction-critical items present in this PR's recipe: engine image/version pin (v0.5.12.post1); env var SGLANG_RADIX_FORCE_MISS=1; --quantization modelopt_fp4; --attention-backend trtllm_mla (recipe uses flashinfer); --moe-runner-backend flashinfer_trtllm (recipe uses flashinfer_cutlass); --enable-flashinfer-allreduce-fusion; --disable-piecewise-cuda-graph; and the DPA-tuned --scheduler-recv-interval 1 / --chunked-prefill-size 32768. Add these to the cookbook entry so the B200 FP4 DPA config is actually reproducible.

@hshrivastava-droid

Collaborator Author

/reuse-sweep-run

@cquil11

cquil11 commented Jun 23, 2026

Collaborator

@claude can you make sure the linked recipe PR aligns with the recipe in this PR?
the major server args should match, not necessarily everything specific to InferenceX

@Klaud-Cold

Klaud-Cold commented Jun 23, 2026

Collaborator

Claude finished @cquil11's task in 2m 46s.


Comparing linked SGLang recipe PR to this PR's recipe


I pulled the actual base config feeding the recipe (docs_new/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx, the b200/fp4 entries) plus the DPA diff in sgl-project/sglang#28954, and compared against the DP_ATTENTION=true branch of dsr1_fp4_b200.sh.

✅ The DP-attention args this PR adds match the recipe exactly

This is the core of this PR's new lane, and it's fully aligned:

Flag | This PR (.sh) | Recipe PR #28954
--data-parallel-size | =$TP | <TP>
--enable-dp-attention | set | set
--enable-dp-attention-local-control-broadcast | set | set
--enable-dp-lm-head | set | set
--schedule-conservativeness 3.33 | set | set
--enable-prefill-delayer | set | set

✅ Shared base args also match

--model-path nvidia/DeepSeek-R1-0528-FP4-v2, --cuda-graph-max-bs 256, --max-running-requests 256, --mem-fraction-static 0.85, --kv-cache-dtype fp8_e4m3, --ep-size, --enable-symm-mem, --stream-interval 10 — all present in both.

⚠️ Two major kernel-selection args are NOT in the recipe

The cookbook's b200/fp4 config does not set these keys, so the generated recipe command omits them (SGLang falls back to defaults; the cookbook's other FP4 entries use flashinfer/flashinfer_cutlass, never trtllm):

Flag | This PR (.sh) | Recipe PR #28954
--attention-backend | trtllm_mla | (unset → default)
--moe-runner-backend | flashinfer_trtllm | (unset → default)
--enable-flashinfer-allreduce-fusion | set | (unset)
--quantization | modelopt_fp4 | (unset → auto-detected from model) ⚠️

--attention-backend and --moe-runner-backend materially change the kernels and the perf profile, so I'd count those as "major." If you want the recipe to actually reproduce the B200 FP4 numbers, those two should be added to the cookbook config. --quantization modelopt_fp4 is functionally equivalent to the recipe's auto-detection (the FP4 model carries its quant config), so that one is benign.

ℹ️ Remaining diffs are InferenceX/DPA benchmark tuning (expected to differ)

These are the kind of "specific to InferenceX" items you said don't need to match — flagging for completeness:

  • --scheduler-recv-interval 1 (PR, DPA-tuned) vs 10/30 (recipe)
  • --chunked-prefill-size 32768 (PR, DPA-tuned) vs 16384 (recipe 8-GPU)
  • --disable-piecewise-cuda-graph (PR only)
  • SGLANG_RADIX_FORCE_MISS=1 env (PR only)

Verdict

The DP-attention major server args fully align — the recipe PR faithfully documents the DPA flag set this PR introduces. The only genuinely major gap is the kernel backends (--attention-backend trtllm_mla, --moe-runner-backend flashinfer_trtllm), which the cookbook leaves at default. Worth adding those two to #28954 so the documented B200 FP4 config matches the kernels actually benchmarked; everything else either matches or is reasonable InferenceX-specific tuning.
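
The gap check above can be reproduced mechanically. A minimal sketch, assuming abbreviated flag lists hand-copied from this comparison as it stood when this comment was written (the variable and function names are hypothetical, and the lists are not exhaustive):

```bash
#!/usr/bin/env bash
# Sketch: report flags the benchmark script sets that the recipe does not document.
# Both lists are abbreviated, hand-copied samples from the comparison above.
pr_flags="--attention-backend --moe-runner-backend \
--enable-flashinfer-allreduce-fusion --quantization \
--enable-dp-attention --enable-dp-lm-head --enable-prefill-delayer"

recipe_flags="--enable-dp-attention --enable-dp-lm-head --enable-prefill-delayer"

missing_from_recipe() {
  local flag
  for flag in $pr_flags; do
    # Pad with spaces so only whole-flag matches count.
    case " $recipe_flags " in
      *" $flag "*) ;;                      # documented in the recipe
      *) printf '%s\n' "$flag" ;;          # set in the PR, absent from the recipe
    esac
  done
}

missing_from_recipe
```

With these sample lists the function prints the four kernel and quantization flags from the second table, in the order they appear in `pr_flags`.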

faradawn pushed a commit to hogepodge/sglang that referenced this pull request Jun 23, 2026
…G_RADIX_FORCE_MISS and --disable-piecewise-cuda-graph

- Motivation: Add `SGLANG_RADIX_FORCE_MISS=1` and replace `--disable-radix-cache` with `--disable-piecewise-cuda-graph` for all B200+FP4 serving configs; update DP-Attention tip with full DPA tuning flags per upstream recipe.
- Reference: SemiAnalysisAI/InferenceX#1792
@faradawn


Per the comment above, I've added the flags for ModelOpt FP4, TRT-LLM MLA attention, FlashInfer TRT-LLM MoE, and FlashInfer allreduce fusion: https://github.com/sgl-project/sglang/pull/28954/changes

@cquil11 cquil11 left a comment

Collaborator

Ok. I think we're good now. Thanks everyone.

@functionstackx functionstackx left a comment

Collaborator

lgtm. thank you @Ankur-singh for quickly adapting to the new PR review process

#1792 (comment)

@cquil11 cquil11 merged commit 94d4968 into main Jun 23, 2026
29 checks passed
@cquil11 cquil11 deleted the nv/dsr1-fp4-v2 branch June 23, 2026 22:51

8 participants