[NV]dsr1-fp4-b200-sglang: add DPA PDL lane by hshrivastava-droid · Pull Request #1792 · SemiAnalysisAI/InferenceX

hshrivastava-droid · 2026-06-15T21:58:00Z

Summary

Adds a data-parallel attention (DPA) benchmark lane for the DeepSeek-R1 FP4 B200 SGLang (dsr1-fp4-b200-sglang) fixed-sequence-length recipe and retunes the concurrency sweep matrix.

Changes

Config (`.github/configs/nvidia-master.yaml`)

- Image bump: lmsysorg/sglang:v0.5.12-cu130 → lmsysorg/sglang:v0.5.12.post1
- 1k/1k search space:
  - TP4/EP4 concurrency expanded from 1–128 to 1–256
  - TP8/EP8 changed from a conc 1–128 sweep to conc-list: [1] (single point)
- 8k/1k search space:
  - TP4/EP4 conc 1–128 retained
  - New: TP4/EP4 with dp-attn: true, conc 64–256
  - TP8/EP8 changed from a conc 1–16 sweep to conc-list: [1] (single point)

Script (`benchmarks/single_node/fixed_seq_len/dsr1_fp4_b200.sh`)

- Adds DP_ATTENTION env var (default false) with input validation
- When DP_ATTENTION=true, launches SGLang with:
  - TP-sized data parallelism (--data-parallel-size=$TP)
  - --enable-dp-attention, --enable-dp-attention-local-control-broadcast, --enable-dp-lm-head
  - --enable-prefill-delayer, --schedule-conservativeness 3.33
  - Tighter scheduler recv interval (1 instead of 10/30)
  - Larger chunked prefill size (32768 instead of 16384)
- All runs now set the SGLANG_RADIX_FORCE_MISS=1 env var
- Replaces --disable-radix-cache with --disable-piecewise-cuda-graph
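
The DP_ATTENTION handling listed above could look roughly like the following POSIX shell sketch. It is an illustration under stated assumptions, not the code from this PR: the helper name `dpa_args` is hypothetical, only flags named in this thread are used, and the non-DPA values (recv interval 10/30, chunked prefill 16384) are assumed to be set elsewhere in the script.

```shell
#!/bin/sh
# Illustrative sketch only; the real dsr1_fp4_b200.sh may be structured differently.
# dpa_args is a hypothetical helper name, not taken from the PR.

# Print the extra server flags for a DP_ATTENTION value ($1) and TP size ($2).
# Returns 1 and writes to stderr when DP_ATTENTION is neither "true" nor "false".
dpa_args() {
    dp_attention=$1
    tp=$2
    case "$dp_attention" in
        true)
            # Flags as listed in the PR description for the DPA lane.
            printf '%s ' \
                "--data-parallel-size=$tp" \
                --enable-dp-attention \
                --enable-dp-attention-local-control-broadcast \
                --enable-dp-lm-head \
                --enable-prefill-delayer \
                "--schedule-conservativeness 3.33" \
                "--scheduler-recv-interval 1" \
                "--chunked-prefill-size 32768"
            ;;
        false)
            # Non-DPA path: no extra flags from this helper (assumption).
            ;;
        *)
            echo "DP_ATTENTION must be 'true' or 'false', got '$dp_attention'" >&2
            return 1
            ;;
    esac
}

# Example use: default to the non-DPA lane when the variable is unset.
DP_ATTENTION=${DP_ATTENTION:-false}
TP=${TP:-4}
extra=$(dpa_args "$DP_ATTENTION" "$TP") || exit 1
echo "extra server flags: $extra"
```

Validating the variable before building the command line means a typo such as DP_ATTENTION=True fails fast instead of silently benchmarking the non-DPA lane.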

Perf Changelog (`perf-changelog.yaml`)

Documents all config and script changes under dsr1-fp4-b200-sglang

Note

Low Risk
Benchmark-only changes (CI config, launch script, changelog) with no production auth, data, or serving-path impact.

Overview
Adds a data-parallel attention (DPA) benchmark path for DeepSeek-R1 FP4 on B200 SGLang and retunes the fixed-seq-len concurrency matrix.

Config (dsr1-fp4-b200-sglang): bumps the image to lmsysorg/sglang:v0.5.12.post1. For 1k/1k, TP4/EP4 sweeps 1–256 (was 1–128); TP8/EP8 is a single point at conc 1 (was 1–128). For 8k/1k, adds TP4/EP4 with dp-attn: true at conc 64–256; TP8/EP8 is conc 1 only (was 1–16).

dsr1_fp4_b200.sh: introduces DP_ATTENTION (default false). When enabled, the server uses TP-sized DP, DP-attention flags, prefill delayer, tighter scheduler recv (1), larger chunked prefill (32768), and extra scheduling conservativeness. All runs set SGLANG_RADIX_FORCE_MISS=1, drop --disable-radix-cache, and add --disable-piecewise-cuda-graph.

perf-changelog.yaml documents the above under dsr1-fp4-b200-sglang.

Reviewed by Cursor Bugbot for commit 496cefc.

github-actions · 2026-06-15T21:58:09Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4e4e9d3.

github-actions · 2026-06-16T02:27:16Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27579102530
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27579102530

hshrivastava-droid · 2026-06-16T19:42:34Z

@claude add PR description

Klaud-Cold · 2026-06-16T19:43:06Z

Claude finished @hshrivastava-droid's task in 1m 24s

Adding PR Description

Gather context from changed files
Write and update PR description

PR description has been added with a structured summary covering the three changed files: config matrix updates, script DPA support, and perf-changelog entry.

hshrivastava-droid · 2026-06-16T19:51:31Z

@functionstackx could you please review this

jgangani

LGTM

hshrivastava-droid · 2026-06-22T18:20:06Z

Recipe: https://github.com/sgl-project/sglang/pull/28954/changes

hshrivastava-droid · 2026-06-22T18:20:30Z

@functionstackx could you please review this?

cquil11

lgtm

Fixed

functionstackx

As a PR reviewer and CODEOWNER, I have reviewed this and have:

Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.

Verified that this PR has passed PR validation.

Verified that this PR passes evals.

If a company claims that it supports vLLM/SGLang as first-class LLM inference engines on its hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.

Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:

If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:

@Ankur-singh r u sure? --disable-piecewise-cuda-graph etc. and lots of other flags here aren't in the SGLang cookbook, and I don't see an SGLang PR link in the additional detail section of your review

Klaud-Cold · 2026-06-23T18:38:50Z

@Ankur-singh Cannot merge: the linked recipe omits reproduction-critical launch config from this PR (Check 3b). The sweep is now green (Checks 0/1/2 pass), so this supersedes the earlier skipped-sweep block.

Check 0 (CODEOWNER): PASS — signer owns .github/configs/nvidia-master.yaml (@ankur-singh); dsr1_fp4_b200.sh + perf-changelog.yaml fall to catch-all * @InferenceX/core, covered.
Check 1 (sweep+evals on in-PR commit): PASS — head 426b637 has green, executed single-node 1k1k /, single-node 8k1k /, and eval / check-runs: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28048542583
Check 2 (eval accuracy): PASS — gsm8k em_strict 0.954–0.957 (dpa true+false, conc 64–256) on image lmsysorg/sglang:v0.5.12.post1, matching this PR's config.
Check 3 (recipe): FAIL — link present (Add dpa for dsr1 fp4 sgl-project/sglang#28954, documents the DPA flags), but the recipe is missing these reproduction-critical items present in this PR's recipe: engine image/version pin (v0.5.12.post1); env var SGLANG_RADIX_FORCE_MISS=1; --quantization modelopt_fp4; --attention-backend trtllm_mla (recipe uses flashinfer); --moe-runner-backend flashinfer_trtllm (recipe uses flashinfer_cutlass); --enable-flashinfer-allreduce-fusion; --disable-piecewise-cuda-graph; and the DPA-tuned --scheduler-recv-interval 1 / --chunked-prefill-size 32768. Add these to the cookbook entry so the B200 FP4 DPA config is actually reproducible.

hshrivastava-droid · 2026-06-23T18:38:59Z

/reuse-sweep-run

github-actions · 2026-06-23T20:01:18Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28048542583
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28048542583

github-actions · 2026-06-23T20:29:19Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28048542583
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28048542583

github-actions · 2026-06-23T20:55:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28048542583
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28048542583

cquil11 · 2026-06-23T21:54:17Z

@claude can you make sure the linked recipe PR aligns with the recipe in this PR?
the major server args should match, not necessarily everything specific to InferenceX

Klaud-Cold · 2026-06-23T21:54:49Z

Claude finished @cquil11's task in 2m 46s

Comparing linked SGLang recipe PR to this PR's recipe

Read this PR's launch script server args
Fetch linked recipe PR (Add dpa for dsr1 fp4 sgl-project/sglang#28954)
Compare major server args
Report alignment

I pulled the actual base config feeding the recipe (docs_new/src/snippets/autoregressive/deepseek-r1-advanced-deployment.jsx, the b200/fp4 entries) plus the DPA diff in sgl-project/sglang#28954, and compared against the DP_ATTENTION=true branch of dsr1_fp4_b200.sh.

✅ The DP-attention args this PR adds match the recipe exactly

This is the core of this PR's new lane, and it's fully aligned:

| Flag | This PR (`.sh`) | Recipe PR #28954 |
| --- | --- | --- |
| `--data-parallel-size` | `=$TP` | `<TP>` ✅ |
| `--enable-dp-attention` | ✅ | ✅ |
| `--enable-dp-attention-local-control-broadcast` | ✅ | ✅ |
| `--enable-dp-lm-head` | ✅ | ✅ |
| `--schedule-conservativeness 3.33` | ✅ | ✅ |
| `--enable-prefill-delayer` | ✅ | ✅ |

✅ Shared base args also match

--model-path nvidia/DeepSeek-R1-0528-FP4-v2, --cuda-graph-max-bs 256, --max-running-requests 256, --mem-fraction-static 0.85, --kv-cache-dtype fp8_e4m3, --ep-size, --enable-symm-mem, --stream-interval 10 — all present in both.

⚠️ Two major kernel-selection args are NOT in the recipe

The cookbook's b200/fp4 config does not set these keys, so the generated recipe command omits them (SGLang falls back to defaults; the cookbook's other FP4 entries use flashinfer/flashinfer_cutlass, never trtllm):

| Flag | This PR (`.sh`) | Recipe PR #28954 |
| --- | --- | --- |
| `--attention-backend` | `trtllm_mla` | (unset → default) ❌ |
| `--moe-runner-backend` | `flashinfer_trtllm` | (unset → default) ❌ |
| `--enable-flashinfer-allreduce-fusion` | ✅ | (unset) ❌ |
| `--quantization` | `modelopt_fp4` | (unset → auto-detected from model) ⚠️ |

--attention-backend and --moe-runner-backend materially change the kernels and the perf profile, so I'd count those as "major." If you want the recipe to actually reproduce the B200 FP4 numbers, those two should be added to the cookbook config. --quantization modelopt_fp4 is functionally equivalent to the recipe's auto-detection (the FP4 model carries its quant config), so that one is benign.

ℹ️ Remaining diffs are InferenceX/DPA benchmark tuning (expected to differ)

These are the kind of "specific to InferenceX" items you said don't need to match — flagging for completeness:

--scheduler-recv-interval 1 (PR, DPA-tuned) vs 10/30 (recipe)
--chunked-prefill-size 32768 (PR, DPA-tuned) vs 16384 (recipe 8-GPU)
--disable-piecewise-cuda-graph (PR only)
SGLANG_RADIX_FORCE_MISS=1 env (PR only)

Verdict

The DP-attention major server args fully align — the recipe PR faithfully documents the DPA flag set this PR introduces. The only genuinely major gap is the kernel backends (--attention-backend trtllm_mla, --moe-runner-backend flashinfer_trtllm), which the cookbook leaves at default. Worth adding those two to #28954 so the documented B200 FP4 config matches the kernels actually benchmarked; everything else either matches or is reasonable InferenceX-specific tuning.
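
A comparison like the one in this comment can be approximated mechanically. The sketch below is a hypothetical helper (the name `flags_missing_from` and the abbreviated command strings are illustrative, not taken from either PR) that lists the `--flags` present in one launch command but absent from another. It compares flag names only and ignores values, so it would not catch a difference such as `trtllm_mla` vs `flashinfer` for the same flag.

```shell
#!/bin/sh
# Hypothetical helper: print each --flag name that appears in the first
# command string ($1) but not in the second ($2). Values are ignored.
flags_missing_from() {
    ours=$1
    theirs=$2
    # Unquoted expansion is intentional: split the command into words.
    for tok in $ours; do
        case "$tok" in
            --*)
                name=${tok%%=*}   # strip "=value" if the flag uses that form
                case " $theirs " in
                    *" $name "*|*" $name="*) ;;      # also present in the other command
                    *) printf '%s\n' "$name" ;;      # only in ours
                esac
                ;;
        esac
    done
}

# Abbreviated example commands built from flags mentioned in this thread.
pr_cmd="--enable-dp-attention --enable-dp-lm-head --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm --enable-flashinfer-allreduce-fusion"
recipe_cmd="--enable-dp-attention --enable-dp-lm-head"

flags_missing_from "$pr_cmd" "$recipe_cmd"
```

With the abbreviated inputs above, the helper prints the three kernel-related flag names, one per line. Matching on a trailing space or `=` keeps `--enable-dp-attention` from being treated as present merely because a longer flag shares its prefix.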

…G_RADIX_FORCE_MISS and --disable-piecewise-cuda-graph - Motivation: Add `SGLANG_RADIX_FORCE_MISS=1` and replace `--disable-radix-cache` with `--disable-piecewise-cuda-graph` for all B200+FP4 serving configs; update DP-Attention tip with full DPA tuning flags per upstream recipe. - Reference: SemiAnalysisAI/InferenceX#1792

faradawn · 2026-06-23T22:13:38Z

Per above comment, I've added the flags of ModelOpt FP4, TRT-LLM MLA attention, FlashInfer TRT-LLM MoE, and FlashInfer allreduce fusion. https://github.com/sgl-project/sglang/pull/28954/changes.

cquil11

Ok. I think we're good now. Thanks everyone.

functionstackx

lgtm. thank you @Ankur-singh for quickly adapting to the new PR review process

#1792 (comment)

hshrivastava-droid requested a review from a team June 15, 2026 21:58

hshrivastava-droid requested a review from kedarpotdar-nv as a code owner June 15, 2026 21:58

hshrivastava-droid added the full-sweep-enabled label Jun 15, 2026

hshrivastava-droid requested a review from jgangani as a code owner June 15, 2026 21:58

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

cursor Bot reviewed Jun 15, 2026

Comment thread .github/configs/nvidia-master.yaml

hshrivastava-droid changed the title ~~[WIP][NV]dsr1-fp4-b200-sglang: add DPA PDL lane~~ [NV]dsr1-fp4-b200-sglang: add DPA PDL lane Jun 16, 2026

kedarpotdar-nv approved these changes Jun 16, 2026

jgangani approved these changes Jun 16, 2026

faradawn mentioned this pull request Jun 22, 2026

Add dpa for dsr1 fp4 sgl-project/sglang#28954

Open

5 tasks

cquil11 requested a review from Ankur-singh as a code owner June 22, 2026 20:41

hshrivastava-droid closed this Jun 22, 2026

hshrivastava-droid deleted the nv/dsr1-fp4-v2 branch June 22, 2026 20:48

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 22, 2026

cquil11 restored the nv/dsr1-fp4-v2 branch June 22, 2026 20:50

cquil11 reopened this Jun 22, 2026

cquil11 approved these changes Jun 22, 2026

cquil11 requested changes Jun 22, 2026

Comment thread benchmarks/single_node/fixed_seq_len/dsr1_fp4_b200.sh

hshrivastava-droid removed the full-sweep-enabled label Jun 22, 2026

add DPA PDL lane

2651117

hshrivastava-droid force-pushed the nv/dsr1-fp4-v2 branch from d9e9701 to 2651117 Compare June 22, 2026 22:39

Merge branch 'main' into nv/dsr1-fp4-v2

a2f2de3

hshrivastava-droid added the full-sweep-enabled label Jun 23, 2026

functionstackx requested changes Jun 23, 2026

functionstackx requested a review from cquil11 June 23, 2026 17:34

cquil11 mentioned this pull request Jun 23, 2026

ci: add CODEOWNER sign-off verification workflow #1901

Merged

cquil11 added 2 commits June 23, 2026 13:12

Merge branch 'main' into nv/dsr1-fp4-v2

d7ce1a1

Merge branch 'main' into nv/dsr1-fp4-v2

8bc5659

SemiAnalysisAI deleted a comment from Klaud-Cold Jun 23, 2026

Merge branch 'main' into nv/dsr1-fp4-v2

37af7c5

cquil11 mentioned this pull request Jun 23, 2026

ci: tighten sign-off verifier output + fix catch-all CODEOWNER logic #1904

Merged

Merge branch 'main' into nv/dsr1-fp4-v2

426b637

hshrivastava-droid added full-sweep-enabled and removed full-sweep-enabled labels Jun 23, 2026

Ankur-singh approved these changes Jun 23, 2026

cquil11 mentioned this pull request Jun 23, 2026

ci: recipe check matches major server args, not InferenceX-specific tuning #1908

Merged

Merge branch 'main' into nv/dsr1-fp4-v2

ae3a317

cquil11 approved these changes Jun 23, 2026

functionstackx approved these changes Jun 23, 2026

cquil11 merged commit 94d4968 into main Jun 23, 2026
29 checks passed

cquil11 deleted the nv/dsr1-fp4-v2 branch June 23, 2026 22:51

claude Bot mentioned this pull request Jun 24, 2026

[NV] Refresh Minimax M3 FP8 submission with new recipes for GB300 #1925

Open

8 participants
