Add dpa for dsr1 fp4 #28954

Open
faradawn wants to merge 7 commits into sgl-project:main from faradawn:add-dpa-for-dsr1

faradawn (Contributor) commented Jun 22, 2026

Motivation

Add DP attention for B200 FP4 to improve performance.

Reference: SemiAnalysisAI/InferenceX#1792

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ✅ Run #28060588989
Latest PR Test (Extra): ❌ Run #28060588854

github-actions bot added the documentation and deepseek labels Jun 22, 2026
}

return generateCommandFromConfig(config);
const isB200Fp4 = vals.hardware === 'b200' && vals.quantization === 'fp4';

Collaborator

This keys on hardware+quantization only, so DP attention is also force-enabled for the low-latency configs (concurrency 4–8), which the prior doc text scoped to high-throughput. Could you confirm against the referenced InferenceX PR (SemiAnalysisAI/InferenceX#1792) whether this recipe was validated for the low-latency scenario, or whether it should be gated on vals.scenario === 'high-throughput'?
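The gating suggested in this comment could look roughly like the sketch below. The helper name `shouldEnableDpAttention` and the `'low-latency'` scenario value are hypothetical; only `vals.hardware`, `vals.quantization`, and `vals.scenario === 'high-throughput'` come from this thread. Whether the recipe should in fact be limited to high-throughput is the open question, so this is an illustration, not the PR's code.

```javascript
// Hypothetical helper: decides whether the generator should turn on DP attention.
// Field names mirror those used in this thread; the 'low-latency' value is an assumption.
function shouldEnableDpAttention(vals) {
  const isB200Fp4 = vals.hardware === 'b200' && vals.quantization === 'fp4';
  const isHighThroughput = vals.scenario === 'high-throughput';
  // Gate on the scenario as well, so low-latency configs keep their current command.
  return isB200Fp4 && isHighThroughput;
}

console.log(
  shouldEnableDpAttention({ hardware: 'b200', quantization: 'fp4', scenario: 'high-throughput' })
);
console.log(
  shouldEnableDpAttention({ hardware: 'b200', quantization: 'fp4', scenario: 'low-latency' })
);
```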

continue;
}

if (enableDpAttention && key === 'tensor_parallel_size') {

Collaborator

This special-case diverges from the file's config-driven pattern, where commands render purely from each config's parameters + fieldToFlag. Consider moving these flags into the 4 b200-fp4 config blocks and adding the missing fieldToFlag entries (enable_dp_attention, enable_dp_attention_local_control_broadcast, enable_dp_lm_head) — that also makes per-scenario gating trivial.
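A minimal sketch of the config-driven alternative, under stated assumptions: the shape of `fieldToFlag`, the `parameters` object, the `renderFlags` helper, and the handling of boolean flags are all hypothetical, since the file itself is not shown here. Only the field names and the `data_parallel_size` → `'dp'` mapping are taken from this thread; the values 8 and 8 are illustrative.

```javascript
// Assumed flag map. Only data_parallel_size -> 'dp' is confirmed in this thread.
const fieldToFlag = {
  tensor_parallel_size: 'tp',
  data_parallel_size: 'dp',
  enable_dp_attention: 'enable-dp-attention',
  enable_dp_attention_local_control_broadcast: 'enable-dp-attention-local-control-broadcast',
  enable_dp_lm_head: 'enable-dp-lm-head',
};

// Hypothetical config block for one B200 FP4 scenario.
const b200Fp4HighThroughput = {
  parameters: {
    tensor_parallel_size: 8,
    data_parallel_size: 8,
    enable_dp_attention: true,
    enable_dp_attention_local_control_broadcast: true,
    enable_dp_lm_head: true,
  },
};

// Render flags purely from the config: booleans become bare flags,
// other values become "--flag value". Unknown or false fields are skipped.
function renderFlags(config) {
  let command = '';
  for (const [key, value] of Object.entries(config.parameters)) {
    const flag = fieldToFlag[key];
    if (flag === undefined || value === false) continue;
    command += value === true ? ` \\\n  --${flag}` : ` \\\n  --${flag} ${value}`;
  }
  return command;
}

console.log(renderFlags(b200Fp4HighThroughput));
```

With this layout, per-scenario gating reduces to including or omitting the DP attention fields in each config block, and no hardware-specific branch is needed in the renderer.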


if (enableDpAttention && key === 'tensor_parallel_size') {
command +=
` \\\n --tensor-parallel-size ${value}` +

Collaborator

These emit long-form --tensor-parallel-size / --data-parallel-size, while the rest of the generated command uses short forms (--tp, --ep-size) and fieldToFlag already maps data_parallel_size → 'dp'. Suggest --tp / --dp for copy-paste consistency.


if (enableDpAttention) {
command +=
' \\\n --schedule-conservativeness 3.33' +

Collaborator

3.33 is a calibrated value: when dp-attention is on, server_args.py applies schedule_conservativeness *= 0.3, so 3.33 × 0.3 ≈ 1.0 (the default). Worth a short comment so it isn't later "rounded" to an integer or removed, which would silently shift the effective value away from 1.0 (down to 0.3 if the flag falls back to the default).
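The arithmetic behind this comment can be checked with a small sketch. The 0.3 factor is taken from the comment's description of server_args.py and is treated as an assumption here; `DP_ATTENTION_SCALE` and `effectiveConservativeness` are hypothetical names used only for illustration.

```javascript
// Scaling factor as described in the review comment; not verified against server_args.py.
const DP_ATTENTION_SCALE = 0.3;

// Effective schedule conservativeness after the DP attention adjustment.
function effectiveConservativeness(value, dpAttentionEnabled) {
  return dpAttentionEnabled ? value * DP_ATTENTION_SCALE : value;
}

// Compare the calibrated value, an integer-rounded value, and the default of 1.0.
for (const value of [3.33, 3, 1.0]) {
  console.log(value, effectiveConservativeness(value, true).toFixed(3));
}
```

Under that assumption, 3.33 lands at roughly 1.0, an integer 3 at roughly 0.9, and the default 1.0 at roughly 0.3.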

- **Data Parallelism Attention (`--enable-dp-attention`):** Recommended for high-throughput scenarios. Use `--enable-dp-attention --tp 8 --dp 8` on a single 8-GPU node.
+ **Data Parallelism Attention (`--enable-dp-attention`):** Recommended for high-throughput scenarios. For B200 FP4, the command generator enables DP Attention automatically and adds `--data-parallel-size <TP>`, `--enable-dp-attention-local-control-broadcast`, `--enable-dp-lm-head`, `--schedule-conservativeness 3.33`, and `--enable-prefill-delayer`.

Collaborator

This rewrite drops the previous general --enable-dp-attention --tp 8 --dp 8 hint that also applied to other hardware (e.g. H200 high-throughput). If that's still recommended there, consider keeping a one-line general note alongside the B200-FP4-specific text.

faradawn added 2 commits June 23, 2026 11:33
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>