Add dpa for dsr1 fp4 #28954
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
}
return generateCommandFromConfig(config);
const isB200Fp4 = vals.hardware === 'b200' && vals.quantization === 'fp4';
This keys on hardware+quantization only, so DP attention is also force-enabled for the low-latency configs (concurrency 4–8), which the prior doc text scoped to high-throughput. Could you confirm against the referenced InferenceX PR (SemiAnalysisAI/InferenceX#1792) whether this recipe was validated for the low-latency scenario, or whether it should be gated on vals.scenario === 'high-throughput'?
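The scenario gating raised in this comment could look roughly like the sketch below. It assumes a `Vals` shape and a `shouldEnableDpAttention` helper, both hypothetical, and an assumed `'low-latency'` scenario value. Only `vals.hardware`, `vals.quantization`, and `vals.scenario === 'high-throughput'` come from the diff and the comment itself.

```typescript
// Hypothetical shape of the selector values. The field names follow the diff,
// but the real type in the generator file may differ.
interface Vals {
  hardware: string;
  quantization: string;
  scenario: string; // 'high-throughput' is from the review; 'low-latency' is an assumed value
}

// Enable DP attention only for B200 FP4 in the high-throughput scenario,
// instead of keying on hardware + quantization alone.
function shouldEnableDpAttention(vals: Vals): boolean {
  const isB200Fp4 = vals.hardware === 'b200' && vals.quantization === 'fp4';
  return isB200Fp4 && vals.scenario === 'high-throughput';
}
```

Whether this gate is actually wanted depends on what the referenced InferenceX recipe validated. If the low-latency configs were validated too, the original hardware + quantization check is already correct.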
continue;
}
if (enableDpAttention && key === 'tensor_parallel_size') {
This special-case diverges from the file's config-driven pattern, where commands render purely from each config's parameters + fieldToFlag. Consider moving these flags into the 4 b200-fp4 config blocks and adding the missing fieldToFlag entries (enable_dp_attention, enable_dp_attention_local_control_broadcast, enable_dp_lm_head) — that also makes per-scenario gating trivial.
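A minimal sketch of the config-driven alternative suggested here. The `renderFlags` helper, the `ParamValue` type, and the `b200Fp4HighThroughput` block are hypothetical, and the real `fieldToFlag` table and config layout may differ. Only the `data_parallel_size` → `'dp'` mapping and the three proposed entries are taken from the review thread. The `tp` mapping is assumed from the short forms the generated command already uses.

```typescript
type ParamValue = string | number | boolean;

// Assumed mapping table. The three dp-attention entries are the ones the
// comment proposes adding; the flag spellings match the doc text in this PR.
const fieldToFlag: Record<string, string> = {
  tensor_parallel_size: 'tp',
  data_parallel_size: 'dp',
  enable_dp_attention: 'enable-dp-attention',
  enable_dp_attention_local_control_broadcast: 'enable-dp-attention-local-control-broadcast',
  enable_dp_lm_head: 'enable-dp-lm-head',
};

// Hypothetical renderer: booleans become bare switches, everything else
// becomes "--flag value". No hardware-specific branches are needed.
function renderFlags(parameters: Record<string, ParamValue>): string {
  let command = '';
  for (const [key, value] of Object.entries(parameters)) {
    const flag = fieldToFlag[key];
    if (flag === undefined || value === false) {
      continue; // unmapped field or disabled switch: emit nothing
    }
    command += value === true ? ` \\\n  --${flag}` : ` \\\n  --${flag} ${value}`;
  }
  return command;
}

// Illustrative parameters for one b200-fp4 config block (placeholder values).
const b200Fp4HighThroughput: Record<string, ParamValue> = {
  tensor_parallel_size: 8,
  data_parallel_size: 8,
  enable_dp_attention: true,
  enable_dp_attention_local_control_broadcast: true,
  enable_dp_lm_head: true,
};

const rendered = renderFlags(b200Fp4HighThroughput);
```

With the flags living in the config blocks, a low-latency block can simply omit the dp-attention fields, so per-scenario gating needs no extra code path.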
if (enableDpAttention && key === 'tensor_parallel_size') {
  command +=
    ` \\\n --tensor-parallel-size ${value}` +
These emit long-form --tensor-parallel-size / --data-parallel-size, while the rest of the generated command uses short forms (--tp, --ep-size) and fieldToFlag already maps data_parallel_size → 'dp'. Suggest --tp / --dp for copy-paste consistency.
if (enableDpAttention) {
  command +=
    ' \\\n --schedule-conservativeness 3.33' +
3.33 is a calibrated value: when dp-attention is on, server_args.py applies schedule_conservativeness *= 0.3, so 3.33 × 0.3 ≈ 1.0 (the default). Worth a short comment so it isn't later "rounded" to an integer and silently dropped to 0.3 effective.
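The arithmetic behind this comment can be checked with a small sketch. The `effectiveConservativeness` helper and the `DP_ATTENTION_SCALE` constant are hypothetical names. The 0.3 factor and the default of 1.0 are taken from the comment above, not verified here against `server_args.py`.

```typescript
// Scaling factor that, per the review comment, is applied to
// schedule_conservativeness when dp-attention is enabled.
const DP_ATTENTION_SCALE = 0.3;

// Hypothetical helper: the effective value after that server-side scaling.
function effectiveConservativeness(flagValue: number, dpAttention: boolean): number {
  return dpAttention ? flagValue * DP_ATTENTION_SCALE : flagValue;
}

// 3.33 is chosen so the scaled value lands back near the default of 1.0.
const calibrated = effectiveConservativeness(3.33, true); // roughly 0.999

// Leaving the flag at its default of 1.0 with dp-attention on yields 0.3.
const uncalibrated = effectiveConservativeness(1.0, true); // roughly 0.3
```

A one-line comment next to the `3.33` literal in the generator, stating that it compensates for the 0.3 multiplier, would record this intent where the value is defined.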
**Data Parallelism Attention (`--enable-dp-attention`):** Recommended for high-throughput scenarios. Use `--enable-dp-attention --tp 8 --dp 8` on a single 8-GPU node.
**Data Parallelism Attention (`--enable-dp-attention`):** Recommended for high-throughput scenarios. For B200 FP4, the command generator enables DP Attention automatically and adds `--data-parallel-size <TP>`, `--enable-dp-attention-local-control-broadcast`, `--enable-dp-lm-head`, `--schedule-conservativeness 3.33`, and `--enable-prefill-delayer`.
This rewrite drops the previous general --enable-dp-attention --tp 8 --dp 8 hint that also applied to other hardware (e.g. H200 high-throughput). If that's still recommended there, consider keeping a one-line general note alongside the B200-FP4-specific text.
Motivation
For better performance, add DP attention for FP4 B200.
Reference: SemiAnalysisAI/InferenceX#1792
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
CI States
Latest PR Test (Base): ✅ Run #28060588989
Latest PR Test (Extra): ❌ Run #28060588854