optimize TTFT qwen3-vl #1006
Speeds up the Triton W4A16 skinny prefill GEMM on gfx11x (RDNA3/3.5). The prefill dequant inner loop was VALU-issue-bound; this cuts the dequant instruction count and, for asymmetric layers, folds the per-group scale + zero-point into a single load. ~14-15% lower TTFT on a W4A16 model with no accuracy change. Pure Python/Triton; no C++/HIP rebuild.

Changes:

- Packed int4->fp16 dequant (_i4_and_or_magic): one v_and_or_b32 dequants two nibbles into fp16 (1024+n) per instruction (the i4_to_half magic trick), replacing the scalar v_and_b16/v_or_b16 pair. The kernel selects the packed path itself at JIT time -- fp16 AND RDNA gfx11/gfx12, via a tl.target_info.constexpr_function (_target_is_gfx1x); no host flag. Everything else uses the scalar unpack.
- Distilled gfx11 tile table (_select_skinny_gfx11_config): BLOCK_N=256, num_warps=8, BLOCK_K=32, BLOCK_M 64/128 by K and N. Other arches unchanged.
- Asymmetric layers: packed_scale_zp carrier (one fp32 per (n, group)) folds the per-group scale and zero-point offset into a single load. fp16 packs scale|bias_eff and consumes it with one v_pk_fma_f16; bf16 packs scale|zp_int and keeps the int-domain subtract (RDNA3 has no v_pk_fma_bf16). Materialised only for asym layers; the kernel's HAS_ZP constexpr selects carrier vs scales.
- Symmetric layers keep a dedicated fast path (scales + constant -8 offset): sym has no second load to fold, so the carrier would be pure overhead (measured ~+8% on fp16 sym, up to +22% on o_proj).
- perf test: exercises the carrier for asym providers, adds Qwen3-1.7B and Gemma3-4B prefill shapes; gfx1151 golden regenerated.

Measured on gfx1151 (Qwen3-4B-AWQ, fp16, input 3968 / output 1): TTFT 1436 -> 1234 ms (-14.1%). bf16 asym carrier -4.8% (do_bench). Carrier dequant verified numerically (fp16 <1e-3 rel error; bf16 bit-identical).
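For readers unfamiliar with the i4_to_half magic trick, here is a minimal NumPy sketch of the idea, not the Triton kernel code. It relies on the fact that the fp16 bit pattern 0x6400 is 1024.0 and that one ulp at 1024 is exactly 1, so OR-ing a nibble into the low mantissa bits yields 1024+n. The names `dequant_pair` and `unpack_word_symmetric` are hypothetical, and the nibble ordering and the way the constant -8 offset is applied are assumptions made for illustration.

```python
import numpy as np

MAGIC = np.uint32(0x64006400)  # two fp16 lanes, each holding 1024.0
MASK = np.uint32(0x000F000F)   # low nibble of each 16-bit lane


def dequant_pair(word, shift):
    """One AND + OR on a 32-bit word gives two fp16 values 1024 + n.

    The nibbles used are the ones at bit `shift` and bit `shift + 16`.
    """
    packed = ((np.uint32(word) >> np.uint32(shift)) & MASK) | MAGIC
    # Reinterpret the 32-bit pattern as two little-endian fp16 lanes.
    return np.array([packed], dtype="<u4").view("<f2")


def unpack_word_symmetric(word, scale):
    """Dequantize all eight nibbles of `word` with a constant -8 offset.

    Assumption: the offset is removed as (1024 + n) - 1032 = n - 8,
    which is exact in fp16, and the scale is applied afterwards.
    Output order is interleaved: n0, n4, n1, n5, n2, n6, n3, n7.
    """
    out = []
    for shift in (0, 4, 8, 12):
        h = dequant_pair(word, shift)  # fp16 values 1024 + n
        out.append((h - np.float16(1032.0)) * np.float16(scale))
    return np.concatenate(out)


if __name__ == "__main__":
    print(dequant_pair(0x76543210, 0))        # lanes for nibbles 0 and 4
    print(unpack_word_symmetric(0x76543210, 0.5))
```

The point of the packed form is that each AND/OR pair handles two lanes at once, which is what lets a single v_and_or_b32 replace the per-lane v_and_b16/v_or_b16 pair.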
Verified: tests/kernels/quantization/test_hybrid_w4a16_perf.py 111/112 pass -- the one failure is a pre-existing flaky tiny-M wvSplitK_int4 decode cell (unchanged HIP kernel), unrelated to this change. ruff + mypy clean.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: Xu Qing <qing.xu2@amd.com>
With Qwen3-VL-4B-Instruct-AWQ-4bit-lm_head_int8, triton_w4a16_skinny_fmt_kernel accounts for more than 60% of prefill latency.

Input: 2 448x448 images + 256 tokens, so M is ~660 in prefill.

Hot shapes (M x N x K):
- 660 x 19456 x 2560
- 660 x 2560 x 9728
- 660 x 6144 x 2560
- 660 x 2560 x 4096

num_warps=8 is currently the best setting, so the kernel tile for these shapes is BLOCK_M=64, BLOCK_N=256, BLOCK_K=32, num_warps=8.

Signed-off-by: Xu Qing <qing.xu2@amd.com>
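To make the tile choice concrete, the following sketch works out how many output tiles each hot shape produces with BLOCK_M=64 and BLOCK_N=256, and how many padded rows the last M tile carries. It is plain arithmetic on the numbers in the commit message; `tile_grid` and `padded_rows` are hypothetical helper names, and reading each shape as M x N x K is an assumption.

```python
BLOCK_M, BLOCK_N = 64, 256
M = 660
# (N, K) pairs taken from the hot shapes listed above.
HOT_SHAPES = [(19456, 2560), (2560, 9728), (6144, 2560), (2560, 4096)]


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def tile_grid(m, n, block_m=BLOCK_M, block_n=BLOCK_N):
    """Number of tiles along M and N for an m x n output."""
    return ceil_div(m, block_m), ceil_div(n, block_n)


def padded_rows(m, block_m=BLOCK_M):
    """Rows in the last M tile that fall outside the real matrix."""
    return ceil_div(m, block_m) * block_m - m


if __name__ == "__main__":
    for n, k in HOT_SHAPES:
        tiles_m, tiles_n = tile_grid(M, n)
        print(f"N={n:5d} K={k:5d}: {tiles_m} x {tiles_n} = {tiles_m * tiles_n} tiles")
    print(f"padded rows at M={M}: {padded_rows(M)}")
```

With M=660 the grid has 11 tiles along M and 44 padded rows, so about 6% of the row work in the M dimension is padding at BLOCK_M=64.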
Add a Qwen3-VL-4B prefill shape guard in Triton unified attention on gfx11 and apply BM64/T32/W4/S1/EU4 defaults to reduce TTFT. Also fix prefill tile override wiring so VLLM_UA_PREFILL_TILE_SIZE is honored instead of being overwritten by the default path.

Signed-off-by: Xu Qing <qing.xu2@amd.com>
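The override fix amounts to a precedence rule: an explicit VLLM_UA_PREFILL_TILE_SIZE wins, and the shape-guarded default applies only when the variable is unset. A minimal sketch of that rule follows; `resolve_prefill_tile_size` is a hypothetical helper and not the function in the patch, and treating "T32" as a default tile size of 32 is an assumption.

```python
import os

# Assumption: "T32" in the defaults above means a prefill tile size of 32.
DEFAULT_PREFILL_TILE_SIZE = 32


def resolve_prefill_tile_size(environ=os.environ,
                              default=DEFAULT_PREFILL_TILE_SIZE):
    """Return the user override if set, otherwise the default.

    The bug class being avoided: computing the default after reading
    the override and unconditionally assigning it.
    """
    raw = environ.get("VLLM_UA_PREFILL_TILE_SIZE")
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


if __name__ == "__main__":
    print(resolve_prefill_tile_size())
```

Passing the environment in as a parameter keeps the rule easy to unit-test without mutating the process environment.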
33478ea to db8ba35
This hard-codes the kernel tile to optimize TTFT for qwen3-vl-4b. The target input is 2 448x448 images + 256 tokens, which comes to roughly 670 input tokens.
The change builds on #985 by Matthias Gehre <matthias.gehre@amd.com>.