optimize TTFT qwen3-vl by qingxuamd · Pull Request #1006 · ROCm/vllm

qingxuamd · 2026-06-15T15:27:00Z

Hard-codes the kernel tile configuration to optimize TTFT for qwen3-vl-4b. The target input is two 448x448 images plus 256 text tokens, which comes to roughly 670 input tokens.
The modification is based on #985 by Matthias Gehre <matthias.gehre@amd.com>.

Speeds up the Triton W4A16 skinny prefill GEMM on gfx11x (RDNA3/3.5). The prefill dequant inner loop was VALU-issue-bound; this cuts the dequant instruction count and, for asymmetric layers, folds the per-group scale + zero-point into a single load. ~14-15% lower TTFT on a W4A16 model with no accuracy change. Pure Python/Triton; no C++/HIP rebuild.

Changes:

- Packed int4->fp16 dequant (_i4_and_or_magic): one v_and_or_b32 dequants two nibbles into fp16 (1024+n) per instruction (the i4_to_half magic trick), replacing the scalar v_and_b16/v_or_b16 pair. The kernel selects the packed path itself at JIT time (fp16 AND RDNA gfx11/gfx12) via a tl.target_info.constexpr_function (_target_is_gfx1x); no host flag. Everything else uses the scalar unpack.
- Distilled gfx11 tile table (_select_skinny_gfx11_config): BLOCK_N=256, num_warps=8, BLOCK_K=32, BLOCK_M 64/128 by K and N. Other arches unchanged.
- Asymmetric layers: packed_scale_zp carrier (one fp32 per (n, group)) folds the per-group scale and zero-point offset into a single load. fp16 packs scale|bias_eff and consumes it with one v_pk_fma_f16; bf16 packs scale|zp_int and keeps the int-domain subtract (RDNA3 has no v_pk_fma_bf16). Materialised only for asym layers; the kernel's HAS_ZP constexpr selects carrier vs scales.
- Symmetric layers keep a dedicated fast path (scales + constant -8 offset): sym has no second load to fold, so the carrier would be pure overhead (measured ~+8% on fp16 sym, up to +22% on o_proj).
- perf test: exercises the carrier for asym providers, adds Qwen3-1.7B and Gemma3-4B prefill shapes; gfx1151 golden regenerated.

Measured on gfx1151 (Qwen3-4B-AWQ, fp16, input 3968 / output 1): TTFT 1436 -> 1234 ms (-14.1%). bf16 asym carrier -4.8% (do_bench). Carrier dequant verified numerically (fp16 <1e-3 rel error; bf16 bit-identical).
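The packed int4->fp16 "magic" dequant described above can be illustrated on the host with NumPy. This is only a sketch of the bit trick, not the kernel's code: the function name `unpack_int4_magic`, the nibble ordering, and the 8-nibbles-per-32-bit-word layout are assumptions made for the illustration. The idea is that OR-ing a 4-bit value `n` into the mantissa of the fp16 constant 1024.0 (bit pattern `0x6400`) produces exactly the fp16 value `1024 + n`, so a single 32-bit AND-OR yields two dequantized lanes at once.

```python
import numpy as np

NIBBLE_MASK = np.uint32(0x000F000F)  # keeps one nibble in each 16-bit lane
MAGIC_1024 = np.uint32(0x64006400)   # fp16 bit pattern of 1024.0 in each lane


def unpack_int4_magic(words):
    """Dequantize 8 unsigned int4 values per 32-bit word to fp16.

    Assumed layout (hypothetical): nibble k of a word sits at bits 4k..4k+3.
    """
    words = np.ascontiguousarray(words, dtype="<u4")
    pairs = []
    for shift in (0, 4, 8, 12):
        # One AND-OR per word produces two fp16 lanes holding 1024 + n.
        lanes = ((words >> np.uint32(shift)) & NIBBLE_MASK) | MAGIC_1024
        halves = lanes.astype("<u4").view("<f2").reshape(-1, 2)
        # Removing the 1024 offset is exact in fp16 for n in 0..15.
        pairs.append(halves - np.float16(1024.0))
    stacked = np.stack(pairs)  # shape (4, num_words, 2): [shift, word, lane]
    # Lane 0 holds nibble shift/4, lane 1 holds nibble 4 + shift/4.
    return stacked.transpose(1, 2, 0).reshape(-1, 8)


words = np.array([0x76543210, 0xFEDCBA98], dtype=np.uint32)
values = unpack_int4_magic(words)
```

For a symmetric layer, the constant -8 offset mentioned above could be folded into the same subtraction (subtracting 1032 instead of 1024), again as an assumption about how the offset is applied.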
Verified: tests/kernels/quantization/test_hybrid_w4a16_perf.py 111/112 pass. The one failure is a pre-existing flaky tiny-M wvSplitK_int4 decode cell (unchanged HIP kernel), unrelated to this change. ruff + mypy clean.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: Xu Qing <qing.xu2@amd.com>
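The asymmetric-layer carrier from the description (scale and zero-point folded into one 32-bit load) can be sketched in NumPy as follows. This is a hedged illustration only: `pack_scale_bias` and `dequant_with_carrier` are hypothetical names, `bias_eff` is assumed here to be `-zp * scale` (the PR's exact definition, which may also absorb the 1024 magic offset, is not shown in this description), and the carrier is kept as `uint32` rather than reinterpreted as fp32. The sketch rewrites `(q - zp) * scale` as `q * scale + bias_eff`, which is the form a fused multiply-add can consume.

```python
import numpy as np


def pack_scale_bias(scale, zp):
    """Pack fp16 scale (low 16 bits) and fp16 bias_eff (high 16 bits) per group."""
    scale = np.asarray(scale, dtype=np.float16)
    # Assumption: bias_eff = -zp * scale, rounded once to fp16.
    bias_eff = (-np.asarray(zp, dtype=np.float32)
                * scale.astype(np.float32)).astype(np.float16)
    lo = scale.view(np.uint16).astype(np.uint32)
    hi = bias_eff.view(np.uint16).astype(np.uint32)
    return (hi << np.uint32(16)) | lo


def dequant_with_carrier(q, carrier):
    """q: (groups, group_size) ints in 0..15; carrier: (groups,) uint32."""
    scale = (carrier & np.uint32(0xFFFF)).astype(np.uint16).view(np.float16)
    bias = (carrier >> np.uint32(16)).astype(np.uint16).view(np.float16)
    # One multiply-add per element; NumPy rounds after each step, a real FMA once.
    return q.astype(np.float16) * scale[:, None] + bias[:, None]


scale = np.array([0.0123, 0.02], dtype=np.float16)
zp = np.array([7, 3])
q = np.tile(np.arange(16), (2, 1))

carrier = pack_scale_bias(scale, zp)
w = dequant_with_carrier(q, carrier)
reference = (q - zp[:, None]) * scale.astype(np.float32)[:, None]
max_abs_err = np.abs(w.astype(np.float32) - reference).max()
```

The symmetric fast path has no per-group zero-point to fold, which is consistent with the description's point that a carrier would only add overhead there.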

github-actions · 2026-06-15T15:27:58Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

With the Qwen3-VL-4B-Instruct-AWQ-4bit-lm_head_int8 model, triton_w4a16_skinny_fmt_kernel accounts for more than 60% of prefill latency.

Input: 2 448x448 images + 256 tokens, so M is ~660 in prefill.

Hot shapes (M x N x K):

- 660 x 19456 x 2560
- 660 x 2560 x 9728
- 660 x 6144 x 2560
- 660 x 2560 x 4096

num_warps=8 is currently the best setting, so the kernel tile for GEMMs with M=660 and the N/K values above is:

- BLOCK_M=64
- BLOCK_N=256
- BLOCK_K=32
- num_warps=8

Signed-off-by: Xu Qing <qing.xu2@amd.com>
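To make the tile arithmetic for these hot shapes concrete, here is a small sketch that counts output tiles and padded rows for a given BLOCK_M/BLOCK_N. It assumes the usual Triton GEMM launch of one program per (BLOCK_M x BLOCK_N) output tile, which this commit message does not state; `cdiv` and `tile_stats` are hypothetical helper names.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def tile_stats(m: int, n: int, block_m: int, block_n: int) -> tuple[int, int]:
    """Return (number of output tiles, M rounded up to a multiple of block_m)."""
    tiles_m = cdiv(m, block_m)
    tiles_n = cdiv(n, block_n)
    return tiles_m * tiles_n, tiles_m * block_m


# (M, N) pairs taken from the hot shapes listed above.
hot_shapes = [(660, 19456), (660, 2560), (660, 6144), (660, 2560)]
stats_bm64 = [tile_stats(m, n, 64, 256) for m, n in hot_shapes]
stats_bm128 = [tile_stats(m, n, 128, 256) for m, n in hot_shapes]
```

With BLOCK_M=64, M=660 rounds up to 704 rows (11 row tiles); with BLOCK_M=128 it rounds up to 768 rows (6 row tiles). The sketch only shows the padding arithmetic and does not by itself explain the measured preference for BLOCK_M=64.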

Add a Qwen3-VL-4B prefill shape guard in Triton unified attention on gfx11 and apply BM64/T32/W4/S1/EU4 defaults to reduce TTFT. Also fix prefill tile override wiring so VLLM_UA_PREFILL_TILE_SIZE is honored instead of being overwritten by the default path.

Signed-off-by: Xu Qing <qing.xu2@amd.com>
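The override fix in this commit amounts to letting an explicit VLLM_UA_PREFILL_TILE_SIZE win over the shape-guard default. A minimal sketch of that precedence is shown below; `resolve_prefill_tile_size` is a hypothetical helper, not a function from the PR, and reading T32 in the defaults string as a tile size of 32 is an assumption.

```python
import os
from collections.abc import Mapping


def resolve_prefill_tile_size(default_tile: int,
                              env: Mapping[str, str] = os.environ) -> int:
    """Return the env override if set, otherwise the shape-guard default."""
    raw = env.get("VLLM_UA_PREFILL_TILE_SIZE", "").strip()
    if not raw:
        # No override requested: keep the default chosen by the shape guard.
        return default_tile
    # An explicit override takes precedence over the default path.
    return int(raw)


tile_default = resolve_prefill_tile_size(32, env={})
tile_override = resolve_prefill_tile_size(32, env={"VLLM_UA_PREFILL_TILE_SIZE": "64"})
```

Resolving the override last, after the default has been computed, is one way to avoid the bug described above, where the default path overwrote the user-supplied value.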

qingxuamd requested a review from AndreasKaratzas as a code owner June 15, 2026 15:27

mgehre-amd requested review from mgehre-amd and removed request for AndreasKaratzas June 15, 2026 17:44

mgehre-amd changed the title ~~optimize TTFT qwen3-vl kernel tile specifically for China GAC customer~~ optimize TTFT qwen3-vl Jun 15, 2026

qingxuamd added 2 commits June 17, 2026 09:51

qingxuamd force-pushed the qingxu/qwen3-vl-optimize branch from 33478ea to db8ba35 Compare June 17, 2026 01:58

qingxuamd wants to merge 3 commits into gfx11 from qingxu/qwen3-vl-optimize
