
optimize TTFT qwen3-vl #1006

Open
qingxuamd wants to merge 3 commits into gfx11 from qingxu/qwen3-vl-optimize


@qingxuamd qingxuamd commented Jun 15, 2026


Hard-codes the kernel tile to optimize TTFT for qwen3-vl-4b. The target input is two 448x448 images plus 256 text tokens, i.e. roughly 670 input tokens in total.
This change builds on #985 by Matthias Gehre <matthias.gehre@amd.com>.

Speeds up the Triton W4A16 skinny prefill GEMM on gfx11x (RDNA3/3.5). The
prefill dequant inner loop was VALU-issue-bound; this cuts the dequant
instruction count and, for asymmetric layers, folds the per-group scale +
zero-point into a single load. ~14-15% lower TTFT on a W4A16 model with no
accuracy change. Pure Python/Triton; no C++/HIP rebuild.

Changes:
- Packed int4->fp16 dequant (_i4_and_or_magic): one v_and_or_b32 dequants two
  nibbles into fp16 (1024+n) per instruction (the i4_to_half magic trick),
  replacing the scalar v_and_b16/v_or_b16 pair. The kernel selects the packed
  path itself at JIT time -- fp16 AND RDNA gfx11/gfx12, via a
  tl.target_info.constexpr_function (_target_is_gfx1x); no host flag. Everything
  else uses the scalar unpack.
- Distilled gfx11 tile table (_select_skinny_gfx11_config): BLOCK_N=256,
  num_warps=8, BLOCK_K=32, BLOCK_M 64/128 by K and N. Other arches unchanged.
- Asymmetric layers: packed_scale_zp carrier (one fp32 per (n, group)) folds the
  per-group scale and zero-point offset into a single load. fp16 packs
  scale|bias_eff and consumes it with one v_pk_fma_f16; bf16 packs scale|zp_int
  and keeps the int-domain subtract (RDNA3 has no v_pk_fma_bf16). Materialised
  only for asym layers; the kernel's HAS_ZP constexpr selects carrier vs scales.
- Symmetric layers keep a dedicated fast path (scales + constant -8 offset): sym
  has no second load to fold, so the carrier would be pure overhead (measured
  ~+8% on fp16 sym, up to +22% on o_proj).
- perf test: exercises the carrier for asym providers, adds Qwen3-1.7B and
  Gemma3-4B prefill shapes; gfx1151 golden regenerated.
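
The packed int4->fp16 step described in the first bullet can be illustrated on the host with NumPy. This is only a sketch of the general i4_to_half magic trick, not the kernel's code: `unpack_pair_magic` is a hypothetical name, the nibble order is an assumption, and the `view` call assumes a little-endian host. OR-ing a 4-bit value `n` into the mantissa of fp16 `1024.0` (bit pattern `0x6400`) yields exactly `1024 + n`, because the fp16 spacing at 1024 is 1. Doing it on a 32-bit word handles two fp16 lanes with one AND and one OR.

```python
import numpy as np

MASK = np.uint32(0x000F000F)   # keeps one nibble in each 16-bit lane
MAGIC = np.uint32(0x64006400)  # fp16 1024.0 in both 16-bit lanes


def unpack_pair_magic(words: np.ndarray, shift: int) -> np.ndarray:
    """Return fp16 values (1024 + n) for the two nibbles at `shift` and `shift + 16`.

    `words` is a 1-D (or higher) uint32 array; the result has a trailing axis of 2.
    """
    words = np.ascontiguousarray(words, dtype=np.uint32)
    lanes = ((words >> np.uint32(shift)) & MASK) | MAGIC
    # Reinterpret each 32-bit word as two fp16 lanes (low half first on little-endian).
    return lanes.view(np.float16).reshape(words.shape + (2,))


word = np.array([0x76543210], dtype=np.uint32)  # nibbles 0..7, lowest nibble first
pair0 = unpack_pair_magic(word, 0)  # nibbles 0 and 4
pair1 = unpack_pair_magic(word, 4)  # nibbles 1 and 5
print(pair0 - 1024, pair1 - 1024)
```

Subtracting 1024 (or folding that offset into the scale and zero-point term) recovers the integer nibble values as fp16.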

Measured on gfx1151 (Qwen3-4B-AWQ, fp16, input 3968 / output 1): TTFT
1436 -> 1234 ms (-14.1%). bf16 asym carrier -4.8% (do_bench). Carrier dequant
verified numerically (fp16 <1e-3 rel error; bf16 bit-identical).
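
The idea of folding scale and zero-point into one 32-bit load can be sketched in NumPy as follows. The commit message does not spell out how `bias_eff` is defined or which half of the word holds which value, so this sketch assumes `bias = -zp * scale` applied to the raw nibble value, with the scale in the low half; `pack_scale_bias` and `dequant_with_carrier` are hypothetical names, and the `view` calls assume a little-endian host.

```python
import numpy as np


def pack_scale_bias(scale: float, zp: int) -> np.uint32:
    """Pack fp16 scale (low half) and fp16 bias = -zp * scale (high half) into one word."""
    s = np.float16(scale)
    b = np.float16(-float(zp) * float(s))  # assumed bias definition, see lead-in
    return np.array([s, b], dtype=np.float16).view(np.uint32)[0]


def dequant_with_carrier(nibbles: np.ndarray, carrier: np.uint32) -> np.ndarray:
    """Dequantize int4 values with one multiply-add per element: w = n * scale + bias."""
    s, b = np.array([carrier], dtype=np.uint32).view(np.float16)
    # float32 math followed by one rounding to fp16 approximates a fused multiply-add.
    w = nibbles.astype(np.float32) * np.float32(s) + np.float32(b)
    return w.astype(np.float16)


carrier = pack_scale_bias(0.0123, zp=7)
weights = dequant_with_carrier(np.arange(16), carrier)
```

Under these assumptions the result agrees with the reference `(n - zp) * scale` up to fp16 rounding of the bias and of the final value.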

Verified: tests/kernels/quantization/test_hybrid_w4a16_perf.py 111/112 pass --
the one failure is a pre-existing flaky tiny-M wvSplitK_int4 decode cell
(unchanged HIP kernel), unrelated to this change. ruff + mypy clean.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: Xu Qing <qing.xu2@amd.com>
@github-actions


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mgehre-amd mgehre-amd requested review from mgehre-amd and removed request for AndreasKaratzas June 15, 2026 17:44
@mgehre-amd mgehre-amd changed the title from "optimize TTFT qwen3-vl kernel tile specifically for China GAC customer" to "optimize TTFT qwen3-vl" Jun 15, 2026
With the Qwen3-VL-4B-Instruct-AWQ-4bit-lm_head_int8 model,
triton_w4a16_skinny_fmt_kernel accounts for more than 60% of prefill
latency. Input: two 448x448 images + 256 tokens.

M: ~660 (two 448x448 images + 256 tokens, prefill)
Hot shapes (M x N x K):
660 x 19456 x 2560
660 x 2560 x 9728
660 x 6144 x 2560
660 x 2560 x 4096
num_warps: 8 is currently the best value.

so:
BLOCK_M=64
BLOCK_N=256
BLOCK_K=32
num_warps=8

GEMM:M=660,N/K as above
kernel tile:64,256,32,8

Signed-off-by: Xu Qing <qing.xu2@amd.com>
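
As a rough illustration of how the chosen tile maps onto the hot shapes above, the following sketch counts output tiles and padded rows for M=660. It assumes a plain 2-D launch grid of `ceil(M / BLOCK_M) * ceil(N / BLOCK_N)` tiles, which the commit message does not state; `tile_stats` is a hypothetical helper.

```python
HOT_SHAPES = [  # (M, N, K) from the commit message
    (660, 19456, 2560),
    (660, 2560, 9728),
    (660, 6144, 2560),
    (660, 2560, 4096),
]


def tile_stats(m: int, n: int, block_m: int, block_n: int) -> tuple[int, float]:
    """Return (number of output tiles, fraction of padded rows along M)."""
    tiles_m = -(-m // block_m)  # integer ceil division
    tiles_n = -(-n // block_n)
    padded_rows = tiles_m * block_m
    return tiles_m * tiles_n, (padded_rows - m) / padded_rows


for m, n, k in HOT_SHAPES:
    for block_m in (64, 128):
        tiles, waste = tile_stats(m, n, block_m, 256)
        print(f"M={m} N={n} K={k} BLOCK_M={block_m}: {tiles} tiles, {waste:.1%} padded rows")
```

With M=660, BLOCK_M=64 rounds up to 704 rows (6.25% padding) while BLOCK_M=128 rounds up to 768 rows (about 14% padding). That is one plausible factor behind BLOCK_M=64 here, though the PR reports the choice as an empirical result.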
Add a Qwen3-VL-4B prefill shape guard in Triton unified attention on gfx11
and apply BM64/T32/W4/S1/EU4 defaults to reduce TTFT. Also fix prefill
tile override wiring so VLLM_UA_PREFILL_TILE_SIZE is honored instead of
being overwritten by the default path.

Signed-off-by: Xu Qing <qing.xu2@amd.com>
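
The override fix described above comes down to precedence: an explicit environment setting should win over the shape-based default. A minimal sketch of that ordering follows; `resolve_prefill_tile_size` is a hypothetical helper, and the way vLLM actually parses `VLLM_UA_PREFILL_TILE_SIZE` may differ.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_prefill_tile_size(default_tile: int,
                              env: Optional[Mapping[str, str]] = None) -> int:
    """Return the env override if set, otherwise the shape-based default."""
    env = os.environ if env is None else env
    raw = env.get("VLLM_UA_PREFILL_TILE_SIZE")
    if raw is None or raw.strip() == "":
        return default_tile  # no override: keep the default path's choice
    return int(raw)          # explicit override takes precedence


# The default path proposes 32 (assumed here to be the "T32" default); the override wins if present.
print(resolve_prefill_tile_size(32, {}))
print(resolve_prefill_tile_size(32, {"VLLM_UA_PREFILL_TILE_SIZE": "64"}))
```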
@qingxuamd qingxuamd force-pushed the qingxu/qwen3-vl-optimize branch from 33478ea to db8ba35 Compare June 17, 2026 01:58