optimization qwen3-vl-4b TTFT for gfx1150 with 2 448x448 image and 256 text token input #1012
Open
qingxuamd wants to merge 2 commits into
This optimization targets gfx1150. For the Qwen3-VL-4B-Instruct-AWQ-4bit-lm_head_int8 model, triton_w4a16_skinny_fmt_kernel accounts for more than 60% of prefill latency.

The required input is 2 448x448 images plus 256 text tokens, so the prefill token count is roughly 660.

M: ~660 (2x448x448 + 256 token prefill)
Hot shapes (M x N x K):
660 x 19456 x 2560
660 x 2560 x 9728
660 x 6144 x 2560
660 x 2560 x 4096

num_warps: 8 is currently the best value, so:
BLOCK_M=64 BLOCK_N=256 BLOCK_K=32 num_warps=8

GEMM: M=671, N/K as above
kernel tile: 64, 256, 32, 8

Signed-off-by: Xu Qing <qing.xu2@amd.com>
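To make the tile choice concrete, the sketch below counts output tiles for the four hot shapes at the quoted M=671. It assumes one program per BLOCK_M x BLOCK_N output tile, which is the usual Triton GEMM launch pattern but is not confirmed for this kernel; the helper names `tile_grid` and `m_padding_fraction` are hypothetical.

```python
import math

# Tile sizes and M quoted in the PR description.
BLOCK_M, BLOCK_N = 64, 256
M = 671

# (N, K) hot shapes from the PR description.
HOT_SHAPES = [(19456, 2560), (2560, 9728), (6144, 2560), (2560, 4096)]


def tile_grid(m, n, block_m=BLOCK_M, block_n=BLOCK_N):
    """Return (tiles along M, tiles along N) for an m x n output.

    Assumption: one program per output tile (no split-K).
    """
    return math.ceil(m / block_m), math.ceil(n / block_n)


def m_padding_fraction(m, block_m=BLOCK_M):
    """Fraction of rows covered by the M tiles that lie beyond the real m rows."""
    padded = math.ceil(m / block_m) * block_m
    return (padded - m) / padded


if __name__ == "__main__":
    for n, k in HOT_SHAPES:
        tiles_m, tiles_n = tile_grid(M, n)
        print(f"N={n:5d} K={k:5d}: {tiles_m} x {tiles_n} = {tiles_m * tiles_n} tiles")
    print(f"padded rows along M: {m_padding_fraction(M):.1%}")
```

With BLOCK_M=64, M=671 rounds up to 11 row tiles (704 rows), so under 5% of the rows are padding; the narrow N=2560 shapes produce only 10 column tiles at BLOCK_N=256.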
Add a Qwen3-VL-4B prefill shape guard in Triton unified attention on gfx11 and apply BM64/T32/W4/S1/EU4 defaults to reduce TTFT. Also fix prefill tile override wiring so VLLM_UA_PREFILL_TILE_SIZE is honored instead of being overwritten by the default path. Signed-off-by: Xu Qing <qing.xu2@amd.com>
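The override fix described above can be pictured with a minimal sketch. This is not the PR's code: `select_prefill_tile_size` and `DEFAULT_PREFILL_TILE_SIZE` are hypothetical names, and reading "T32" as a default tile size of 32 is an assumption. Only the environment variable name comes from the commit message.

```python
import os

# Assumption: "T32" in the commit message means a default tile size of 32.
DEFAULT_PREFILL_TILE_SIZE = 32


def select_prefill_tile_size(env=None):
    """Pick the prefill tile size, letting the environment variable win.

    The point of the fix is ordering: the default is only used when
    VLLM_UA_PREFILL_TILE_SIZE is unset, so it can never overwrite an
    explicit user override.
    """
    env = os.environ if env is None else env
    raw = env.get("VLLM_UA_PREFILL_TILE_SIZE")
    if raw is None or raw.strip() == "":
        return DEFAULT_PREFILL_TILE_SIZE
    return int(raw)


if __name__ == "__main__":
    print(select_prefill_tile_size({}))
    print(select_prefill_tile_size({"VLLM_UA_PREFILL_TILE_SIZE": "64"}))
```

Passing the environment as a plain mapping keeps the selection logic testable without mutating `os.environ`.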
mgehre-amd
reviewed
Jun 22, 2026
Comment on lines +278 to +287
# Profile-guided default for Qwen3-VL-like multimodal prefill.
qwen3_prefill_shapes = {
    (19456, 2560),  # gate_up_proj-like
    (2560, 9728),  # down_proj-like
    (6144, 2560),  # qkv_proj-like
    (2560, 4096),  # o_proj-like
}
if 576 <= M <= 832 and (N, K) in qwen3_prefill_shapes:
    return 64, 256, min(32, group_size), 8
This change already landed in https://github.com/ROCm/vllm/pull/1009/changes for bfloat16.
Could you please put this in the same shape, i.e.
if on_gfx1103() and M > 256:
    # Tested on Qwen3-VL-4B-AWQ
    block_m, block_n, block_k, num_warps = 64, 256, 64, 8
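To see how the two guards differ in coverage, here is a small standalone sketch that puts both rules side by side. It is illustrative only: `pr_tile` and `suggested_tile` are hypothetical wrappers, the boolean `on_target_gpu` stands in for the `on_gfx1103()` check, and `group_size=128` in the demo is an assumed value.

```python
# (N, K) shapes listed in the PR diff.
QWEN3_PREFILL_SHAPES = {
    (19456, 2560),  # gate_up_proj-like
    (2560, 9728),   # down_proj-like
    (6144, 2560),   # qkv_proj-like
    (2560, 4096),   # o_proj-like
}


def pr_tile(M, N, K, group_size):
    """Guard as written in the PR diff; None means 'fall through to defaults'."""
    if 576 <= M <= 832 and (N, K) in QWEN3_PREFILL_SHAPES:
        return 64, 256, min(32, group_size), 8
    return None


def suggested_tile(M, on_target_gpu):
    """Guard in the reviewer's suggested form; on_target_gpu replaces on_gfx1103()."""
    if on_target_gpu and M > 256:
        return 64, 256, 64, 8
    return None


if __name__ == "__main__":
    # group_size=128 is an assumption for the demo, not a value from the PR.
    for M in (300, 671, 1024):
        print(M, pr_tile(M, 19456, 2560, 128), suggested_tile(M, True))
```

The PR's guard fires only for 576 <= M <= 832 on the four listed shapes and uses BLOCK_K of at most 32, while the suggested form fires for any M > 256 on the target GPU regardless of (N, K) and uses BLOCK_K=64.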
This PR tunes the kernel tile specifically for qwen3-vl-4b and the input shape above. Without this PR, TTFT = 1231 ms; with this PR, TTFT = 980 ms (about 20% lower).