Gfx11 by liangliangchang · Pull Request #29 · ROCm/llama.cpp

liangliangchang · 2026-06-26T18:04:55Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

Instrumented CUDA kernels to log actual M×N×K dimensions to stderr for performance analysis and TFLOPS/bandwidth calculations. Changes: - quantize.cu: Log M, K, grid for quantize_mmq_q8_1 kernel - mmq.cu: Log M, N, K, ne dimensions for mul_mat_q (GEMM) kernel - mmvq.cu: Log M, K dimensions for mul_mat_vec_q (GEMV) kernel Usage: llama-bench ... 2>&1 | tee profile.log The logged dimensions are matched with rocprofv3 timing data to calculate: - TFLOPS: From actual M×N×K dimensions (2×M×N×K FLOPs) - Bandwidth: From actual data transfer patterns This instrumentation enabled accurate performance analysis showing: - GEMM kernels: 8.83-43.58 TFLOPS (varies by quantization type) - Quantize kernels: 284 GB/s bandwidth (data in L2 cache) - GEMV kernels: 0.05-5.52 TFLOPS (memory-bound decode) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Log GEMM as M=ne11, N=ne01, K=ne00 and GEMV as M=ne01, K=ne00 so instrumented shapes match actual matmul dimensions. Co-authored-by: Cursor <cursoragent@cursor.com>

On gfx115x, mmq_x=128 uses more than half of per-CU LDS and caps occupancy at one workgroup per CU. Select a smaller mmq_x that fits two workgroups without doubling M-tiles, improving P0 prefill TFLOPS ~25% on gfx1151. Co-authored-by: Cursor <cursoragent@cursor.com>

…ning. Restore WMMA granularity at mmq_x>=64, widen tile_y padding to cut LDS bank conflicts, and software-pipeline Q8_1 activation loads into registers during vec_dot to hide global memory latency on gfx115x. Co-authored-by: Cursor <cursoragent@cursor.com>

Enables MMQ_PROFILE_PHASES=1 to report load_tiles, y_act, and vec_dot cycle shares per launch, guiding RDNA3.5 prefill optimization work. Co-authored-by: Cursor <cursoragent@cursor.com>

…head. Add gfx115x load_tiles_q4_K_dm_rdna35 and vec_dot_q4_K_q8_1_mma with per-k dmA hoisting; restore pre-profiling tile loop when MMQ_PROFILE_PHASES is off so extra __syncthreads do not regress P0 TFLOPS. Co-authored-by: Cursor <cursoragent@cursor.com>

Split load_tiles_q4_K qs staging into load_tiles_q4_K_qs_wmma_rdna35 (one warp/row, coalesced qs global load) and mmq_q4_K_qs_lds_k helper that documents the pseudo-q8_1 LDS offsets used by load_ldmatrix. Perf (gfx1151, P0 128×12288×4096): neutral vs prior commit — 26.56 vs 26.51 TFLOPS, 485 vs 486 µs, pp128 ~1135 t/s (within run-to-run noise). Layout math is unchanged; this is groundwork for a Q4_K-native MMA tile. Co-authored-by: Cursor <cursoragent@cursor.com>

fprintf for MUL_MAT_Q, MUL_MAT_VEC_Q, QUANTIZE_MMQ_Q8_1, and MUL_MAT_Q_LAUNCH are off by default to avoid CPU overhead in llama-bench TTFT runs; set GGML_CUDA_MM_LOG=1 for profiling scripts. Co-authored-by: Cursor <cursoragent@cursor.com>

Hoist dmA*dsB into per-lane scl[] before mma on RDNA3.5 vec_dot_q4_K_q8_1_mma to separate scale multiply from C accumulation (ATT showed v_fma_mix hotspots). Prefill P0 Q4_K_M 128x12288x4096 (3x profile_mmq_prefill): median 26.59 TFLOPS / 484.7 us vs 26.56 / 485 us baseline; VGPR 224 (+8), LDS conflict ratio 0.348. Assisted-by: Auto Co-authored-by: Cursor <cursoragent@cursor.com>

clock64() can move backward across __syncthreads, which inflated phase cycle counts and produced garbage load_tiles percentages; discard small backward glitches and skip corrupt host-side totals. Co-authored-by: Cursor <cursoragent@cursor.com>

Use mmq_x=128 for small nrows_x grids instead of dual-WG LDS downsizing, and hoist dA in Q8_0/Q4_0 vec_dot on RDNA3.5 WMMA. Gate 128×32×4096 drops ~209µs to ~105µs with no P0 Q4_K regression. Co-authored-by: Cursor <cursoragent@cursor.com>

Precompute sclA = sc[k01/4]*d per k-slice in vec_dot_q6_K_q8_1_mma instead of reloading scales inside the j-loop. Q6_K FFN down 128×4096×12288 improves ~13.4 to ~15.2 TFLOPS with no P0 regression. Co-authored-by: Cursor <cursoragent@cursor.com>

Prefer ntx=1 tile width when batch ≤128 and LDS allows, improving Q6_K FFN down (~40% faster vs mmq_x=64 on gfx1151). Co-authored-by: Cursor <cursoragent@cursor.com>

…RDNA3.5 Prefetch B across j0 with batched WMMA/epilogue, pad tile_y loads for mmq_x<=64 software pipelining, add MMQ_PROFILE_VEC_SUB (off by default), and pick Q6_K mmq_x=128 only when LDS fits dual-WG occupancy (smpbo/2). Co-authored-by: Cursor <cursoragent@cursor.com>

Software register pipelining of tile_y loads showed no measurable gain (~4% y_act slice) and slightly regressed pp128 on gfx1151; keep padded tile_y LDS layout and the simple sequential load path. Co-authored-by: Cursor <cursoragent@cursor.com>

Load all B tiles per k-slice via load_generic (faster than load_ldmatrix from tile_y), interleave with A on the hot path, and merge the duplicate vec_dot profile loop into one path. Use fast process_tile when only MMQ_PROFILE_VEC_SUB is set. Co-authored-by: Cursor <cursoragent@cursor.com>

Revert the smpbo/2 gate added in 03927f1 for the Q6_K narrow-N launch heuristic. Q6 at mmq_x=128 (38 KiB, ntx=1) is ~40% faster than mmq_x=64 with ntx=2 even at 1 WG/CU; the dual-WG policy belongs on Q4_K only. Co-authored-by: Cursor <cursoragent@cursor.com>

Revert the d4511bd B-batch/load_generic vec_dot rewrite that regressed Q6 FFN down from ~13 to ~8 TFLOPS under rocprof trace (~1.5ms to ~1.1ms per launch). Keep the 4d9b940 j0 loop with load_ldmatrix for B; profile instrumentation remains on a separate g_mmq_profile early-return path only. Co-authored-by: Cursor <cursoragent@cursor.com>

Revert g_mmq_profile_outer/cudaMemcpyToSymbol and Q6 vec_dot global profile branches that regressed Q6_K FFN down ~15% under rocprof trace. Restore RDNA3.5 y-tile software pipeline and MMQ_PROFILE_PHASES-only instrumentation to match 4d9b940 prefill performance. Co-authored-by: Cursor <cursoragent@cursor.com>

Prefetch the next activation B tile while MMA runs on the current tile in the Q6_K narrow-N path, improving FFN-down kernel throughput without changing launch geometry. Co-authored-by: Cursor <cursoragent@cursor.com>

Drop the Q6-only mmq_x=128 upgrade so Q6_K FFN uses the same smpbo/2 mmq_x=64 ntx=2 launch as Q4_K, which measures faster on gfx115x prefill. Co-authored-by: Cursor <cursoragent@cursor.com>

Block quants (Q5_0, Q8_0) regressed when forced to mmq_x=64 for LDS dual-WG occupancy; keep mmq_x=128 for them while retaining mmq_x=64 for K-quants tuned on gfx115x (Q6_K FFN down). Co-authored-by: Cursor <cursoragent@cursor.com>

liangliangchang and others added 22 commits June 23, 2026 13:12

Fix M/N/K dimension logging for MMQ and MMVQ profiling.

3c63ba2

Log GEMM as M=ne11, N=ne01, K=ne00 and GEMV as M=ne01, K=ne00 so instrumented shapes match actual matmul dimensions. Co-authored-by: Cursor <cursoragent@cursor.com>

Add opt-in HIP MMQ phase profiling to split dequant vs MMA time.

7248ec9

Enables MMQ_PROFILE_PHASES=1 to report load_tiles, y_act, and vec_dot cycle shares per launch, guiding RDNA3.5 prefill optimization work. Co-authored-by: Cursor <cursoragent@cursor.com>

cuda: use mmq_x=128 for Q6_K narrow-N prefill on RDNA3.5

4d9b940

Prefer ntx=1 tile width when batch ≤128 and LDS allows, improving Q6_K FFN down (~40% faster vs mmq_x=64 on gfx1151). Co-authored-by: Cursor <cursoragent@cursor.com>

cuda: pipeline Q6_K B-tile loads in RDNA3.5 WMMA vec_dot

272c12e

Prefetch the next activation B tile while MMA runs on the current tile in the Q6_K narrow-N path, improving FFN-down kernel throughput without changing launch geometry. Co-authored-by: Cursor <cursoragent@cursor.com>

cuda: use dual-WG mmq_x=64 for Q6_K narrow-N on RDNA3.5

4648486

Drop the Q6-only mmq_x=128 upgrade so Q6_K FFN uses the same smpbo/2 mmq_x=64 ntx=2 launch as Q4_K, which measures faster on gfx115x prefill. Co-authored-by: Cursor <cursoragent@cursor.com>

cuda: limit RDNA3.5 dual-WG mmq_x downgrade to K-quants

35bd178

Block quants (Q5_0, Q8_0) regressed when forced to mmq_x=64 for LDS dual-WG occupancy; keep mmq_x=128 for them while retaining mmq_x=64 for K-quants tuned on gfx115x (Q6_K FFN down). Co-authored-by: Cursor <cursoragent@cursor.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Gfx11#29

Gfx11#29
liangliangchang wants to merge 22 commits into
ROCm:gfx11from
liangliangchang:gfx11

liangliangchang commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

liangliangchang commented Jun 26, 2026

Overview

Additional information

Requirements

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant