Gfx11#29
Draft
liangliangchang wants to merge 22 commits into
Draft
Conversation
Instrumented CUDA kernels to log actual M×N×K dimensions to stderr for performance analysis and TFLOPS/bandwidth calculations. Changes: - quantize.cu: Log M, K, grid for quantize_mmq_q8_1 kernel - mmq.cu: Log M, N, K, ne dimensions for mul_mat_q (GEMM) kernel - mmvq.cu: Log M, K dimensions for mul_mat_vec_q (GEMV) kernel Usage: llama-bench ... 2>&1 | tee profile.log The logged dimensions are matched with rocprofv3 timing data to calculate: - TFLOPS: From actual M×N×K dimensions (2×M×N×K FLOPs) - Bandwidth: From actual data transfer patterns This instrumentation enabled accurate performance analysis showing: - GEMM kernels: 8.83-43.58 TFLOPS (varies by quantization type) - Quantize kernels: 284 GB/s bandwidth (data in L2 cache) - GEMV kernels: 0.05-5.52 TFLOPS (memory-bound decode) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Log GEMM as M=ne11, N=ne01, K=ne00 and GEMV as M=ne01, K=ne00 so instrumented shapes match actual matmul dimensions. Co-authored-by: Cursor <cursoragent@cursor.com>
On gfx115x, mmq_x=128 uses more than half of per-CU LDS and caps occupancy at one workgroup per CU. Select a smaller mmq_x that fits two workgroups without doubling M-tiles, improving P0 prefill TFLOPS ~25% on gfx1151. Co-authored-by: Cursor <cursoragent@cursor.com>
…ning. Restore WMMA granularity at mmq_x>=64, widen tile_y padding to cut LDS bank conflicts, and software-pipeline Q8_1 activation loads into registers during vec_dot to hide global memory latency on gfx115x. Co-authored-by: Cursor <cursoragent@cursor.com>
Enables MMQ_PROFILE_PHASES=1 to report load_tiles, y_act, and vec_dot cycle shares per launch, guiding RDNA3.5 prefill optimization work. Co-authored-by: Cursor <cursoragent@cursor.com>
…head. Add gfx115x load_tiles_q4_K_dm_rdna35 and vec_dot_q4_K_q8_1_mma with per-k dmA hoisting; restore pre-profiling tile loop when MMQ_PROFILE_PHASES is off so extra __syncthreads do not regress P0 TFLOPS. Co-authored-by: Cursor <cursoragent@cursor.com>
Split load_tiles_q4_K qs staging into load_tiles_q4_K_qs_wmma_rdna35 (one warp/row, coalesced qs global load) and mmq_q4_K_qs_lds_k helper that documents the pseudo-q8_1 LDS offsets used by load_ldmatrix. Perf (gfx1151, P0 128×12288×4096): neutral vs prior commit — 26.56 vs 26.51 TFLOPS, 485 vs 486 µs, pp128 ~1135 t/s (within run-to-run noise). Layout math is unchanged; this is groundwork for a Q4_K-native MMA tile. Co-authored-by: Cursor <cursoragent@cursor.com>
fprintf for MUL_MAT_Q, MUL_MAT_VEC_Q, QUANTIZE_MMQ_Q8_1, and MUL_MAT_Q_LAUNCH are off by default to avoid CPU overhead in llama-bench TTFT runs; set GGML_CUDA_MM_LOG=1 for profiling scripts. Co-authored-by: Cursor <cursoragent@cursor.com>
Hoist dmA*dsB into per-lane scl[] before mma on RDNA3.5 vec_dot_q4_K_q8_1_mma to separate scale multiply from C accumulation (ATT showed v_fma_mix hotspots). Prefill P0 Q4_K_M 128x12288x4096 (3x profile_mmq_prefill): median 26.59 TFLOPS / 484.7 us vs 26.56 / 485 us baseline; VGPR 224 (+8), LDS conflict ratio 0.348. Assisted-by: Auto Co-authored-by: Cursor <cursoragent@cursor.com>
clock64() can move backward across __syncthreads, which inflated phase cycle counts and produced garbage load_tiles percentages; discard small backward glitches and skip corrupt host-side totals. Co-authored-by: Cursor <cursoragent@cursor.com>
Use mmq_x=128 for small nrows_x grids instead of dual-WG LDS downsizing, and hoist dA in Q8_0/Q4_0 vec_dot on RDNA3.5 WMMA. Gate 128×32×4096 drops ~209µs to ~105µs with no P0 Q4_K regression. Co-authored-by: Cursor <cursoragent@cursor.com>
Precompute sclA = sc[k01/4]*d per k-slice in vec_dot_q6_K_q8_1_mma instead of reloading scales inside the j-loop. Q6_K FFN down 128×4096×12288 improves ~13.4 to ~15.2 TFLOPS with no P0 regression. Co-authored-by: Cursor <cursoragent@cursor.com>
Prefer ntx=1 tile width when batch ≤128 and LDS allows, improving Q6_K FFN down (~40% faster vs mmq_x=64 on gfx1151). Co-authored-by: Cursor <cursoragent@cursor.com>
…RDNA3.5 Prefetch B across j0 with batched WMMA/epilogue, pad tile_y loads for mmq_x<=64 software pipelining, add MMQ_PROFILE_VEC_SUB (off by default), and pick Q6_K mmq_x=128 only when LDS fits dual-WG occupancy (smpbo/2). Co-authored-by: Cursor <cursoragent@cursor.com>
Software register pipelining of tile_y loads showed no measurable gain (~4% y_act slice) and slightly regressed pp128 on gfx1151; keep padded tile_y LDS layout and the simple sequential load path. Co-authored-by: Cursor <cursoragent@cursor.com>
Load all B tiles per k-slice via load_generic (faster than load_ldmatrix from tile_y), interleave with A on the hot path, and merge the duplicate vec_dot profile loop into one path. Use fast process_tile when only MMQ_PROFILE_VEC_SUB is set. Co-authored-by: Cursor <cursoragent@cursor.com>
Revert the smpbo/2 gate added in 03927f1 for the Q6_K narrow-N launch heuristic. Q6 at mmq_x=128 (38 KiB, ntx=1) is ~40% faster than mmq_x=64 with ntx=2 even at 1 WG/CU; the dual-WG policy belongs on Q4_K only. Co-authored-by: Cursor <cursoragent@cursor.com>
Revert the d4511bd B-batch/load_generic vec_dot rewrite that regressed Q6 FFN down from ~13 to ~8 TFLOPS under rocprof trace (~1.5ms to ~1.1ms per launch). Keep the 4d9b940 j0 loop with load_ldmatrix for B; profile instrumentation remains on a separate g_mmq_profile early-return path only. Co-authored-by: Cursor <cursoragent@cursor.com>
Revert g_mmq_profile_outer/cudaMemcpyToSymbol and Q6 vec_dot global profile branches that regressed Q6_K FFN down ~15% under rocprof trace. Restore RDNA3.5 y-tile software pipeline and MMQ_PROFILE_PHASES-only instrumentation to match 4d9b940 prefill performance. Co-authored-by: Cursor <cursoragent@cursor.com>
Prefetch the next activation B tile while MMA runs on the current tile in the Q6_K narrow-N path, improving FFN-down kernel throughput without changing launch geometry. Co-authored-by: Cursor <cursoragent@cursor.com>
Drop the Q6-only mmq_x=128 upgrade so Q6_K FFN uses the same smpbo/2 mmq_x=64 ntx=2 launch as Q4_K, which measures faster on gfx115x prefill. Co-authored-by: Cursor <cursoragent@cursor.com>
Block quants (Q5_0, Q8_0) regressed when forced to mmq_x=64 for LDS dual-WG occupancy; keep mmq_x=128 for them while retaining mmq_x=64 for K-quants tuned on gfx115x (Q6_K FFN down). Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Additional information
Requirements