feat(gemma4): add mtp loader and step graph#182
Conversation
There was a problem hiding this comment.
1 issue found across 21 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/test/test_flash_attn_sparse.cpp">
<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:114">
P2: The dense-vs-sparse equality check is far too loose; `max_diff < 1.0f` can let major incorrect outputs pass even though the test claims exact match.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
6e884e0 to
ff15c4d
Compare
64bd65d to
797b56b
Compare
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
The test calls pflash_register_ggml_kernel(), which is defined in
src/pflash_ggml_adapter.cpp. That source is only compiled into
dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test
target was unguarded, so CI runners building at sm<80 successfully
built dflash27b without the adapter, then failed to link the test:
undefined reference to `pflash_register_ggml_kernel()'
Mirror the same guard used by test_flashprefill_kernels at line ~396.
Plain-C++ compile units that include internal.h (e.g. smoke tests built with g++ rather than nvcc) hit a fatal error on the unconditional #include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header is only needed by src/cuda_cross_device_copy.cpp (the implementation TU), which already includes it directly. Replace the header include with a forward declaration of cudaStream_t (typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_ between_devices prototype don't need CUDA include paths. Found via CI failure on the smoke_load_gemma4_target / smoke_gemma4_target_forward targets after CI started reaching them following the test_flash_attn_sparse cmake guard fix.
797b56b to
7009729
Compare
The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing gemma4_target_graph SWA layers to fall back to attn_mask via the effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc slots, so flash attention reads past the end of attn_mask into adjacent GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB bytes are interpreted as fp16 mask values added to attention scores). Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not. The bug is documented in gemma4_runtime_helpers.cpp:124-130 — the runtime helper used by the daemon path sets swa_mask correctly. The smoke driver bypassed that helper. Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill the same causal pattern as attn_mask. Caught by running the test on a real GPU (CI only compiles, doesn't run). No production code changed.
The test stub allocated GemmaTargetCache::attn_k/attn_v with max_ctx=64, but production create_gemma4_cache (gemma4_target_graph.cpp:570) 256-aligns max_ctx for full-attention donor layers (head_dim>=512 requires it for the FA view padding contract). With max_ctx=64, build_mtp_step_graph for the first full-attention donor layer (layer 3 in Dense 31B's SWA-odd pattern, donor=58) tried to ggml_view_3d 256 rows from a 64-row parent and tripped GGML_ASSERT(data_size + view_offs <= ggml_nbytes(view_src)) in ggml.c:1748. Bumping the test stub to max_ctx=256 mirrors production's 256-alignment policy. No production code changed; production was never affected since its allocator already handles this correctly. Caught by running the test on a real GPU (CI only compiles, doesn't run).
7009729 to
a475ece
Compare
The [mtp-fa-types] printf in build_mtp_step_graph fires inside the per-layer loop of the per-token MTP step graph. With 4-8 MTP layers and one print per layer per decode token, it writes thousands of lines of stdout during decode — and the daemon's banner protocol shares the same stdout channel. Delete the diagnostic; the FA-type computation it inspected is still exercised by test_mtp_graph_shapes if needed for future debugging. See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S1).
There was a problem hiding this comment.
3 issues found across 19 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/test/gemma4/test_mtp_graph_shapes.cpp">
<violation number="1" location="dflash/test/gemma4/test_mtp_graph_shapes.cpp:80">
P1: Shape-only test still allocates the full `tok_embd` table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.</violation>
</file>
<file name="dflash/CMakeLists.txt">
<violation number="1" location="dflash/CMakeLists.txt:452">
P1: New Gemma4/MTP test targets hardcode `ggml-cuda` instead of using the backend-agnostic `${DFLASH27B_GGML_BACKEND_TARGET}`, breaking HIP test builds.</violation>
</file>
<file name="dflash/src/gemma4_target_graph.cpp">
<violation number="1" location="dflash/src/gemma4_target_graph.cpp:976">
P2: PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be il*nb[1] (not il*n_tokens*nb[1]) and view height should be 1 (not n_tokens).</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| // optional; nullptr → falls back to base rope_theta scaling). | ||
|
|
||
| out.backend = backend; | ||
| out.buf = ggml_backend_alloc_ctx_tensors(out.ctx, backend); |
There was a problem hiding this comment.
P1: Shape-only test still allocates the full tok_embd table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/gemma4/test_mtp_graph_shapes.cpp, line 80:
<comment>Shape-only test still allocates the full `tok_embd` table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.</comment>
<file context>
@@ -0,0 +1,298 @@
+ // optional; nullptr → falls back to base rope_theta scaling).
+
+ out.backend = backend;
+ out.buf = ggml_backend_alloc_ctx_tensors(out.ctx, backend);
+ if (!out.buf) { ggml_free(out.ctx); out.ctx = nullptr; return false; }
+
</file context>
| if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_target.cpp") | ||
| add_executable(smoke_load_gemma4_target test/gemma4/smoke_load_gemma4_target.cpp) | ||
| target_include_directories(smoke_load_gemma4_target PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS}) | ||
| target_link_libraries(smoke_load_gemma4_target PRIVATE dflash27b ggml ggml-cuda) |
There was a problem hiding this comment.
P1: New Gemma4/MTP test targets hardcode ggml-cuda instead of using the backend-agnostic ${DFLASH27B_GGML_BACKEND_TARGET}, breaking HIP test builds.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/CMakeLists.txt, line 452:
<comment>New Gemma4/MTP test targets hardcode `ggml-cuda` instead of using the backend-agnostic `${DFLASH27B_GGML_BACKEND_TARGET}`, breaking HIP test builds.</comment>
<file context>
@@ -436,6 +446,41 @@ if(DFLASH27B_TESTS)
+ if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_target.cpp")
+ add_executable(smoke_load_gemma4_target test/gemma4/smoke_load_gemma4_target.cpp)
+ target_include_directories(smoke_load_gemma4_target PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS})
+ target_link_libraries(smoke_load_gemma4_target PRIVATE dflash27b ggml ggml-cuda)
+ endif()
+ if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_gemma4_target_forward.cpp")
</file context>
| const int n_embd_per_layer = w.n_embd_per_layer > 0 ? w.n_embd_per_layer | ||
| : (int)in.per_layer_inp->ne[0]; | ||
| ggml_tensor * ple_emb; | ||
| if (ggml_n_dims(in.per_layer_inp) >= 3 || (int)in.per_layer_inp->ne[1] == w.n_layer) { |
There was a problem hiding this comment.
P2: PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be ilnb[1] (not iln_tokens*nb[1]) and view height should be 1 (not n_tokens).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/gemma4_target_graph.cpp, line 976:
<comment>PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be il*nb[1] (not il*n_tokens*nb[1]) and view height should be 1 (not n_tokens).</comment>
<file context>
@@ -0,0 +1,1073 @@
+ const int n_embd_per_layer = w.n_embd_per_layer > 0 ? w.n_embd_per_layer
+ : (int)in.per_layer_inp->ne[0];
+ ggml_tensor * ple_emb;
+ if (ggml_n_dims(in.per_layer_inp) >= 3 || (int)in.per_layer_inp->ne[1] == w.n_layer) {
+ // Shape [n_embd_per_layer, n_layer] or [n_embd_per_layer, n_tokens, n_layer]
+ ple_emb = ggml_view_2d(ctx, in.per_layer_inp,
</file context>
ggml_flash_attn_ext returns a contiguous tensor by spec. The unconditional `ggml_cont(ctx, attn_out)` at the end of every MTP layer was a no-op kernel for the non-TQ3 path (the v_is_tq3 branch above already inserts its own cont before turbo_wht). Gate the cont on v_is_tq3 to skip the kernel when not needed. Also remove the now-unused `kv_is_tq3` local that became dead after the [mtp-fa-types] printf was deleted (its only consumer). See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S8).
…nsors_est The hoisted layer-0 donor KV validation at the top of build_mtp_step_graph was 100% duplicated by the in-loop validation that iterates il = 0..n_layer-1 (which already runs the same donor_target_layer / kv_read_slot bounds checks for layer 0 on its first iteration). Delete the hoisted block; preserve the in-loop check unchanged. No semantic change — the error-set messages from the in-loop path are equivalent and include the layer index. Also document the n_tensors_est formula's "80 ops per layer" magic number with a per-op breakdown so a future maintainer doesn't have to reverse-engineer it. See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Additional findings — defensive checks that never fire / magic numbers).
Record fresh worktree probes for PRs Luce-Org#182, Luce-Org#181, Luce-Org#180, Luce-Org#154, and Luce-Org#131, including Codex feasibility reports for the selective-port candidates and retained audit worktree paths.
Record the 2026-05-28 06:33Z unattended refresh, direct merge probes, and PR Luce-Org#182 delegated feasibility results.
|
Closing: targets obsolete dflash/src/gemma4_mtp_*.cpp paths. Gemma4 MTP would now be implemented as a concrete IExternalDrafterMtp module against the interface declared in #237 — fresh implementation, not a port. The tensor-layout / asymmetric-KV knowledge from this branch remains useful as reference if/when that follow-up is picked up. |
Validation (for review)
Upstream
llama-benchbaseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.MTP assistant loader + per-step cross-attention graph. Plus: deletes per-token-per-layer
[mtp-fa-types]stdout flood and a redundant ggml_cont on the non-TQ3 path.PR #10 of the Gemma4 split sequence. Final non-daemon PR. Adds Gemma4 MTP (Multi-Token Prediction) assistant loading and one-step graph construction. Not wired into the decode loop — that integration lands with deferred PR #11.
Depends on #181 → #180 → #179 → #177 → #176 → #171 → #170 → #169 → #168. Stacks on `split/gemma4-09-dflash-draft-runtime`; diff includes ancestor commits until they merge.
Scope (6 files, +1761 / -0)
None of the PR feat(gemma4): add target API and GGUF loader #171 target-loader behavior is reopened.
Risk: HIGH
First introduction of the MTP code surface. The loader bumps `internal.h` with a sizable struct surface (MtpLayerWeights, MtpDrafterWeights, MtpStepGraph), and `gemma4_mtp_graph.cpp` is the most architecture-specific file in the stack.
NOT in this PR (deferred to PR #11)
--max-ctxsilently destroys attention throughput (20× prefill slowdown at 32K, 8× at 64K) #10 only declares the fields and the readers (MTP step graph); the writer wiring is PR fix(scripts): auto-fit --max-ctx to prompt size #11.`dflash/src/gemma4_target_graph.cpp` is untouched in this PR — the MTP-related changes to that file (h_prev capture, KV override) are decode-loop integration that PR #11 will introduce.
Review checklist (from spec)
Validation