Skip to content

feat(gemma4): add mtp loader and step graph#182

Closed
dusterbloom wants to merge 16 commits into
Luce-Org:mainfrom
dusterbloom:split/gemma4-10-mtp-loader-step-graph
Closed

feat(gemma4): add mtp loader and step graph#182
dusterbloom wants to merge 16 commits into
Luce-Org:mainfrom
dusterbloom:split/gemma4-10-mtp-loader-step-graph

Conversation

@dusterbloom

@dusterbloom dusterbloom commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Validation (for review)

Upstream llama-bench baseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.

MTP assistant loader + per-step cross-attention graph. Plus: deletes per-token-per-layer [mtp-fa-types] stdout flood and a redundant ggml_cont on the non-TQ3 path.

Metric Without PR With PR
MTP graph builds ✓ (4 MTP layers, 322 MiB GPU)
Dense MTP @ 16k decode n/a 21.4 tok/s (γ=1; γ=2 follow-up)
Stdout pollution per MTP token 4-8 lines × n_layer 0
Byte-identical at γ=1 vs AR (non-accepted tokens) n/a

PR #10 of the Gemma4 split sequence. Final non-daemon PR. Adds Gemma4 MTP (Multi-Token Prediction) assistant loading and one-step graph construction. Not wired into the decode loop — that integration lands with deferred PR #11.

Depends on #181#180#179#177#176#171#170#169#168. Stacks on `split/gemma4-09-dflash-draft-runtime`; diff includes ancestor commits until they merge.

Scope (6 files, +1761 / -0)

  • `dflash/src/gemma4_target_loader.cpp` — +447 lines, purely additive. Re-adds the four MTP loader/helper symbols that were intentionally stripped from this file in PR feat(gemma4): add target API and GGUF loader #171 (per the PR04 spec):
    • `load_gemma4_mtp_assistant(path, backend, out)`
    • `free_gemma4_mtp_assistant(w)`
    • `get_mtp_swa_pattern(path, out)`
    • `resolve_mtp_donor_layers(mtp, target_swa_layers)`
      None of the PR feat(gemma4): add target API and GGUF loader #171 target-loader behavior is reopened.
  • `dflash/src/gemma4_mtp_graph.cpp` — new file, 760 lines. Defines `build_mtp_step_graph(...)` (the single-step MTP forward) and `free_mtp_step_graph(...)` (cleanup). Each MTP layer reads target K/V from `w.layers[il].donor_target_layer` (resolved at load time).
  • `dflash/src/internal.h` — +116 lines. Adds:
    • MTP h_prev fields to `GemmaTargetCache`: `mtp_h_prev`, `mtp_h_prev_enabled`, `mtp_last_full_layer`, `mtp_h_prev_row`, `mtp_h_prev_batch`, `mtp_h_prev_capture_mode`. Default-off; only allocated when MTP enabled.
    • `MtpLayerWeights` struct (Q-only attention; V always read from donor target KV; `attention_k_eq_v` quirk per atomicbot:gemma4-assistant.cpp).
    • `MtpDrafterWeights` struct (pre/post projection, output norm, tied tok_embd, optional centroids).
    • `MtpStepGraph` struct (one-step graph state).
    • 6 function prototypes (loader + helpers + graph build/free).
  • `dflash/test/gemma4/test_mtp_loader.cpp` — new, 128 lines. Loads an assistant GGUF, validates loaded weights + donor-layer resolution.
  • `dflash/test/gemma4/test_mtp_graph_shapes.cpp` — new, 298 lines. Validates the one-step graph's input/output tensor shapes against the spec, without running compute.
  • `dflash/CMakeLists.txt` — wires `src/gemma4_mtp_graph.cpp` into `dflash27b` and adds both test targets.

Risk: HIGH

First introduction of the MTP code surface. The loader bumps `internal.h` with a sizable struct surface (MtpLayerWeights, MtpDrafterWeights, MtpStepGraph), and `gemma4_mtp_graph.cpp` is the most architecture-specific file in the stack.

NOT in this PR (deferred to PR #11)

`dflash/src/gemma4_target_graph.cpp` is untouched in this PR — the MTP-related changes to that file (h_prev capture, KV override) are decode-loop integration that PR #11 will introduce.

Review checklist (from spec)

  • MTP assistant loader is reviewable independently from runtime decode ✓ (`load_gemma4_mtp_assistant` is a pure-mmap/parse function; `test_mtp_loader` exercises it standalone)
  • Graph-shape test covers the one-step graph ✓ (`test_mtp_graph_shapes`)
  • MTP donor layer resolution is deterministic and validated ✓ (`resolve_mtp_donor_layers` walks `target_swa_layers` and matches by attention type; loader test asserts the result)
  • No `test_gemma4_dflash` CLI behavior changes ✓ (file untouched in this PR)

Validation

  • `g++ -fsyntax-only gemma4_target_loader.cpp` against new `internal.h` → exit 0
  • `g++ -fsyntax-only gemma4_mtp_graph.cpp` → exit 0
  • Full `cmake --build test_mtp_loader test_mtp_graph_shapes` deferred to CI (no local CUDA build budget this turn)
  • Runtime invocation requires a Gemma4 MTP assistant GGUF on disk.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 21 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/test/test_flash_attn_sparse.cpp">

<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:114">
P2: The dense-vs-sparse equality check is far too loose; `max_diff < 1.0f` can let major incorrect outputs pass even though the test claims exact match.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread dflash/test/test_flash_attn_sparse.cpp Outdated
@dusterbloom dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch from 6e884e0 to ff15c4d Compare May 13, 2026 14:54
@dusterbloom dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch 2 times, most recently from 64bd65d to 797b56b Compare May 13, 2026 15:58
dusterbloom and others added 6 commits May 13, 2026 21:14
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
The test calls pflash_register_ggml_kernel(), which is defined in
src/pflash_ggml_adapter.cpp. That source is only compiled into
dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test
target was unguarded, so CI runners building at sm<80 successfully
built dflash27b without the adapter, then failed to link the test:

    undefined reference to `pflash_register_ggml_kernel()'

Mirror the same guard used by test_flashprefill_kernels at line ~396.
Plain-C++ compile units that include internal.h (e.g. smoke tests built
with g++ rather than nvcc) hit a fatal error on the unconditional
#include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header
is only needed by src/cuda_cross_device_copy.cpp (the implementation
TU), which already includes it directly.

Replace the header include with a forward declaration of cudaStream_t
(typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_
between_devices prototype don't need CUDA include paths.

Found via CI failure on the smoke_load_gemma4_target /
smoke_gemma4_target_forward targets after CI started reaching them
following the test_flash_attn_sparse cmake guard fix.
@dusterbloom dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch from 797b56b to 7009729 Compare May 13, 2026 19:16
The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing
gemma4_target_graph SWA layers to fall back to attn_mask via the
effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the
full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc
slots, so flash attention reads past the end of attn_mask into adjacent
GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB
bytes are interpreted as fp16 mask values added to attention scores).

Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not. The bug is
documented in gemma4_runtime_helpers.cpp:124-130 — the runtime helper used
by the daemon path sets swa_mask correctly. The smoke driver bypassed that
helper.

Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill
the same causal pattern as attn_mask. Caught by running the test on a
real GPU (CI only compiles, doesn't run).

No production code changed.
The test stub allocated GemmaTargetCache::attn_k/attn_v with max_ctx=64,
but production create_gemma4_cache (gemma4_target_graph.cpp:570) 256-aligns
max_ctx for full-attention donor layers (head_dim>=512 requires it for
the FA view padding contract). With max_ctx=64, build_mtp_step_graph for
the first full-attention donor layer (layer 3 in Dense 31B's SWA-odd
pattern, donor=58) tried to ggml_view_3d 256 rows from a 64-row parent
and tripped GGML_ASSERT(data_size + view_offs <= ggml_nbytes(view_src))
in ggml.c:1748.

Bumping the test stub to max_ctx=256 mirrors production's 256-alignment
policy. No production code changed; production was never affected since
its allocator already handles this correctly.

Caught by running the test on a real GPU (CI only compiles, doesn't run).
@dusterbloom dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch from 7009729 to a475ece Compare May 13, 2026 20:43
The [mtp-fa-types] printf in build_mtp_step_graph fires inside the
per-layer loop of the per-token MTP step graph. With 4-8 MTP layers and
one print per layer per decode token, it writes thousands of lines of
stdout during decode — and the daemon's banner protocol shares the same
stdout channel.

Delete the diagnostic; the FA-type computation it inspected is still
exercised by test_mtp_graph_shapes if needed for future debugging.

See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S1).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 19 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/test/gemma4/test_mtp_graph_shapes.cpp">

<violation number="1" location="dflash/test/gemma4/test_mtp_graph_shapes.cpp:80">
P1: Shape-only test still allocates the full `tok_embd` table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.</violation>
</file>

<file name="dflash/CMakeLists.txt">

<violation number="1" location="dflash/CMakeLists.txt:452">
P1: New Gemma4/MTP test targets hardcode `ggml-cuda` instead of using the backend-agnostic `${DFLASH27B_GGML_BACKEND_TARGET}`, breaking HIP test builds.</violation>
</file>

<file name="dflash/src/gemma4_target_graph.cpp">

<violation number="1" location="dflash/src/gemma4_target_graph.cpp:976">
P2: PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be il*nb[1] (not il*n_tokens*nb[1]) and view height should be 1 (not n_tokens).</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

// optional; nullptr → falls back to base rope_theta scaling).

out.backend = backend;
out.buf = ggml_backend_alloc_ctx_tensors(out.ctx, backend);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Shape-only test still allocates the full tok_embd table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/gemma4/test_mtp_graph_shapes.cpp, line 80:

<comment>Shape-only test still allocates the full `tok_embd` table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.</comment>

<file context>
@@ -0,0 +1,298 @@
+    // optional; nullptr → falls back to base rope_theta scaling).
+
+    out.backend = backend;
+    out.buf = ggml_backend_alloc_ctx_tensors(out.ctx, backend);
+    if (!out.buf) { ggml_free(out.ctx); out.ctx = nullptr; return false; }
+
</file context>

Comment thread dflash/CMakeLists.txt
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_target.cpp")
add_executable(smoke_load_gemma4_target test/gemma4/smoke_load_gemma4_target.cpp)
target_include_directories(smoke_load_gemma4_target PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS})
target_link_libraries(smoke_load_gemma4_target PRIVATE dflash27b ggml ggml-cuda)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: New Gemma4/MTP test targets hardcode ggml-cuda instead of using the backend-agnostic ${DFLASH27B_GGML_BACKEND_TARGET}, breaking HIP test builds.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/CMakeLists.txt, line 452:

<comment>New Gemma4/MTP test targets hardcode `ggml-cuda` instead of using the backend-agnostic `${DFLASH27B_GGML_BACKEND_TARGET}`, breaking HIP test builds.</comment>

<file context>
@@ -436,6 +446,41 @@ if(DFLASH27B_TESTS)
+    if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_target.cpp")
+        add_executable(smoke_load_gemma4_target test/gemma4/smoke_load_gemma4_target.cpp)
+        target_include_directories(smoke_load_gemma4_target PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS})
+        target_link_libraries(smoke_load_gemma4_target PRIVATE dflash27b ggml ggml-cuda)
+    endif()
+    if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_gemma4_target_forward.cpp")
</file context>

const int n_embd_per_layer = w.n_embd_per_layer > 0 ? w.n_embd_per_layer
: (int)in.per_layer_inp->ne[0];
ggml_tensor * ple_emb;
if (ggml_n_dims(in.per_layer_inp) >= 3 || (int)in.per_layer_inp->ne[1] == w.n_layer) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be ilnb[1] (not iln_tokens*nb[1]) and view height should be 1 (not n_tokens).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/gemma4_target_graph.cpp, line 976:

<comment>PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be il*nb[1] (not il*n_tokens*nb[1]) and view height should be 1 (not n_tokens).</comment>

<file context>
@@ -0,0 +1,1073 @@
+            const int n_embd_per_layer = w.n_embd_per_layer > 0 ? w.n_embd_per_layer
+                                                                  : (int)in.per_layer_inp->ne[0];
+            ggml_tensor * ple_emb;
+            if (ggml_n_dims(in.per_layer_inp) >= 3 || (int)in.per_layer_inp->ne[1] == w.n_layer) {
+                // Shape [n_embd_per_layer, n_layer] or [n_embd_per_layer, n_tokens, n_layer]
+                ple_emb = ggml_view_2d(ctx, in.per_layer_inp,
</file context>

ggml_flash_attn_ext returns a contiguous tensor by spec. The
unconditional `ggml_cont(ctx, attn_out)` at the end of every MTP layer
was a no-op kernel for the non-TQ3 path (the v_is_tq3 branch above
already inserts its own cont before turbo_wht). Gate the cont on
v_is_tq3 to skip the kernel when not needed.

Also remove the now-unused `kv_is_tq3` local that became dead after
the [mtp-fa-types] printf was deleted (its only consumer).

See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S8).
…nsors_est

The hoisted layer-0 donor KV validation at the top of
build_mtp_step_graph was 100% duplicated by the in-loop validation
that iterates il = 0..n_layer-1 (which already runs the same
donor_target_layer / kv_read_slot bounds checks for layer 0 on its
first iteration). Delete the hoisted block; preserve the in-loop check
unchanged. No semantic change — the error-set messages from the
in-loop path are equivalent and include the layer index.

Also document the n_tensors_est formula's "80 ops per layer" magic
number with a per-op breakdown so a future maintainer doesn't have to
reverse-engineer it.

See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Additional findings
— defensive checks that never fire / magic numbers).
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record fresh worktree probes for PRs Luce-Org#182, Luce-Org#181, Luce-Org#180, Luce-Org#154, and Luce-Org#131, including Codex feasibility reports for the selective-port candidates and retained audit worktree paths.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record the 2026-05-28 06:33Z unattended refresh, direct merge probes, and PR Luce-Org#182 delegated feasibility results.
@dusterbloom

Copy link
Copy Markdown
Collaborator Author

Closing: targets obsolete dflash/src/gemma4_mtp_*.cpp paths. Gemma4 MTP would now be implemented as a concrete IExternalDrafterMtp module against the interface declared in #237 — fresh implementation, not a port. The tensor-layout / asymmetric-KV knowledge from this branch remains useful as reference if/when that follow-up is picked up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant