feat(gemma4): add mtp loader and step graph by dusterbloom · Pull Request #182 · Luce-Org/lucebox-hub

dusterbloom · 2026-05-13T14:45:48Z

Validation (for review)

Upstream llama-bench baseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.

Adds the MTP assistant loader and the per-step cross-attention graph. Also deletes the per-token, per-layer `[mtp-fa-types]` stdout flood and a redundant `ggml_cont` on the non-TQ3 path.

| Metric | Without PR | With PR |
| --- | --- | --- |
| MTP graph builds | ✗ | ✓ (4 MTP layers, 322 MiB GPU) |
| Dense MTP @ 16k decode | n/a | 21.4 tok/s (γ=1; γ=2 follow-up) |
| Stdout pollution per MTP token | 4-8 lines × n_layer | 0 |
| Byte-identical at γ=1 vs AR (non-accepted tokens) | n/a | ✓ |

PR #10 of the Gemma4 split sequence. Final non-daemon PR. Adds Gemma4 MTP (Multi-Token Prediction) assistant loading and one-step graph construction. Not wired into the decode loop — that integration lands with deferred PR #11.

Depends on #181 → #180 → #179 → #177 → #176 → #171 → #170 → #169 → #168. Stacks on `split/gemma4-09-dflash-draft-runtime`; diff includes ancestor commits until they merge.

Scope (6 files, +1761 / -0)

- `dflash/src/gemma4_target_loader.cpp`: +447 lines, purely additive. Re-adds the four MTP loader/helper symbols that were intentionally stripped from this file in PR #171 (feat(gemma4): add target API and GGUF loader), per the PR04 spec:
  - `load_gemma4_mtp_assistant(path, backend, out)`
  - `free_gemma4_mtp_assistant(w)`
  - `get_mtp_swa_pattern(path, out)`
  - `resolve_mtp_donor_layers(mtp, target_swa_layers)`

  None of the PR #171 target-loader behavior is reopened.
- `dflash/src/gemma4_mtp_graph.cpp`: new file, 760 lines. Defines `build_mtp_step_graph(...)` (the single-step MTP forward) and `free_mtp_step_graph(...)` (cleanup). Each MTP layer reads target K/V from `w.layers[il].donor_target_layer` (resolved at load time).
- `dflash/src/internal.h`: +116 lines. Adds:
  - MTP h_prev fields to `GemmaTargetCache`: `mtp_h_prev`, `mtp_h_prev_enabled`, `mtp_last_full_layer`, `mtp_h_prev_row`, `mtp_h_prev_batch`, `mtp_h_prev_capture_mode`. Default-off; only allocated when MTP is enabled.
  - `MtpLayerWeights` struct (Q-only attention; V always read from donor target KV; `attention_k_eq_v` quirk per atomicbot:gemma4-assistant.cpp).
  - `MtpDrafterWeights` struct (pre/post projection, output norm, tied tok_embd, optional centroids).
  - `MtpStepGraph` struct (one-step graph state).
  - 6 function prototypes (loader + helpers + graph build/free).
- `dflash/test/gemma4/test_mtp_loader.cpp`: new, 128 lines. Loads an assistant GGUF, validates loaded weights + donor-layer resolution.
- `dflash/test/gemma4/test_mtp_graph_shapes.cpp`: new, 298 lines. Validates the one-step graph's input/output tensor shapes against the spec, without running compute.
- `dflash/CMakeLists.txt`: wires `src/gemma4_mtp_graph.cpp` into `dflash27b` and adds both test targets.

Risk: HIGH

First introduction of the MTP code surface. The loader bumps `internal.h` with a sizable struct surface (MtpLayerWeights, MtpDrafterWeights, MtpStepGraph), and `gemma4_mtp_graph.cpp` is the most architecture-specific file in the stack.

NOT in this PR (deferred to PR #11)

- Wiring MTP into the decode loop (`--draft-method mtp` flag, γ>1 chain decode).
- MTP h_prev capture block inside `build_gemma4_graph`: this is the WRITE side that fills `cache.mtp_h_prev`. PR #10 only declares the fields and the readers (MTP step graph); the writer wiring is PR #11.
- The asymmetric KV / MTP donor override path in `create_gemma4_cache`: kept stripped from PR #176 (feat(gemma4): add target graph execution) since it's part of decode-loop runtime, not loader/graph construction.
- `test_gemma4_dflash.cpp` runtime integration for `--draft-method mtp`.

`dflash/src/gemma4_target_graph.cpp` is untouched in this PR — the MTP-related changes to that file (h_prev capture, KV override) are decode-loop integration that PR #11 will introduce.

Review checklist (from spec)

- MTP assistant loader is reviewable independently from runtime decode ✓ (`load_gemma4_mtp_assistant` is a pure-mmap/parse function; `test_mtp_loader` exercises it standalone)
- Graph-shape test covers the one-step graph ✓ (`test_mtp_graph_shapes`)
- MTP donor layer resolution is deterministic and validated ✓ (`resolve_mtp_donor_layers` walks `target_swa_layers` and matches by attention type; loader test asserts the result)
- No `test_gemma4_dflash` CLI behavior changes ✓ (file untouched in this PR)
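
For reviewers who want a concrete picture of "walks `target_swa_layers` and matches by attention type", here is a minimal sketch. It assumes each MTP layer borrows K/V from the last target layer of the same attention type (SWA vs. full). The name `resolve_donors`, its signature, and the "last matching layer" rule are illustrative assumptions, not the PR's `resolve_mtp_donor_layers` implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for donor resolution (assumed rule, see lead-in):
// for each MTP layer, return the index of the LAST target layer whose
// attention type (true = SWA, false = full) matches, or -1 if none exists.
std::vector<int> resolve_donors(const std::vector<bool> & mtp_is_swa,
                                const std::vector<bool> & target_is_swa) {
    assert(!target_is_swa.empty());
    int last_swa = -1;
    int last_full = -1;
    // Single pass over the target pattern keeps the result deterministic.
    for (int il = 0; il < (int) target_is_swa.size(); ++il) {
        if (target_is_swa[il]) {
            last_swa = il;
        } else {
            last_full = il;
        }
    }
    std::vector<int> donors;
    donors.reserve(mtp_is_swa.size());
    for (bool is_swa : mtp_is_swa) {
        donors.push_back(is_swa ? last_swa : last_full);
    }
    return donors;
}
```

Because the result depends only on the two boolean patterns, a loader test can assert it without touching GPU state.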

Validation

- `g++ -fsyntax-only gemma4_target_loader.cpp` against new `internal.h` → exit 0
- `g++ -fsyntax-only gemma4_mtp_graph.cpp` → exit 0
- Full `cmake --build test_mtp_loader test_mtp_graph_shapes` deferred to CI (no local CUDA build budget this turn)
- Runtime invocation requires a Gemma4 MTP assistant GGUF on disk.

cubic-dev-ai

1 issue found across 21 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/test/test_flash_attn_sparse.cpp">

<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:114">
P2: The dense-vs-sparse equality check is far too loose; `max_diff < 1.0f` can let major incorrect outputs pass even though the test claims exact match.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
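
To illustrate the concern about `max_diff < 1.0f`, here is a minimal sketch of a tighter dense-vs-sparse comparison. The helper names and the default tolerance of `1e-3f` are illustrative assumptions, not the threshold adopted in the test; the right value depends on the data types and accumulation order involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest element-wise absolute difference between two equally sized buffers.
// A NaN in either input yields +inf so that it can never pass a tolerance check.
float max_abs_diff(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (std::isnan(d)) {
            return std::numeric_limits<float>::infinity();
        }
        m = std::max(m, d);
    }
    return m;
}

// Hypothetical pass/fail check; tol is an assumed, illustrative tolerance.
bool outputs_match(const std::vector<float> & dense,
                   const std::vector<float> & sparse,
                   float tol = 1e-3f) {
    return max_abs_diff(dense, sparse) <= tol;
}
```

With a bound this small, an output that is off by 0.5 fails, whereas it would pass under `max_diff < 1.0f`.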

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

The test calls pflash_register_ggml_kernel(), which is defined in src/pflash_ggml_adapter.cpp. That source is only compiled into dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test target was unguarded, so CI runners building at sm<80 successfully built dflash27b without the adapter, then failed to link the test:

undefined reference to `pflash_register_ggml_kernel()'

Mirror the same guard used by test_flashprefill_kernels at line ~396.

Plain-C++ compile units that include internal.h (e.g. smoke tests built with g++ rather than nvcc) hit a fatal error on the unconditional #include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header is only needed by src/cuda_cross_device_copy.cpp (the implementation TU), which already includes it directly.

Replace the header include with a forward declaration of cudaStream_t (typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_between_devices prototype don't need CUDA include paths.

Found via CI failure on the smoke_load_gemma4_target / smoke_gemma4_target_forward targets after CI started reaching them following the test_flash_attn_sparse cmake guard fix.
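
The forward-declaration technique described in this commit can be sketched as follows. The typedef is the one quoted in the commit message; the function name `copy_bytes_on_stream`, its parameter list, and the host-only body are hypothetical stand-ins so the sketch compiles and links without CUDA. They are not the real `dflash_cuda_copy_between_devices` signature.

```cpp
#include <cstddef>
#include <cstring>

// Forward declaration of the opaque stream handle, so this header-like code
// needs no CUDA include paths. A TU that also includes <cuda_runtime.h> sees
// the same typedef again, which C++ permits.
typedef struct CUstream_st * cudaStream_t;

// Hypothetical consumer-facing prototype (illustrative parameter list).
bool copy_bytes_on_stream(void * dst, const void * src, std::size_t nbytes,
                          cudaStream_t stream);

// Host-only stand-in definition. A real implementation TU would include
// <cuda_runtime.h> and issue an asynchronous copy on `stream` instead.
bool copy_bytes_on_stream(void * dst, const void * src, std::size_t nbytes,
                          cudaStream_t stream) {
    (void) stream; // unused in this host-only sketch
    if (dst == nullptr || src == nullptr) {
        return false;
    }
    std::memcpy(dst, src, nbytes);
    return true;
}
```

Callers only ever pass the handle through, so an incomplete `CUstream_st` type is sufficient on their side.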

The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing gemma4_target_graph SWA layers to fall back to attn_mask via the effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc slots, so flash attention reads past the end of attn_mask into adjacent GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB bytes are interpreted as fp16 mask values added to attention scores). Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not.

The bug is documented in gemma4_runtime_helpers.cpp:124-130: the runtime helper used by the daemon path sets swa_mask correctly. The smoke driver bypassed that helper.

Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill the same causal pattern as attn_mask.

Caught by running the test on a real GPU (CI only compiles, doesn't run). No production code changed.
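
A host-side sketch of the mask sizing described in this commit: the row width is the 256-aligned SWA slot count, not the full-attention `kv_len_padded`. Several things here are simplifying assumptions rather than the smoke driver's code: the mask is `float` instead of fp16, KV slot index equals absolute position (no ring-buffer wraparound), no window limit is applied, and the names `make_causal_mask` and `kv_start` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Round n up to the next multiple of a (a > 0).
int64_t align_up(int64_t n, int64_t a) {
    assert(a > 0);
    return (n + a - 1) / a * a;
}

// Build a causal mask of shape [align_up(kv_slots, 256), n_tokens], with the
// KV index as the fastest-varying dimension. Query token t is assumed to sit
// at absolute position kv_start + t and may attend to slots 0..kv_start + t.
// All other entries, including the alignment padding, stay at -inf.
std::vector<float> make_causal_mask(int64_t kv_slots, int64_t n_tokens,
                                    int64_t kv_start) {
    assert(kv_slots > 0 && n_tokens > 0 && kv_start >= 0);
    const int64_t kv_cols = align_up(kv_slots, 256);
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask((std::size_t) (kv_cols * n_tokens), neg_inf);
    for (int64_t t = 0; t < n_tokens; ++t) {
        const int64_t last = std::min(kv_start + t, kv_slots - 1);
        for (int64_t k = 0; k <= last; ++k) {
            mask[(std::size_t) (t * kv_cols + k)] = 0.0f;
        }
    }
    return mask;
}
```

The point of the sketch is the allocation width: a mask built for a shorter padded length would be read past its end by any layer that views the full SWA allocation.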

The test stub allocated GemmaTargetCache::attn_k/attn_v with max_ctx=64, but production create_gemma4_cache (gemma4_target_graph.cpp:570) 256-aligns max_ctx for full-attention donor layers (head_dim>=512 requires it for the FA view padding contract). With max_ctx=64, build_mtp_step_graph for the first full-attention donor layer (layer 3 in Dense 31B's SWA-odd pattern, donor=58) tried to ggml_view_3d 256 rows from a 64-row parent and tripped GGML_ASSERT(data_size + view_offs <= ggml_nbytes(view_src)) in ggml.c:1748.

Bumping the test stub to max_ctx=256 mirrors production's 256-alignment policy. No production code changed; production was never affected since its allocator already handles this correctly.

Caught by running the test on a real GPU (CI only compiles, doesn't run).
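
The failing assert reduces to a simple byte-range check, sketched below. The helper names and the 1024-byte row size in the usage note are hypothetical; only the inequality itself is taken from the assert quoted in the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of a (a > 0), e.g. 64 -> 256 for a = 256.
std::size_t align_up_rows(std::size_t n, std::size_t a) {
    assert(a > 0);
    return (n + a - 1) / a * a;
}

// Same condition as the quoted assert: a view of view_bytes starting at byte
// offset view_offs must lie entirely inside a parent of parent_bytes.
bool view_fits(std::size_t parent_bytes, std::size_t view_bytes,
               std::size_t view_offs) {
    return view_bytes + view_offs <= parent_bytes;
}
```

Assuming 1024 bytes per row, `view_fits(64 * 1024, 256 * 1024, 0)` is false (a 256-row view of a 64-row parent), while `view_fits(align_up_rows(64, 256) * 1024, 256 * 1024, 0)` is true, which is the effect of bumping the stub to `max_ctx=256`.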

The [mtp-fa-types] printf in build_mtp_step_graph fires inside the per-layer loop of the per-token MTP step graph. With 4-8 MTP layers and one print per layer per decode token, it writes thousands of lines of stdout during decode — and the daemon's banner protocol shares the same stdout channel. Delete the diagnostic; the FA-type computation it inspected is still exercised by test_mtp_graph_shapes if needed for future debugging. See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S1).

cubic-dev-ai

3 issues found across 19 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/test/gemma4/test_mtp_graph_shapes.cpp">

<violation number="1" location="dflash/test/gemma4/test_mtp_graph_shapes.cpp:80">
P1: Shape-only test still allocates the full `tok_embd` table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.</violation>
</file>

<file name="dflash/CMakeLists.txt">

<violation number="1" location="dflash/CMakeLists.txt:452">
P1: New Gemma4/MTP test targets hardcode `ggml-cuda` instead of using the backend-agnostic `${DFLASH27B_GGML_BACKEND_TARGET}`, breaking HIP test builds.</violation>
</file>

<file name="dflash/src/gemma4_target_graph.cpp">

<violation number="1" location="dflash/src/gemma4_target_graph.cpp:976">
P2: PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be il*nb[1] (not il*n_tokens*nb[1]) and view height should be 1 (not n_tokens).</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

cubic-dev-ai · 2026-05-14T12:21:56Z

+    // optional; nullptr → falls back to base rope_theta scaling).
+
+    out.backend = backend;
+    out.buf = ggml_backend_alloc_ctx_tensors(out.ctx, backend);


P1: Shape-only test still allocates the full tok_embd table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/gemma4/test_mtp_graph_shapes.cpp, line 80: <comment>Shape-only test still allocates the full `tok_embd` table on GPU, which can OOM CI and makes the smoke test unnecessarily expensive.</comment> <file context> @@ -0,0 +1,298 @@ + // optional; nullptr → falls back to base rope_theta scaling). + + out.backend = backend; + out.buf = ggml_backend_alloc_ctx_tensors(out.ctx, backend); + if (!out.buf) { ggml_free(out.ctx); out.ctx = nullptr; return false; } + </file context>

cubic-dev-ai · 2026-05-14T12:21:56Z

+    if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_target.cpp")
+        add_executable(smoke_load_gemma4_target test/gemma4/smoke_load_gemma4_target.cpp)
+        target_include_directories(smoke_load_gemma4_target PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS})
+        target_link_libraries(smoke_load_gemma4_target PRIVATE dflash27b ggml ggml-cuda)


P1: New Gemma4/MTP test targets hardcode ggml-cuda instead of using the backend-agnostic ${DFLASH27B_GGML_BACKEND_TARGET}, breaking HIP test builds.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At dflash/CMakeLists.txt, line 452: <comment>New Gemma4/MTP test targets hardcode `ggml-cuda` instead of using the backend-agnostic `${DFLASH27B_GGML_BACKEND_TARGET}`, breaking HIP test builds.</comment> <file context> @@ -436,6 +446,41 @@ if(DFLASH27B_TESTS) + if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_load_gemma4_target.cpp") + add_executable(smoke_load_gemma4_target test/gemma4/smoke_load_gemma4_target.cpp) + target_include_directories(smoke_load_gemma4_target PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS}) + target_link_libraries(smoke_load_gemma4_target PRIVATE dflash27b ggml ggml-cuda) + endif() + if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/gemma4/smoke_gemma4_target_forward.cpp") </file context>

cubic-dev-ai · 2026-05-14T12:21:56Z

+            const int n_embd_per_layer = w.n_embd_per_layer > 0 ? w.n_embd_per_layer
+                                                                  : (int)in.per_layer_inp->ne[0];
+            ggml_tensor * ple_emb;
+            if (ggml_n_dims(in.per_layer_inp) >= 3 || (int)in.per_layer_inp->ne[1] == w.n_layer) {


P2: PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be `il*nb[1]` (not `il*n_tokens*nb[1]`) and view height should be 1 (not n_tokens).

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/gemma4_target_graph.cpp, line 976: <comment>PLE slice computes incorrect view geometry for 2D [n_embd_per_layer, n_layer] tensors: offset should be il*nb[1] (not il*n_tokens*nb[1]) and view height should be 1 (not n_tokens).</comment> <file context> @@ -0,0 +1,1073 @@ + const int n_embd_per_layer = w.n_embd_per_layer > 0 ? w.n_embd_per_layer + : (int)in.per_layer_inp->ne[0]; + ggml_tensor * ple_emb; + if (ggml_n_dims(in.per_layer_inp) >= 3 || (int)in.per_layer_inp->ne[1] == w.n_layer) { + // Shape [n_embd_per_layer, n_layer] or [n_embd_per_layer, n_tokens, n_layer] + ple_emb = ggml_view_2d(ctx, in.per_layer_inp, </file context>
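
The geometry the reviewer describes for the 2D case can be sketched without ggml. This assumes a contiguous `float` tensor of shape [n_embd_per_layer, n_layer] with ggml-style strides (`nb[1]` is the byte stride between rows); the `SliceGeom` struct and `ple_slice_2d` helper are hypothetical and follow the review comment, not the file's code.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset and extent of a 2D view into the parent tensor.
struct SliceGeom {
    std::size_t offset_bytes;
    std::size_t width;   // elements along ne[0]
    std::size_t height;  // rows along ne[1]
};

// Slice for layer il out of a contiguous float tensor shaped
// [n_embd_per_layer, n_layer]: one row, starting at il * nb[1].
SliceGeom ple_slice_2d(std::size_t n_embd_per_layer, std::size_t n_layer,
                       std::size_t il) {
    assert(il < n_layer);
    const std::size_t nb1 = n_embd_per_layer * sizeof(float); // row stride
    return { il * nb1, n_embd_per_layer, 1 };
}
```

For comparison, an offset of `il*n_tokens*nb[1]` with `n_tokens = 4`, `n_layer = 60` and `il = 59` would start at row 236 of a 60-row tensor, which is outside the parent.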

ggml_flash_attn_ext returns a contiguous tensor by spec. The unconditional `ggml_cont(ctx, attn_out)` at the end of every MTP layer was a no-op kernel for the non-TQ3 path (the v_is_tq3 branch above already inserts its own cont before turbo_wht). Gate the cont on v_is_tq3 to skip the kernel when not needed. Also remove the now-unused `kv_is_tq3` local that became dead after the [mtp-fa-types] printf was deleted (its only consumer). See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S8).

…nsors_est

The hoisted layer-0 donor KV validation at the top of build_mtp_step_graph was 100% duplicated by the in-loop validation that iterates il = 0..n_layer-1 (which already runs the same donor_target_layer / kv_read_slot bounds checks for layer 0 on its first iteration). Delete the hoisted block; preserve the in-loop check unchanged. No semantic change: the error-set messages from the in-loop path are equivalent and include the layer index.

Also document the n_tensors_est formula's "80 ops per layer" magic number with a per-op breakdown so a future maintainer doesn't have to reverse-engineer it.

See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Additional findings: defensive checks that never fire / magic numbers).

Record fresh worktree probes for PRs Luce-Org#182, Luce-Org#181, Luce-Org#180, Luce-Org#154, and Luce-Org#131, including Codex feasibility reports for the selective-port candidates and retained audit worktree paths.

dusterbloom · 2026-05-28T18:09:04Z

Closing: targets obsolete dflash/src/gemma4_mtp_*.cpp paths. Gemma4 MTP would now be implemented as a concrete IExternalDrafterMtp module against the interface declared in #237 — fresh implementation, not a port. The tensor-layout / asymmetric-KV knowledge from this branch remains useful as reference if/when that follow-up is picked up.

cubic-dev-ai Bot reviewed May 13, 2026

Comment thread dflash/test/test_flash_attn_sparse.cpp Outdated

dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch from 6e884e0 to ff15c4d Compare May 13, 2026 14:54

dusterbloom mentioned this pull request May 13, 2026

feat(gemma4): target-graph MTP integration (h_prev capture + asymmetric KV) #183

Closed

dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch 2 times, most recently from 64bd65d to 797b56b Compare May 13, 2026 15:58

dusterbloom and others added 6 commits May 13, 2026 21:14

feat(dflash): add sparse flash-attention adapter

178363d

Update dflash/test/test_flash_attn_sparse.cpp

815e1b7

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

feat(gemma4): add target API and GGUF loader

5be6367

feat(gemma4): add target graph execution

78971b8

dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch from 797b56b to 7009729 Compare May 13, 2026 19:16

dusterbloom added 7 commits May 13, 2026 22:42

fix(gemma4): add long-context KV correctness

167ae83

feat(gemma4): route target prefill through pflash

c8cb9d8

feat(gemma4): add draft loader and quantization support

44bb2ce

feat(gemma4): add dflash draft runtime

6756002

feat(gemma4): add mtp loader and step graph

a475ece

dusterbloom force-pushed the split/gemma4-10-mtp-loader-step-graph branch from 7009729 to a475ece Compare May 13, 2026 20:43

cubic-dev-ai Bot reviewed May 14, 2026

dusterbloom added 2 commits May 14, 2026 14:49

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026

docs: refresh auto-integration manifest

d735948

Record the 2026-05-28 06:33Z unattended refresh, direct merge probes, and PR Luce-Org#182 delegated feasibility results.

dusterbloom closed this May 28, 2026

dusterbloom wants to merge 16 commits into Luce-Org:main from dusterbloom:split/gemma4-10-mtp-loader-step-graph
