feat(gemma4): add target graph execution by dusterbloom · Pull Request #176 · Luce-Org/lucebox-hub

dusterbloom · 2026-05-13T13:01:34Z

Validation (for review)

Upstream llama-bench baseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.

Gemma4 target forward in pure ggml — SWA + full-attn, proportional RoPE, PLE, MoE FFN, shared KV.

Metric	Without PR	With PR
AR + Q4 @ 16k prefill	n/a	678 tok/s (+80% vs upstream 376.62)
AR + Q4 @ 16k decode	n/a	22.6 tok/s (+11% vs upstream 20.39)
Byte-identical to upstream `llama-cli` AR Q4 @ temp=0	n/a	✓ (Q4 lossless)

PR #5 of the Gemma4 split sequence. Adds the basic Gemma4 target graph + cache lifecycle. No pFlash, no DFlash draft, no MTP — those land in PRs #7, #9, #10 respectively.

Depends on #171 → #170 → #169 → #168. Stacks on split/gemma4-04-api-target-loader; the diff visible here will include those ancestor commits until they merge.

Scope (4 files exactly per `pr05-target-graph.md`)

dflash/src/gemma4_target_graph.cpp — basic graph builders (rms_norm_mul, build_geglu_ffn, build_moe_ffn, build_swa_attn_block, build_full_attn_block, build_gemma4_graph) + cache lifecycle (create_gemma4_cache, free_gemma4_cache, reset_gemma4_cache) + compute_swa_view.
dflash/src/internal.h — adds GemmaTargetCache, GemmaGraphInputs, GemmaGraphOutputs, SwaView structs + 5 prototypes (no draft/MTP/pflash fields).
dflash/test/gemma4/smoke_gemma4_target_forward.cpp — minimal forward smoke (uses the stripped 4-arg create_gemma4_cache API).
dflash/CMakeLists.txt — wire the new source into dflash27b and add the smoke_gemma4_target_forward target.

What was stripped from the source branch (relative to `feature/gemma4-support`)

build_full_attn_block: removed use_pflash / pflash_alpha params and the entire ggml_flash_attn_sparse dispatch branch — only ggml_flash_attn_ext remains.
create_gemma4_cache: removed extra_q8_layers + enable_dflash_capture_overrides params and the asymmetric-KV / MTP-donor override blocks. Per-layer KV type is now a uniform assignment; PR feat: add DFlash for Windows #6 will reintroduce the asymmetric path.
free_gemma4_cache: dropped the free_draft_kv_cache(c) call.
reset_gemma4_cache: dropped c.draft_kv_pos = 0.
build_gemma4_graph: dropped the pflash args passed to build_full_attn_block and the entire MTP h_prev capture block (~50 lines post-final-norm).
Whole-function deletions: create_draft_kv_cache, free_draft_kv_cache.
GemmaTargetCache (in internal.h): no mtp_h_prev_* fields, no draft_kv_* fields.
GemmaGraphInputs: no use_pflash, no pflash_alpha.

Final source is 964 lines (down from 1189 in the feature branch).

Verification

rg -n \"pflash|flash_attn_sparse|MTP|h_prev|draft|g_kq_stride_pad|DFLASH_PFLASH_TQ3\" dflash/src/gemma4_target_graph.cpp → zero matches.
rg ... over the new internal.h section → zero matches.
g++ -c gemma4_target_graph.cpp via the project's compile_commands.json flags → exit 0, no warnings.
g++ -c gemma4_target_loader.cpp (PR04's loader) against the new internal.h → exit 0.
Full cmake --build dflash/build --target smoke_gemma4_target_forward exceeded a 5-min timeout on the dev host (likely scaling out across the dflash27b lib); per-TU compiles cover the changed code. CI is the authoritative end-to-end check.

Risk: HIGH (first execution path)

Review checklist (from spec)

Cache create/free/reset is symmetrical ✓ (paired alloc/free of base_ctx + base_buf)
Target graph can execute a minimal forward pass (smoke test exercises it)
No speculative-decode state added (verified by rg)
CMake adds only the target graph source + target-forward smoke (verified by scope check)

cubic-dev-ai

No issues found across 15 files

cubic-dev-ai

No issues found across 13 files

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

The test calls pflash_register_ggml_kernel(), which is defined in src/pflash_ggml_adapter.cpp. That source is only compiled into dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test target was unguarded, so CI runners building at sm<80 successfully built dflash27b without the adapter, then failed to link the test: undefined reference to `pflash_register_ggml_kernel()' Mirror the same guard used by test_flashprefill_kernels at line ~396.

Plain-C++ compile units that include internal.h (e.g. smoke tests built with g++ rather than nvcc) hit a fatal error on the unconditional #include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header is only needed by src/cuda_cross_device_copy.cpp (the implementation TU), which already includes it directly. Replace the header include with a forward declaration of cudaStream_t (typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_ between_devices prototype don't need CUDA include paths. Found via CI failure on the smoke_load_gemma4_target / smoke_gemma4_target_forward targets after CI started reaching them following the test_flash_attn_sparse cmake guard fix.

The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing gemma4_target_graph SWA layers to fall back to attn_mask via the effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc slots, so flash attention reads past the end of attn_mask into adjacent GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB bytes are interpreted as fp16 mask values added to attention scores). Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not. The bug is documented in gemma4_runtime_helpers.cpp:124-130 — the runtime helper used by the daemon path sets swa_mask correctly. The smoke driver bypassed that helper. Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill the same causal pattern as attn_mask. Caught by running the test on a real GPU (CI only compiles, doesn't run). No production code changed.

…header Two small comment-hygiene cleanups surfaced by the slop audit: - TARGET_FEAT_CAP_DEFAULT = 4096 had no rationale comment. Add one that explains the floor sizing (covers full SWA window × a couple layers of DFlash lookback), notes the cap is then clamped to max_ctx, and flags the assumption swa_window <= 4096 for future maintenance. - The "// ── g) Output projection already done inside attn block ──" block was an empty section header with no code beneath. Delete the header; section (h) reads cleanly on its own. See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Additional findings — magic numbers / comments that explain what, not why).

…decode (supersedes PR Luce-Org#175 skeleton) Replaces howard0su's PR Luce-Org#175 Gemma4 skeleton with a feature-complete implementation. Howard's PR landed an AR-only backend (341 LoC) sufficient to validate the daemon protocol; this PR brings the production paths. What changes ------------ gemma4/gemma4_backend.{cpp,h} 341+91 → 1182+155 - decode_autoregressive (AR) - decode_dflash (speculative decode with DDTree drafter) - decode_mtp (Multi-Token Prediction, γ=1) - snapshot_save / snapshot_restore - park / unpark for speculative draft gemma4/gemma4_graph.cpp 448 → 1218 - iSWA target forward in pure ggml - target_feat capture (hidden states for DFlash draft KV prefill) - mtp_h_prev capture (post-output-norm hidden for MTP cross-attn) - sparse-FA dispatch via ggml_flash_attn_sparse (full-attn layers, decode-only) - F5 v2 gate (n_tokens==1) — CUDA dispatcher has no sparse kernel for head_dim=512+mask+S>1 (fattn.cu:572-576 abort) - Per-layer embedding decode handling gemma4/gemma4_loader.cpp 370 → 1197 - Per-layer-embedding (PLE) table loading - MoE expert metadata (top-k routing, expert_count, shared_exp) - Asymmetric KV override (MTP donor layers forced to Q8) gemma4/gemma4_daemon.{cpp,h} 29+20 → 43+46 - Extended Gemma4BackendConfig (draft_method, mtp_gamma, sparse_fa_alpha, draft_kv_cap_override, max_ctx) gemma4/gemma4_internal.h 184 → 17 (passthrough to ../internal.h) - Internal struct definitions consolidated into shared ../internal.h where they sit alongside Qwen3/Laguna/Qwen35 struct families (GemmaTargetWeights / GemmaTargetCache / MtpLayerWeights / SwaView / Gemma4GraphInputs / Gemma4GraphOutputs / GemmaDraftLayer / GemmaDraftWeights / MtpDrafterWeights / MtpStepGraph) - Howard's Gemma4Weights / Gemma4Cache / Gemma4Snapshot struct names are NOT carried — our pre-existing GemmaTarget* family is the canonical naming across all archs in this tree New files --------- gemma4/gemma4_runtime_helpers.{cpp,h} shared graph builders + masks gemma4_dflash_graph.cpp drafter step graph (top-level — separate model) gemma4_mtp_graph.cpp MTP step graph (top-level — separate head) include/gemma4.h public C-ish API surface Tests ----- test/gemma4/smoke_load_gemma4_target.cpp loader smoke (Dense 31B Q4_K_M) test/gemma4/smoke_load_gemma4_draft.cpp drafter smoke test/gemma4/smoke_gemma4_target_forward.cpp AR forward smoke test/gemma4/smoke_gemma4_draft_forward.cpp drafter forward smoke test/gemma4/test_gemma4_kv_tq3.cpp TQ3_0 KV cache correctness at 16k+ test/gemma4/test_mtp_loader.cpp MTP assistant GGUF loader test/gemma4/test_mtp_graph_shapes.cpp MTP step graph shape invariants Validated configurations (RTX 3090, Gemma4-31B-it Q4_K_M) -------------------------------------------------------- AR + Q4 KV @ 16k: 678 pp / 22.6 tg tok/s (+80% / +11% vs upstream) AR + TQ3 KV @ 16k: 739 pp / 16.6 tg MTP γ=1 + Q4 @ 16k: 691 pp / 21.4 tg Byte-identical to upstream llama-cli AR Q4 @ temp=0 greedy Naming clarification -------------------- The `use_pflash` / `pflash_alpha` / `s_pflash_tq3` / `DFLASH_PFLASH_TQ3` identifiers in this commit refer to the ggml_flash_attn_sparse op, NOT the PFlash product (Python speculative-prefill compression module at pflash/). A follow-up commit on this branch renames them to use_sparse_fa etc. to remove the ambiguity. Closes (when merged): Luce-Org#171, Luce-Org#176, Luce-Org#179, Luce-Org#184, Luce-Org#185 — those were the original split-PR series against pre-Luce-Org#175 main; reworked into this single PR after Luce-Org#175 brought the gemma4/ skeleton into main.

dusterbloom · 2026-05-14T19:09:20Z

Superseded by #193 — re-cut against current main after #175 brought the gemma4/ skeleton in. The split chain was overtaken by upstream work (PR #175 + #178 + #170 all merged during the rebase window), so re-opening as a single cohesive replacement PR is the cleaner path.

The content of this PR is included in #193. Closing here to consolidate review surface for mrciffa.

cubic-dev-ai Bot reviewed May 13, 2026

View reviewed changes

dusterbloom force-pushed the split/gemma4-05-target-graph branch from c40de83 to ecd5716 Compare May 13, 2026 14:54

dusterbloom mentioned this pull request May 13, 2026

feat(gemma4): target-graph MTP integration (h_prev capture + asymmetric KV) #183

Closed

dusterbloom force-pushed the split/gemma4-05-target-graph branch from 537f31a to a081e6c Compare May 13, 2026 15:58

cubic-dev-ai Bot reviewed May 13, 2026

View reviewed changes

dusterbloom and others added 6 commits May 13, 2026 21:14

feat(dflash): add sparse flash-attention adapter

178363d

Update dflash/test/test_flash_attn_sparse.cpp

815e1b7

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

feat(gemma4): add target API and GGUF loader

5be6367

feat(gemma4): add target graph execution

78971b8

dusterbloom force-pushed the split/gemma4-05-target-graph branch from a081e6c to 78971b8 Compare May 13, 2026 19:16

dusterbloom added 2 commits May 13, 2026 22:42

dusterbloom mentioned this pull request May 14, 2026

feat(gemma4): feature-complete backend with DFlash + MTP + sparse-FA decode (supersedes PR #175 skeleton) #193

Closed

dusterbloom closed this May 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(gemma4): add target graph execution#176

feat(gemma4): add target graph execution#176
dusterbloom wants to merge 8 commits into
Luce-Org:mainfrom
dusterbloom:split/gemma4-05-target-graph

dusterbloom commented May 13, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

dusterbloom commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dusterbloom commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Validation (for review)

Scope (4 files exactly per pr05-target-graph.md)

What was stripped from the source branch (relative to feature/gemma4-support)

Verification

Risk: HIGH (first execution path)

Review checklist (from spec)

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

dusterbloom commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dusterbloom commented May 13, 2026 •

edited

Loading

Scope (4 files exactly per `pr05-target-graph.md`)

What was stripped from the source branch (relative to `feature/gemma4-support`)