feat(gemma4): add target graph execution#176
Closed
dusterbloom wants to merge 8 commits into
Closed
Conversation
This was referenced May 13, 2026
c40de83 to
ecd5716
Compare
537f31a to
a081e6c
Compare
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
The test calls pflash_register_ggml_kernel(), which is defined in
src/pflash_ggml_adapter.cpp. That source is only compiled into
dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test
target was unguarded, so CI runners building at sm<80 successfully
built dflash27b without the adapter, then failed to link the test:
undefined reference to `pflash_register_ggml_kernel()'
Mirror the same guard used by test_flashprefill_kernels at line ~396.
Plain-C++ compile units that include internal.h (e.g. smoke tests built with g++ rather than nvcc) hit a fatal error on the unconditional #include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header is only needed by src/cuda_cross_device_copy.cpp (the implementation TU), which already includes it directly. Replace the header include with a forward declaration of cudaStream_t (typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_ between_devices prototype don't need CUDA include paths. Found via CI failure on the smoke_load_gemma4_target / smoke_gemma4_target_forward targets after CI started reaching them following the test_flash_attn_sparse cmake guard fix.
a081e6c to
78971b8
Compare
The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing gemma4_target_graph SWA layers to fall back to attn_mask via the effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc slots, so flash attention reads past the end of attn_mask into adjacent GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB bytes are interpreted as fp16 mask values added to attention scores). Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not. The bug is documented in gemma4_runtime_helpers.cpp:124-130 — the runtime helper used by the daemon path sets swa_mask correctly. The smoke driver bypassed that helper. Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill the same causal pattern as attn_mask. Caught by running the test on a real GPU (CI only compiles, doesn't run). No production code changed.
…header Two small comment-hygiene cleanups surfaced by the slop audit: - TARGET_FEAT_CAP_DEFAULT = 4096 had no rationale comment. Add one that explains the floor sizing (covers full SWA window × a couple layers of DFlash lookback), notes the cap is then clamped to max_ctx, and flags the assumption swa_window <= 4096 for future maintenance. - The "// ── g) Output projection already done inside attn block ──" block was an empty section header with no code beneath. Delete the header; section (h) reads cleanly on its own. See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Additional findings — magic numbers / comments that explain what, not why).
dusterbloom
added a commit
to dusterbloom/lucebox-hub
that referenced
this pull request
May 14, 2026
…decode (supersedes PR Luce-Org#175 skeleton) Replaces howard0su's PR Luce-Org#175 Gemma4 skeleton with a feature-complete implementation. Howard's PR landed an AR-only backend (341 LoC) sufficient to validate the daemon protocol; this PR brings the production paths. What changes ------------ gemma4/gemma4_backend.{cpp,h} 341+91 → 1182+155 - decode_autoregressive (AR) - decode_dflash (speculative decode with DDTree drafter) - decode_mtp (Multi-Token Prediction, γ=1) - snapshot_save / snapshot_restore - park / unpark for speculative draft gemma4/gemma4_graph.cpp 448 → 1218 - iSWA target forward in pure ggml - target_feat capture (hidden states for DFlash draft KV prefill) - mtp_h_prev capture (post-output-norm hidden for MTP cross-attn) - sparse-FA dispatch via ggml_flash_attn_sparse (full-attn layers, decode-only) - F5 v2 gate (n_tokens==1) — CUDA dispatcher has no sparse kernel for head_dim=512+mask+S>1 (fattn.cu:572-576 abort) - Per-layer embedding decode handling gemma4/gemma4_loader.cpp 370 → 1197 - Per-layer-embedding (PLE) table loading - MoE expert metadata (top-k routing, expert_count, shared_exp) - Asymmetric KV override (MTP donor layers forced to Q8) gemma4/gemma4_daemon.{cpp,h} 29+20 → 43+46 - Extended Gemma4BackendConfig (draft_method, mtp_gamma, sparse_fa_alpha, draft_kv_cap_override, max_ctx) gemma4/gemma4_internal.h 184 → 17 (passthrough to ../internal.h) - Internal struct definitions consolidated into shared ../internal.h where they sit alongside Qwen3/Laguna/Qwen35 struct families (GemmaTargetWeights / GemmaTargetCache / MtpLayerWeights / SwaView / Gemma4GraphInputs / Gemma4GraphOutputs / GemmaDraftLayer / GemmaDraftWeights / MtpDrafterWeights / MtpStepGraph) - Howard's Gemma4Weights / Gemma4Cache / Gemma4Snapshot struct names are NOT carried — our pre-existing GemmaTarget* family is the canonical naming across all archs in this tree New files --------- gemma4/gemma4_runtime_helpers.{cpp,h} shared graph builders + masks gemma4_dflash_graph.cpp drafter step graph (top-level — separate model) gemma4_mtp_graph.cpp MTP step graph (top-level — separate head) include/gemma4.h public C-ish API surface Tests ----- test/gemma4/smoke_load_gemma4_target.cpp loader smoke (Dense 31B Q4_K_M) test/gemma4/smoke_load_gemma4_draft.cpp drafter smoke test/gemma4/smoke_gemma4_target_forward.cpp AR forward smoke test/gemma4/smoke_gemma4_draft_forward.cpp drafter forward smoke test/gemma4/test_gemma4_kv_tq3.cpp TQ3_0 KV cache correctness at 16k+ test/gemma4/test_mtp_loader.cpp MTP assistant GGUF loader test/gemma4/test_mtp_graph_shapes.cpp MTP step graph shape invariants Validated configurations (RTX 3090, Gemma4-31B-it Q4_K_M) -------------------------------------------------------- AR + Q4 KV @ 16k: 678 pp / 22.6 tg tok/s (+80% / +11% vs upstream) AR + TQ3 KV @ 16k: 739 pp / 16.6 tg MTP γ=1 + Q4 @ 16k: 691 pp / 21.4 tg Byte-identical to upstream llama-cli AR Q4 @ temp=0 greedy Naming clarification -------------------- The `use_pflash` / `pflash_alpha` / `s_pflash_tq3` / `DFLASH_PFLASH_TQ3` identifiers in this commit refer to the ggml_flash_attn_sparse op, NOT the PFlash product (Python speculative-prefill compression module at pflash/). A follow-up commit on this branch renames them to use_sparse_fa etc. to remove the ambiguity. Closes (when merged): Luce-Org#171, Luce-Org#176, Luce-Org#179, Luce-Org#184, Luce-Org#185 — those were the original split-PR series against pre-Luce-Org#175 main; reworked into this single PR after Luce-Org#175 brought the gemma4/ skeleton into main.
Collaborator
Author
|
Superseded by #193 — re-cut against current main after #175 brought the The content of this PR is included in #193. Closing here to consolidate review surface for mrciffa. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Validation (for review)
Upstream
llama-benchbaseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.Gemma4 target forward in pure ggml — SWA + full-attn, proportional RoPE, PLE, MoE FFN, shared KV.
llama-cliAR Q4 @ temp=0PR #5 of the Gemma4 split sequence. Adds the basic Gemma4 target graph + cache lifecycle. No pFlash, no DFlash draft, no MTP — those land in PRs #7, #9, #10 respectively.
Depends on #171 → #170 → #169 → #168. Stacks on
split/gemma4-04-api-target-loader; the diff visible here will include those ancestor commits until they merge.Scope (4 files exactly per
pr05-target-graph.md)dflash/src/gemma4_target_graph.cpp— basic graph builders (rms_norm_mul,build_geglu_ffn,build_moe_ffn,build_swa_attn_block,build_full_attn_block,build_gemma4_graph) + cache lifecycle (create_gemma4_cache,free_gemma4_cache,reset_gemma4_cache) +compute_swa_view.dflash/src/internal.h— addsGemmaTargetCache,GemmaGraphInputs,GemmaGraphOutputs,SwaViewstructs + 5 prototypes (no draft/MTP/pflash fields).dflash/test/gemma4/smoke_gemma4_target_forward.cpp— minimal forward smoke (uses the stripped 4-argcreate_gemma4_cacheAPI).dflash/CMakeLists.txt— wire the new source intodflash27band add thesmoke_gemma4_target_forwardtarget.What was stripped from the source branch (relative to
feature/gemma4-support)build_full_attn_block: removeduse_pflash/pflash_alphaparams and the entireggml_flash_attn_sparsedispatch branch — onlyggml_flash_attn_extremains.create_gemma4_cache: removedextra_q8_layers+enable_dflash_capture_overridesparams and the asymmetric-KV / MTP-donor override blocks. Per-layer KV type is now a uniform assignment; PR feat: add DFlash for Windows #6 will reintroduce the asymmetric path.free_gemma4_cache: dropped thefree_draft_kv_cache(c)call.reset_gemma4_cache: droppedc.draft_kv_pos = 0.build_gemma4_graph: dropped the pflash args passed tobuild_full_attn_blockand the entire MTPh_prevcapture block (~50 lines post-final-norm).create_draft_kv_cache,free_draft_kv_cache.GemmaTargetCache(ininternal.h): nomtp_h_prev_*fields, nodraft_kv_*fields.GemmaGraphInputs: nouse_pflash, nopflash_alpha.Final source is 964 lines (down from 1189 in the feature branch).
Verification
rg -n \"pflash|flash_attn_sparse|MTP|h_prev|draft|g_kq_stride_pad|DFLASH_PFLASH_TQ3\" dflash/src/gemma4_target_graph.cpp→ zero matches.rg ...over the newinternal.hsection → zero matches.g++ -c gemma4_target_graph.cppvia the project'scompile_commands.jsonflags → exit 0, no warnings.g++ -c gemma4_target_loader.cpp(PR04's loader) against the newinternal.h→ exit 0.cmake --build dflash/build --target smoke_gemma4_target_forwardexceeded a 5-min timeout on the dev host (likely scaling out across the dflash27b lib); per-TU compiles cover the changed code. CI is the authoritative end-to-end check.Risk: HIGH (first execution path)
Review checklist (from spec)
base_ctx+base_buf)