Skip to content

feat(gemma4): add target graph execution#176

Closed
dusterbloom wants to merge 8 commits into
Luce-Org:mainfrom
dusterbloom:split/gemma4-05-target-graph
Closed

feat(gemma4): add target graph execution#176
dusterbloom wants to merge 8 commits into
Luce-Org:mainfrom
dusterbloom:split/gemma4-05-target-graph

Conversation

@dusterbloom

@dusterbloom dusterbloom commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Validation (for review)

Upstream llama-bench baseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.

Gemma4 target forward in pure ggml — SWA + full-attn, proportional RoPE, PLE, MoE FFN, shared KV.

Metric Without PR With PR
AR + Q4 @ 16k prefill n/a 678 tok/s (+80% vs upstream 376.62)
AR + Q4 @ 16k decode n/a 22.6 tok/s (+11% vs upstream 20.39)
Byte-identical to upstream llama-cli AR Q4 @ temp=0 n/a ✓ (Q4 lossless)

PR #5 of the Gemma4 split sequence. Adds the basic Gemma4 target graph + cache lifecycle. No pFlash, no DFlash draft, no MTP — those land in PRs #7, #9, #10 respectively.

Depends on #171#170#169#168. Stacks on split/gemma4-04-api-target-loader; the diff visible here will include those ancestor commits until they merge.

Scope (4 files exactly per pr05-target-graph.md)

  • dflash/src/gemma4_target_graph.cpp — basic graph builders (rms_norm_mul, build_geglu_ffn, build_moe_ffn, build_swa_attn_block, build_full_attn_block, build_gemma4_graph) + cache lifecycle (create_gemma4_cache, free_gemma4_cache, reset_gemma4_cache) + compute_swa_view.
  • dflash/src/internal.h — adds GemmaTargetCache, GemmaGraphInputs, GemmaGraphOutputs, SwaView structs + 5 prototypes (no draft/MTP/pflash fields).
  • dflash/test/gemma4/smoke_gemma4_target_forward.cpp — minimal forward smoke (uses the stripped 4-arg create_gemma4_cache API).
  • dflash/CMakeLists.txt — wire the new source into dflash27b and add the smoke_gemma4_target_forward target.

What was stripped from the source branch (relative to feature/gemma4-support)

  • build_full_attn_block: removed use_pflash / pflash_alpha params and the entire ggml_flash_attn_sparse dispatch branch — only ggml_flash_attn_ext remains.
  • create_gemma4_cache: removed extra_q8_layers + enable_dflash_capture_overrides params and the asymmetric-KV / MTP-donor override blocks. Per-layer KV type is now a uniform assignment; PR feat: add DFlash for Windows #6 will reintroduce the asymmetric path.
  • free_gemma4_cache: dropped the free_draft_kv_cache(c) call.
  • reset_gemma4_cache: dropped c.draft_kv_pos = 0.
  • build_gemma4_graph: dropped the pflash args passed to build_full_attn_block and the entire MTP h_prev capture block (~50 lines post-final-norm).
  • Whole-function deletions: create_draft_kv_cache, free_draft_kv_cache.
  • GemmaTargetCache (in internal.h): no mtp_h_prev_* fields, no draft_kv_* fields.
  • GemmaGraphInputs: no use_pflash, no pflash_alpha.

Final source is 964 lines (down from 1189 in the feature branch).

Verification

  • rg -n \"pflash|flash_attn_sparse|MTP|h_prev|draft|g_kq_stride_pad|DFLASH_PFLASH_TQ3\" dflash/src/gemma4_target_graph.cppzero matches.
  • rg ... over the new internal.h section → zero matches.
  • g++ -c gemma4_target_graph.cpp via the project's compile_commands.json flags → exit 0, no warnings.
  • g++ -c gemma4_target_loader.cpp (PR04's loader) against the new internal.hexit 0.
  • Full cmake --build dflash/build --target smoke_gemma4_target_forward exceeded a 5-min timeout on the dev host (likely scaling out across the dflash27b lib); per-TU compiles cover the changed code. CI is the authoritative end-to-end check.

Risk: HIGH (first execution path)

Review checklist (from spec)

  • Cache create/free/reset is symmetrical ✓ (paired alloc/free of base_ctx + base_buf)
  • Target graph can execute a minimal forward pass (smoke test exercises it)
  • No speculative-decode state added (verified by rg)
  • CMake adds only the target graph source + target-forward smoke (verified by scope check)

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 15 files

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 13 files

dusterbloom and others added 6 commits May 13, 2026 21:14
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
The test calls pflash_register_ggml_kernel(), which is defined in
src/pflash_ggml_adapter.cpp. That source is only compiled into
dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test
target was unguarded, so CI runners building at sm<80 successfully
built dflash27b without the adapter, then failed to link the test:

    undefined reference to `pflash_register_ggml_kernel()'

Mirror the same guard used by test_flashprefill_kernels at line ~396.
Plain-C++ compile units that include internal.h (e.g. smoke tests built
with g++ rather than nvcc) hit a fatal error on the unconditional
#include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header
is only needed by src/cuda_cross_device_copy.cpp (the implementation
TU), which already includes it directly.

Replace the header include with a forward declaration of cudaStream_t
(typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_
between_devices prototype don't need CUDA include paths.

Found via CI failure on the smoke_load_gemma4_target /
smoke_gemma4_target_forward targets after CI started reaching them
following the test_flash_attn_sparse cmake guard fix.
@dusterbloom dusterbloom force-pushed the split/gemma4-05-target-graph branch from a081e6c to 78971b8 Compare May 13, 2026 19:16
The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing
gemma4_target_graph SWA layers to fall back to attn_mask via the
effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the
full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc
slots, so flash attention reads past the end of attn_mask into adjacent
GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB
bytes are interpreted as fp16 mask values added to attention scores).

Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not. The bug is
documented in gemma4_runtime_helpers.cpp:124-130 — the runtime helper used
by the daemon path sets swa_mask correctly. The smoke driver bypassed that
helper.

Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill
the same causal pattern as attn_mask. Caught by running the test on a
real GPU (CI only compiles, doesn't run).

No production code changed.
…header

Two small comment-hygiene cleanups surfaced by the slop audit:

- TARGET_FEAT_CAP_DEFAULT = 4096 had no rationale comment. Add one
  that explains the floor sizing (covers full SWA window × a couple
  layers of DFlash lookback), notes the cap is then clamped to
  max_ctx, and flags the assumption swa_window <= 4096 for future
  maintenance.

- The "// ── g) Output projection already done inside attn block ──"
  block was an empty section header with no code beneath. Delete
  the header; section (h) reads cleanly on its own.

See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Additional findings
— magic numbers / comments that explain what, not why).
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 14, 2026
…decode (supersedes PR Luce-Org#175 skeleton)

Replaces howard0su's PR Luce-Org#175 Gemma4 skeleton with a feature-complete
implementation. Howard's PR landed an AR-only backend (341 LoC) sufficient
to validate the daemon protocol; this PR brings the production paths.

What changes
------------
gemma4/gemma4_backend.{cpp,h}  341+91   → 1182+155
  - decode_autoregressive (AR)
  - decode_dflash         (speculative decode with DDTree drafter)
  - decode_mtp            (Multi-Token Prediction, γ=1)
  - snapshot_save / snapshot_restore
  - park / unpark for speculative draft

gemma4/gemma4_graph.cpp        448 → 1218
  - iSWA target forward in pure ggml
  - target_feat capture (hidden states for DFlash draft KV prefill)
  - mtp_h_prev capture   (post-output-norm hidden for MTP cross-attn)
  - sparse-FA dispatch via ggml_flash_attn_sparse (full-attn layers, decode-only)
  - F5 v2 gate (n_tokens==1) — CUDA dispatcher has no sparse kernel
    for head_dim=512+mask+S>1 (fattn.cu:572-576 abort)
  - Per-layer embedding decode handling

gemma4/gemma4_loader.cpp       370 → 1197
  - Per-layer-embedding (PLE) table loading
  - MoE expert metadata (top-k routing, expert_count, shared_exp)
  - Asymmetric KV override (MTP donor layers forced to Q8)

gemma4/gemma4_daemon.{cpp,h}    29+20 → 43+46
  - Extended Gemma4BackendConfig (draft_method, mtp_gamma,
    sparse_fa_alpha, draft_kv_cap_override, max_ctx)

gemma4/gemma4_internal.h       184 → 17  (passthrough to ../internal.h)
  - Internal struct definitions consolidated into shared ../internal.h
    where they sit alongside Qwen3/Laguna/Qwen35 struct families
    (GemmaTargetWeights / GemmaTargetCache / MtpLayerWeights / SwaView /
    Gemma4GraphInputs / Gemma4GraphOutputs / GemmaDraftLayer /
    GemmaDraftWeights / MtpDrafterWeights / MtpStepGraph)
  - Howard's Gemma4Weights / Gemma4Cache / Gemma4Snapshot struct names
    are NOT carried — our pre-existing GemmaTarget* family is the
    canonical naming across all archs in this tree

New files
---------
gemma4/gemma4_runtime_helpers.{cpp,h}  shared graph builders + masks
gemma4_dflash_graph.cpp                drafter step graph (top-level — separate model)
gemma4_mtp_graph.cpp                   MTP step graph (top-level — separate head)
include/gemma4.h                       public C-ish API surface

Tests
-----
test/gemma4/smoke_load_gemma4_target.cpp   loader smoke (Dense 31B Q4_K_M)
test/gemma4/smoke_load_gemma4_draft.cpp    drafter smoke
test/gemma4/smoke_gemma4_target_forward.cpp  AR forward smoke
test/gemma4/smoke_gemma4_draft_forward.cpp   drafter forward smoke
test/gemma4/test_gemma4_kv_tq3.cpp         TQ3_0 KV cache correctness at 16k+
test/gemma4/test_mtp_loader.cpp            MTP assistant GGUF loader
test/gemma4/test_mtp_graph_shapes.cpp      MTP step graph shape invariants

Validated configurations (RTX 3090, Gemma4-31B-it Q4_K_M)
--------------------------------------------------------
  AR + Q4 KV @ 16k:     678 pp / 22.6 tg tok/s  (+80% / +11% vs upstream)
  AR + TQ3 KV @ 16k:    739 pp / 16.6 tg
  MTP γ=1 + Q4 @ 16k:   691 pp / 21.4 tg
  Byte-identical to upstream llama-cli AR Q4 @ temp=0 greedy

Naming clarification
--------------------
The `use_pflash` / `pflash_alpha` / `s_pflash_tq3` / `DFLASH_PFLASH_TQ3`
identifiers in this commit refer to the ggml_flash_attn_sparse op, NOT
the PFlash product (Python speculative-prefill compression module at
pflash/). A follow-up commit on this branch renames them to use_sparse_fa
etc. to remove the ambiguity.

Closes (when merged): Luce-Org#171, Luce-Org#176, Luce-Org#179, Luce-Org#184, Luce-Org#185 — those were the
original split-PR series against pre-Luce-Org#175 main; reworked into this single
PR after Luce-Org#175 brought the gemma4/ skeleton into main.
@dusterbloom

Copy link
Copy Markdown
Collaborator Author

Superseded by #193 — re-cut against current main after #175 brought the gemma4/ skeleton in. The split chain was overtaken by upstream work (PR #175 + #178 + #170 all merged during the rebase window), so re-opening as a single cohesive replacement PR is the cleaner path.

The content of this PR is included in #193. Closing here to consolidate review surface for mrciffa.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant