Skip to content

fix(gemma4): add long-context KV correctness#177

Closed
dusterbloom wants to merge 9 commits into
Luce-Org:mainfrom
dusterbloom:split/gemma4-06-kv-correctness
Closed

fix(gemma4): add long-context KV correctness#177
dusterbloom wants to merge 9 commits into
Luce-Org:mainfrom
dusterbloom:split/gemma4-06-kv-correctness

Conversation

@dusterbloom

@dusterbloom dusterbloom commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Validation (for review)

Upstream llama-bench baseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.

Long-context KV correctness for TQ3 KV (ring-buffer + SWA boundary). Plus: ggml_backend_buffer_clear replaces per-tensor 1 MiB chunked memset on cache reset (matches llama.cpp convention).

Metric Without PR With PR
TQ3 KV decode at 16k+ ⚠ undefined (KV drift) ✓ correct
AR + TQ3+pFlash @ 16k n/a 739 pp / 16.6 tg tok/s
AR + TQ3+pFlash @ 64k n/a 474 pp / 8.1 tg tok/s
Cache-reset I/O seconds (chunked memset) 1 DMA op
Byte-identical to upstream Q4 @ temp=0 n/a

PR #6 of the Gemma4 split sequence. Adds the targeted test that validates long-context KV correctness for the Gemma4 target graph.

Depends on #176#171#170#169#168. Stacks on split/gemma4-05-target-graph; the diff visible here will include those ancestor commits until they merge.

Scope (2 files)

  • dflash/test/gemma4/test_gemma4_kv_tq3.cpp — new test (179 lines) covering TQ3-quantized KV, SWA ring wrap, decode mask for n_tokens == 1. Uses the 4-arg create_gemma4_cache(w, max_ctx, backend, cache) API.
  • dflash/CMakeLists.txt — wires the test_gemma4_kv_tq3 target (CUDA-linked, like the other gemma4 tests).

Note on the implementation hunks listed in pr06-kv-correctness.md

The PR06 spec lists four implementation-side categories of changes to gemma4_target_graph.cpp / internal.h:

  • SWA ring/window behavior
  • target KV cache alignment + wrapping
  • TQ3-compatible K/V handling
  • decode mask fixes for n_tokens == 1

While extracting PR #5 we found those four behaviors to be structurally inseparable from the basic target graph (the SWA ring sizing, TQ3 256-alignment, and per-layer KV type arrays are part of the same create_gemma4_cache block; the decode-mask path lives inside build_swa_attn_block / build_full_attn_block). Splitting them out would have produced a PR #5 that does not compile or execute end-to-end. They therefore landed with #5 (which scopes them honestly in its body).

This PR adds the test that proves those behaviors are correct. A git diff split/gemma4-05-target-graph..feature/gemma4-support -- dflash/src/gemma4_target_graph.cpp dflash/src/internal.h confirms that the remaining deltas to those files belong to PR #7 (pFlash), #9 (DFlash draft) and #10 (MTP) — there are no isolated KV-correctness hunks left in either file.

If maintainers prefer the spec's stricter split, the SWA-ring / TQ3-align code can be re-deferred from #5 to this PR in a re-roll. Flagging for discussion.

Risk: HIGH (test exercises long-context paths that drive the production decode loop)

Review checklist

  • Test is small, targeted, and runtime-validates the KV path (./test_gemma4_kv_tq3 <target.gguf>) ✓
  • SWA wrap behavior is separated from pFlash performance work (pFlash dispatch remains stripped — landing in feat(dflash): implement persistent daemon mode for server.py #7) ✓
  • No new runtime mode introduced ✓
  • No server / tokenizer surface changes ✓

Validation

  • g++ -fsyntax-only test_gemma4_kv_tq3.cpp with the project's CUDA-aware flags → exit 0.
  • Full cmake --build dflash/build --target test_gemma4_kv_tq3 was deferred to CI (the local 5-min build window on this host wasn't enough to relink dflash27b).
  • Runtime invocation requires a Gemma4 target GGUF on disk.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 16 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/test/test_flash_attn_sparse.cpp">

<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:114">
P2: The sparse-vs-dense check is far too lenient for a test that claims exact agreement; `max_diff < 1.0f` can hide real correctness regressions.</violation>
</file>

<file name="dflash/CMakeLists.txt">

<violation number="1" location="dflash/CMakeLists.txt:535">
P2: Two new Gemma4 CUDA tests (`smoke_gemma4_target_forward` and `test_gemma4_kv_tq3`) include `internal.h` which pulls in `<cuda_runtime.h>`, but they were omitted from `_dflash_internal_h_cuda_tests`, so they won't link `CUDA::cudart` and can fail to compile.</violation>
</file>

<file name="dflash/test/gemma4/test_gemma4_kv_tq3.cpp">

<violation number="1" location="dflash/test/gemma4/test_gemma4_kv_tq3.cpp:56">
P2: Test claims long-context KV correctness, but it only verifies cache allocation/metadata and never executes the decode/attention path.</violation>

<violation number="2" location="dflash/test/gemma4/test_gemma4_kv_tq3.cpp:116">
P2: Shared-layer donor validation is too weak: it only checks range, so an incorrect in-range donor mapping would still pass and hide KV wiring bugs.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread dflash/test/test_flash_attn_sparse.cpp Outdated
Comment thread dflash/CMakeLists.txt
smoke_draft_graph
test_vs_oracle
smoke_load_target
smoke_load_gemma4_target

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Two new Gemma4 CUDA tests (smoke_gemma4_target_forward and test_gemma4_kv_tq3) include internal.h which pulls in <cuda_runtime.h>, but they were omitted from _dflash_internal_h_cuda_tests, so they won't link CUDA::cudart and can fail to compile.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/CMakeLists.txt, line 535:

<comment>Two new Gemma4 CUDA tests (`smoke_gemma4_target_forward` and `test_gemma4_kv_tq3`) include `internal.h` which pulls in `<cuda_runtime.h>`, but they were omitted from `_dflash_internal_h_cuda_tests`, so they won't link `CUDA::cudart` and can fail to compile.</comment>

<file context>
@@ -511,6 +532,7 @@ if(DFLASH27B_TESTS)
         smoke_draft_graph
         test_vs_oracle
         smoke_load_target
+        smoke_load_gemma4_target
         smoke_load_target_laguna
         smoke_laguna_forward
</file context>
Suggested change
smoke_load_gemma4_target
smoke_load_gemma4_target
smoke_gemma4_target_forward
test_gemma4_kv_tq3

@@ -0,0 +1,179 @@
// Smoke test: create a Gemma4 KV cache with TQ3_0 quantization and validate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Test claims long-context KV correctness, but it only verifies cache allocation/metadata and never executes the decode/attention path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/gemma4/test_gemma4_kv_tq3.cpp, line 56:

<comment>Test claims long-context KV correctness, but it only verifies cache allocation/metadata and never executes the decode/attention path.</comment>

<file context>
@@ -0,0 +1,179 @@
+
+    const int max_ctx = 1024;
+    GemmaTargetCache cache;
+    if (!create_gemma4_cache(w, max_ctx, backend, cache)) {
+        std::fprintf(stderr, "create_gemma4_cache: %s\n", dflash27b_last_error());
+        free_gemma4_target_weights(w);
</file context>

if (cache.layer_to_kv_idx[il] == -1) {
// This is a shared layer — must have a valid donor
const int donor = cache.layer_to_donor_kv[il];
if (donor < 0 || donor >= n_kv_slots) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Shared-layer donor validation is too weak: it only checks range, so an incorrect in-range donor mapping would still pass and hide KV wiring bugs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/gemma4/test_gemma4_kv_tq3.cpp, line 116:

<comment>Shared-layer donor validation is too weak: it only checks range, so an incorrect in-range donor mapping would still pass and hide KV wiring bugs.</comment>

<file context>
@@ -0,0 +1,179 @@
+        if (cache.layer_to_kv_idx[il] == -1) {
+            // This is a shared layer — must have a valid donor
+            const int donor = cache.layer_to_donor_kv[il];
+            if (donor < 0 || donor >= n_kv_slots) {
+                char buf[128];
+                std::snprintf(buf, sizeof(buf),
</file context>

dusterbloom and others added 6 commits May 13, 2026 21:14
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
The test calls pflash_register_ggml_kernel(), which is defined in
src/pflash_ggml_adapter.cpp. That source is only compiled into
dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test
target was unguarded, so CI runners building at sm<80 successfully
built dflash27b without the adapter, then failed to link the test:

    undefined reference to `pflash_register_ggml_kernel()'

Mirror the same guard used by test_flashprefill_kernels at line ~396.
Plain-C++ compile units that include internal.h (e.g. smoke tests built
with g++ rather than nvcc) hit a fatal error on the unconditional
#include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header
is only needed by src/cuda_cross_device_copy.cpp (the implementation
TU), which already includes it directly.

Replace the header include with a forward declaration of cudaStream_t
(typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_
between_devices prototype don't need CUDA include paths.

Found via CI failure on the smoke_load_gemma4_target /
smoke_gemma4_target_forward targets after CI started reaching them
following the test_flash_attn_sparse cmake guard fix.
@dusterbloom dusterbloom force-pushed the split/gemma4-06-kv-correctness branch from 2dec9fc to ce08231 Compare May 13, 2026 19:16
The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing
gemma4_target_graph SWA layers to fall back to attn_mask via the
effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the
full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc
slots, so flash attention reads past the end of attn_mask into adjacent
GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB
bytes are interpreted as fp16 mask values added to attention scores).

Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not. The bug is
documented in gemma4_runtime_helpers.cpp:124-130 — the runtime helper used
by the daemon path sets swa_mask correctly. The smoke driver bypassed that
helper.

Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill
the same causal pattern as attn_mask. Caught by running the test on a
real GPU (CI only compiles, doesn't run).

No production code changed.
@dusterbloom dusterbloom force-pushed the split/gemma4-06-kv-correctness branch from ce08231 to 167ae83 Compare May 13, 2026 20:43
create_gemma4_cache and reset_gemma4_cache each iterated every tensor
in the cache and wrote a 1 MiB zero buffer per tensor via
ggml_backend_tensor_set. On a multi-GB Gemma4-31B KV cache that's
thousands of H2D copies per cache reset — and reset runs per generate().

ggml_backend_buffer_clear does the same work in one DMA op. This is the
llama.cpp convention (src/llama-kv-cache.cpp:267-270, 337-340).

reset_gemma4_cache also: guard changed from `if (!c.base_ctx)` to
`if (!c.base_buf)` so a cache with allocated buffer but no ggml_context
is still cleared correctly.

See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S9).
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Selectively ports the wrap-safe SWA K/V cache write from PR Luce-Org#177 into the current server/src/gemma4 layout. Refreshes the auto-integration manifest with current PR classification, probe logs, and validation notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record fresh open-PR probe results and the completed PR Luce-Org#177 Claude feasibility pass. No code changes were needed; current mechanical PR heads remain integrated and the remaining old-layout PRs still require selective ports.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Selectively ports PR Luce-Org#177's larger SWA ring allocation into the current server/src/gemma4 layout. A ring of exactly sliding_window slots can overwrite still-visible keys during chunked long-context prefill; use a 2x padded logical window capped at max_ctx while leaving remaining TQ3 KV alignment work for a supervised port.
@dusterbloom

Copy link
Copy Markdown
Collaborator Author

Closing: standalone dflash/src/gemma4_target_*.cpp backend is gone. Gemma4 now lives upstream at deps/llama.cpp/src/models/gemma4-iswa.cpp (vendored llama.cpp path), with spec-decode wiring via the server-side Gemma4 adapter (79abba9). The test harness API this PR targets (create_gemma4_cache) no longer exists. TQ3 KV correctness is now exercised through llama.cpp's own machinery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant