fix(gemma4): add long-context KV correctness#177
Conversation
There was a problem hiding this comment.
4 issues found across 16 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/test/test_flash_attn_sparse.cpp">
<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:114">
P2: The sparse-vs-dense check is far too lenient for a test that claims exact agreement; `max_diff < 1.0f` can hide real correctness regressions.</violation>
</file>
<file name="dflash/CMakeLists.txt">
<violation number="1" location="dflash/CMakeLists.txt:535">
P2: Two new Gemma4 CUDA tests (`smoke_gemma4_target_forward` and `test_gemma4_kv_tq3`) include `internal.h` which pulls in `<cuda_runtime.h>`, but they were omitted from `_dflash_internal_h_cuda_tests`, so they won't link `CUDA::cudart` and can fail to compile.</violation>
</file>
<file name="dflash/test/gemma4/test_gemma4_kv_tq3.cpp">
<violation number="1" location="dflash/test/gemma4/test_gemma4_kv_tq3.cpp:56">
P2: Test claims long-context KV correctness, but it only verifies cache allocation/metadata and never executes the decode/attention path.</violation>
<violation number="2" location="dflash/test/gemma4/test_gemma4_kv_tq3.cpp:116">
P2: Shared-layer donor validation is too weak: it only checks range, so an incorrect in-range donor mapping would still pass and hide KV wiring bugs.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| smoke_draft_graph | ||
| test_vs_oracle | ||
| smoke_load_target | ||
| smoke_load_gemma4_target |
There was a problem hiding this comment.
P2: Two new Gemma4 CUDA tests (smoke_gemma4_target_forward and test_gemma4_kv_tq3) include internal.h which pulls in <cuda_runtime.h>, but they were omitted from _dflash_internal_h_cuda_tests, so they won't link CUDA::cudart and can fail to compile.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/CMakeLists.txt, line 535:
<comment>Two new Gemma4 CUDA tests (`smoke_gemma4_target_forward` and `test_gemma4_kv_tq3`) include `internal.h` which pulls in `<cuda_runtime.h>`, but they were omitted from `_dflash_internal_h_cuda_tests`, so they won't link `CUDA::cudart` and can fail to compile.</comment>
<file context>
@@ -511,6 +532,7 @@ if(DFLASH27B_TESTS)
smoke_draft_graph
test_vs_oracle
smoke_load_target
+ smoke_load_gemma4_target
smoke_load_target_laguna
smoke_laguna_forward
</file context>
| smoke_load_gemma4_target | |
| smoke_load_gemma4_target | |
| smoke_gemma4_target_forward | |
| test_gemma4_kv_tq3 |
| @@ -0,0 +1,179 @@ | |||
| // Smoke test: create a Gemma4 KV cache with TQ3_0 quantization and validate | |||
There was a problem hiding this comment.
P2: Test claims long-context KV correctness, but it only verifies cache allocation/metadata and never executes the decode/attention path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/gemma4/test_gemma4_kv_tq3.cpp, line 56:
<comment>Test claims long-context KV correctness, but it only verifies cache allocation/metadata and never executes the decode/attention path.</comment>
<file context>
@@ -0,0 +1,179 @@
+
+ const int max_ctx = 1024;
+ GemmaTargetCache cache;
+ if (!create_gemma4_cache(w, max_ctx, backend, cache)) {
+ std::fprintf(stderr, "create_gemma4_cache: %s\n", dflash27b_last_error());
+ free_gemma4_target_weights(w);
</file context>
| if (cache.layer_to_kv_idx[il] == -1) { | ||
| // This is a shared layer — must have a valid donor | ||
| const int donor = cache.layer_to_donor_kv[il]; | ||
| if (donor < 0 || donor >= n_kv_slots) { |
There was a problem hiding this comment.
P2: Shared-layer donor validation is too weak: it only checks range, so an incorrect in-range donor mapping would still pass and hide KV wiring bugs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/gemma4/test_gemma4_kv_tq3.cpp, line 116:
<comment>Shared-layer donor validation is too weak: it only checks range, so an incorrect in-range donor mapping would still pass and hide KV wiring bugs.</comment>
<file context>
@@ -0,0 +1,179 @@
+ if (cache.layer_to_kv_idx[il] == -1) {
+ // This is a shared layer — must have a valid donor
+ const int donor = cache.layer_to_donor_kv[il];
+ if (donor < 0 || donor >= n_kv_slots) {
+ char buf[128];
+ std::snprintf(buf, sizeof(buf),
</file context>
e65ef76 to
01fec23
Compare
4c8dc20 to
2dec9fc
Compare
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
The test calls pflash_register_ggml_kernel(), which is defined in
src/pflash_ggml_adapter.cpp. That source is only compiled into
dflash27b on CUDA + sm>=80 (the elseif branch at line ~291). The test
target was unguarded, so CI runners building at sm<80 successfully
built dflash27b without the adapter, then failed to link the test:
undefined reference to `pflash_register_ggml_kernel()'
Mirror the same guard used by test_flashprefill_kernels at line ~396.
Plain-C++ compile units that include internal.h (e.g. smoke tests built with g++ rather than nvcc) hit a fatal error on the unconditional #include <cuda_runtime.h> guarded by GGML_USE_CUDA. The runtime header is only needed by src/cuda_cross_device_copy.cpp (the implementation TU), which already includes it directly. Replace the header include with a forward declaration of cudaStream_t (typedef struct CUstream_st*) so consumers of the dflash_cuda_copy_ between_devices prototype don't need CUDA include paths. Found via CI failure on the smoke_load_gemma4_target / smoke_gemma4_target_forward targets after CI started reaching them following the test_flash_attn_sparse cmake guard fix.
2dec9fc to
ce08231
Compare
The smoke driver was leaving GemmaGraphInputs::swa_mask null, causing gemma4_target_graph SWA layers to fall back to attn_mask via the effective_mask = swa_mask ?: attn_mask path. attn_mask is sized for the full-attn view (kv_len_padded), but SWA layers view the full swa_ctx_alloc slots, so flash attention reads past the end of attn_mask into adjacent GPU memory. Manifests as all-NaN logits with Q4_0 / TQ3_0 KV (the OOB bytes are interpreted as fp16 mask values added to attention scores). Q8_0 tolerated the OOB read by accident; Q4_0 / TQ3_0 do not. The bug is documented in gemma4_runtime_helpers.cpp:124-130 — the runtime helper used by the daemon path sets swa_mask correctly. The smoke driver bypassed that helper. Allocate swa_mask sized [align_up(swa_ctx_alloc, 256), n_tokens] and fill the same causal pattern as attn_mask. Caught by running the test on a real GPU (CI only compiles, doesn't run). No production code changed.
ce08231 to
167ae83
Compare
create_gemma4_cache and reset_gemma4_cache each iterated every tensor in the cache and wrote a 1 MiB zero buffer per tensor via ggml_backend_tensor_set. On a multi-GB Gemma4-31B KV cache that's thousands of H2D copies per cache reset — and reset runs per generate(). ggml_backend_buffer_clear does the same work in one DMA op. This is the llama.cpp convention (src/llama-kv-cache.cpp:267-270, 337-340). reset_gemma4_cache also: guard changed from `if (!c.base_ctx)` to `if (!c.base_buf)` so a cache with allocated buffer but no ggml_context is still cleared correctly. See dflash/docs/gemma4-pr-split/pr13-slop-audit.md (Finding S9).
Selectively ports the wrap-safe SWA K/V cache write from PR Luce-Org#177 into the current server/src/gemma4 layout. Refreshes the auto-integration manifest with current PR classification, probe logs, and validation notes.
Record fresh open-PR probe results and the completed PR Luce-Org#177 Claude feasibility pass. No code changes were needed; current mechanical PR heads remain integrated and the remaining old-layout PRs still require selective ports.
Selectively ports PR Luce-Org#177's larger SWA ring allocation into the current server/src/gemma4 layout. A ring of exactly sliding_window slots can overwrite still-visible keys during chunked long-context prefill; use a 2x padded logical window capped at max_ctx while leaving remaining TQ3 KV alignment work for a supervised port.
|
Closing: standalone dflash/src/gemma4_target_*.cpp backend is gone. Gemma4 now lives upstream at deps/llama.cpp/src/models/gemma4-iswa.cpp (vendored llama.cpp path), with spec-decode wiring via the server-side Gemma4 adapter (79abba9). The test harness API this PR targets (create_gemma4_cache) no longer exists. TQ3 KV correctness is now exercised through llama.cpp's own machinery. |
Validation (for review)
Upstream
llama-benchbaseline (Gemma4-31B-it Q4_K_M, q4_0 KV, FA on, RTX 3090): pp16384 = 376.62 tok/s, tg300 = 20.39 tok/s.Long-context KV correctness for TQ3 KV (ring-buffer + SWA boundary). Plus:
ggml_backend_buffer_clearreplaces per-tensor 1 MiB chunked memset on cache reset (matches llama.cpp convention).PR #6 of the Gemma4 split sequence. Adds the targeted test that validates long-context KV correctness for the Gemma4 target graph.
Depends on #176 → #171 → #170 → #169 → #168. Stacks on
split/gemma4-05-target-graph; the diff visible here will include those ancestor commits until they merge.Scope (2 files)
dflash/test/gemma4/test_gemma4_kv_tq3.cpp— new test (179 lines) covering TQ3-quantized KV, SWA ring wrap, decode mask forn_tokens == 1. Uses the 4-argcreate_gemma4_cache(w, max_ctx, backend, cache)API.dflash/CMakeLists.txt— wires thetest_gemma4_kv_tq3target (CUDA-linked, like the other gemma4 tests).Note on the implementation hunks listed in
pr06-kv-correctness.mdThe PR06 spec lists four implementation-side categories of changes to
gemma4_target_graph.cpp/internal.h:n_tokens == 1While extracting PR #5 we found those four behaviors to be structurally inseparable from the basic target graph (the SWA ring sizing, TQ3 256-alignment, and per-layer KV type arrays are part of the same
create_gemma4_cacheblock; the decode-mask path lives insidebuild_swa_attn_block/build_full_attn_block). Splitting them out would have produced a PR #5 that does not compile or execute end-to-end. They therefore landed with #5 (which scopes them honestly in its body).This PR adds the test that proves those behaviors are correct. A
git diff split/gemma4-05-target-graph..feature/gemma4-support -- dflash/src/gemma4_target_graph.cpp dflash/src/internal.hconfirms that the remaining deltas to those files belong to PR #7 (pFlash), #9 (DFlash draft) and #10 (MTP) — there are no isolated KV-correctness hunks left in either file.If maintainers prefer the spec's stricter split, the SWA-ring / TQ3-align code can be re-deferred from #5 to this PR in a re-roll. Flagging for discussion.
Risk: HIGH (test exercises long-context paths that drive the production decode loop)
Review checklist
./test_gemma4_kv_tq3 <target.gguf>) ✓Validation
g++ -fsyntax-only test_gemma4_kv_tq3.cppwith the project's CUDA-aware flags → exit 0.cmake --build dflash/build --target test_gemma4_kv_tq3was deferred to CI (the local 5-min build window on this host wasn't enough to relinkdflash27b).