feat: Slow-Fast inference support #163
Closed
howard0su wants to merge 9 commits into
730fba9 to
2b8e0d9
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Extract decode-side SFI helpers into sfi_decode_utils.h (shared header)
  - resolve_attn_window_slice(): fast/slow window resolution
  - merge_sparse_index_sets(): sink+recent+selected merge with dedup
  - qwen35_target_graph.cpp now uses the shared header (no logic change)
- New test_sfi_kernels binary: 37 tests covering core + edge cases
  - Window slice: q8/tq3 padding, refresh gating, boundary conditions
  - Sparse merge: dedup, OOB filtering, sink/recent overlap, empty sets
  - --bench flag: micro-benchmark showing 19.7x fewer indices at 128K
- New bench_sfi_ab.py: automated A/B comparison script
  - Runs baseline (no refresh) vs SFI (periodic refresh) back-to-back
  - Parses and compares score_s, gen_s, ttft, accuracy
  - 32K NIAH on 2080 Ti: 6.3% TTFT improvement, 8.2% decode speedup
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
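The sink+recent+selected merge described in this commit can be sketched in a few lines of Python. This is only an illustration of the idea (dedup plus out-of-bounds filtering), not the code in sfi_decode_utils.h; the function name and the parameters `n_ctx`, `n_sink`, and `n_recent` are assumptions made for the example.

```python
def merge_index_sets_sketch(n_ctx, n_sink, n_recent, selected):
    """Return sorted, de-duplicated KV positions in [0, n_ctx).

    Hypothetical stand-in for a sink + recent + selected merge:
    - sink:     the first n_sink positions
    - recent:   the last n_recent positions
    - selected: arbitrary positions, possibly duplicated or out of bounds
    """
    sink = range(0, min(n_sink, n_ctx))
    recent = range(max(0, n_ctx - n_recent), n_ctx)
    keep = set(sink) | set(recent)
    # Out-of-bounds selected positions are dropped; the set removes duplicates.
    keep.update(i for i in selected if 0 <= i < n_ctx)
    return sorted(keep)


merged = merge_index_sets_sketch(
    n_ctx=16, n_sink=2, n_recent=3, selected=[1, 5, 5, 14, 99, -1]
)
print(merged)  # [0, 1, 5, 13, 14, 15]
```

Position 1 overlaps the sink set and 14 overlaps the recent set, so each appears once; 99 and -1 fall outside the context and are discarded.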
- TargetCache: add sfi_selector (per-layer scores), sfi_selected (cached indices), sfi_budget (DFLASH27B_SFI_BUDGET env control)
- sfi_decode_utils.h: add update_selector_scores(), topk_from_scores(), compute_sfi_indices() (full paper-aligned selection pipeline)
- build_full_attn_block: accept optional sfi_gather_idx/sfi_gather_len for sparse K/V gather via ggml_get_rows on fast (non-refresh) steps
- build_single_layer: plumb SFI gather params through
- test_sfi_kernels: 49 tests (12 new selector/topk/integration tests)
- Micro-bench @ 128K: selection pipeline takes ~1.4ms, yields 2048 indices (1.6% of full context), 62x fewer tokens than dense
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
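As a rough illustration of budgeted top-k selection from per-position scores, here is a minimal Python sketch. It does not reproduce topk_from_scores() or compute_sfi_indices() from this commit; the function name, the plain list of floats, and the choice to return indices in ascending order are all assumptions.

```python
import heapq


def topk_indices_sketch(scores, budget):
    """Pick the `budget` highest-scoring positions, returned in ascending order.

    `scores` is a list of floats, one per cached KV position (an assumed
    layout; the real selector keeps per-layer scores).
    """
    if budget <= 0:
        return []
    k = min(budget, len(scores))
    # nlargest ranks positions by score; sorting afterwards restores
    # position order so the result can be used as a gather index.
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


scores = [0.1, 0.9, 0.3, 0.7, 0.05]
print(topk_indices_sketch(scores, budget=2))   # [1, 3]
print(topk_indices_sketch(scores, budget=10))  # [0, 1, 2, 3, 4]
```

The ratios quoted in the commit are consistent with a 128,000-token context: 2048 / 128,000 = 1.6%, and 128,000 / 2048 = 62.5.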
- QwenGraphInputs: add sfi_gather_idx/sfi_gather_len fields
- build_qwen35_graph: pass caller SFI tensor to all FA layers
- build_target_step: create sfi_gather_idx tensor when budget > 0,
single-token decode, no mask
- sfi_fill_indices(): fill sparse index tensor from cache before compute
- sfi_decode_utils.h: add refresh_selector_heuristic() for bootstrap,
parse_env_int() helper
- test_dflash.cpp: initialize SFI after prefill, fill indices before
sequential-verify computes, refresh indices at refresh boundaries
- test_sfi_kernels: 54 tests (5 new heuristic coverage tests)
The SFI path now activates end-to-end when DFLASH27B_SFI_BUDGET > 0:
1. After prefill: bootstrap selector with heuristic scores
2. On fast steps: gather sparse K/V (budget tokens) instead of
full windowed attention
3. At refresh boundaries: re-score and recompute indices
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
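The three-phase flow in this commit (bootstrap after prefill, sparse gather on fast steps, re-score at refresh boundaries) can be pictured with a small scheduling sketch. The `refresh_every` parameter and the event labels are hypothetical; the source only states that the path is gated on DFLASH27B_SFI_BUDGET > 0 and that refreshes are periodic.

```python
def run_decode_sketch(n_steps, refresh_every, budget):
    """Label what each decode step would do under an SFI-style schedule.

    budget <= 0 disables the sparse path entirely (every step is dense).
    refresh_every is an assumed knob for how often indices are recomputed.
    """
    if refresh_every <= 0:
        raise ValueError("refresh_every must be positive")
    if budget <= 0:
        return ["dense"] * n_steps

    events = ["bootstrap"]  # after prefill: seed the selector scores
    for step in range(1, n_steps + 1):
        if step % refresh_every == 0:
            events.append("refresh")  # re-score and recompute indices
        else:
            events.append("fast")     # gather sparse K/V (budget tokens)
    return events


print(run_decode_sketch(n_steps=6, refresh_every=3, budget=2048))
# ['bootstrap', 'fast', 'fast', 'refresh', 'fast', 'fast', 'refresh']
```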
…full)
- Add --sfi-budget flag (sets DFLASH27B_SFI_BUDGET env for full SFI)
- Run 3 configs: no-refresh baseline, refresh-only, refresh + sparse gather
- Verified at 32K: all 3 produce correct NIAH, SFI path activates
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2 issues found across 21 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/scripts/bench_sfi_ab.py">
<violation number="1" location="dflash/scripts/bench_sfi_ab.py:111">
P1: Benchmark subprocess failures are recorded but not enforced, so the script can report results and exit successfully after failed runs.</violation>
</file>
<file name="pflash/README.md">
<violation number="1" location="pflash/README.md:91">
P2: README weight-download step does not produce the GGUF draft artifact now required by updated commands.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
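For the P1 finding on bench_sfi_ab.py, one possible shape of a fix is to derive the script's exit status from the recorded return codes. The sketch below is not taken from the script; `run_config`, `failed_runs`, `exit_status`, and the result-dict layout are hypothetical names chosen for the example.

```python
import subprocess


def run_config(name, cmd):
    """Run one benchmark configuration and record its return code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {"name": name, "returncode": proc.returncode, "stdout": proc.stdout}


def failed_runs(results):
    """Names of configurations whose subprocess exited non-zero."""
    return [r["name"] for r in results if r["returncode"] != 0]


def exit_status(results):
    """Non-zero if any run failed, so callers and CI can see the failure."""
    return 1 if failed_runs(results) else 0


# Recorded results as run_config() would return them (values are made up).
results = [
    {"name": "baseline", "returncode": 0, "stdout": "..."},
    {"name": "sfi", "returncode": 3, "stdout": ""},
]
print(failed_runs(results))  # ['sfi']
print(exit_status(results))  # 1
```

A real script would pass `exit_status(results)` to `sys.exit()` after printing the comparison, and would probably skip the comparison for any configuration that failed.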
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2 issues found across 16 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pflash/pflash/dflash_client.py">
<violation number="1">
P1: compress and generate no longer detect daemon death and can silently return partial/empty outputs on subprocess failure.</violation>
</file>
<file name="pflash/tests/bench_niah_cpp.py">
<violation number="1">
P2: Hardcoded absolute defaults make the benchmark fail in a normal checkout unless every path is overridden.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
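For the P1 finding on dflash_client.py, a common way to avoid silently returning partial output is to treat end-of-file on the daemon's stdout as an error. The following is a sketch under that assumption, not the client's actual code; `DaemonDied` and `read_reply_sketch` are hypothetical names, and the stand-in object mimics only the `stdout` and `poll()` attributes of a `subprocess.Popen` opened with `stdout=PIPE, text=True`.

```python
import io
import types


class DaemonDied(RuntimeError):
    """Raised when the daemon's output stream ends unexpectedly."""


def read_reply_sketch(proc):
    """Read one line from proc.stdout, failing loudly on EOF.

    In text mode, readline() returns "" only at end-of-file, which means
    the child closed its stdout (typically because it exited).
    """
    line = proc.stdout.readline()
    if line == "":
        raise DaemonDied(f"daemon closed stdout (exit code: {proc.poll()})")
    return line.rstrip("\n")


# Stand-in for a Popen object whose process printed one line and then died.
fake = types.SimpleNamespace(stdout=io.StringIO("ok 42\n"), poll=lambda: 1)
print(read_reply_sketch(fake))  # ok 42
try:
    read_reply_sketch(fake)
except DaemonDied as err:
    print("detected:", err)
```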
Author: After fixing the RoPE bug, this PR no longer shows a gain. Closing it.
Summary
This branch adds the qwen35 SFI (Slow-Fast Inference) sparse full-attention path and fixes the benchmark/runtime plumbing needed to measure it correctly.
At a high level, single-token full-attention decode steps can now attend to a sparse set of cached KV positions instead of the full window, while periodic refresh steps still run dense attention to keep quality stable.
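To make the dense-versus-sparse distinction concrete, here is a minimal single-query attention sketch in plain Python that attends only over a chosen list of cached positions. It is a toy model under simplifying assumptions (one head, no masking, no RoPE, lists instead of tensors) and is unrelated to the ggml graph code in this branch.

```python
import math


def attend(q, keys, values, positions):
    """Softmax attention of one query over the given cached positions only."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [
        scale * sum(qi * ki for qi, ki in zip(q, keys[p])) for p in positions
    ]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]


# Six cached positions with 2-dimensional keys and values (made-up numbers).
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 0.0], [0.0, 0.1], [1.5, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
q = [1.0, 0.0]

dense = attend(q, K, V, list(range(len(K))))  # refresh step: every position
sparse = attend(q, K, V, [0, 3, 5])           # fast step: selected subset
print(dense, sparse)
```

The sparse result only approximates the dense one; how close it gets depends on whether the omitted positions carried meaningful attention weight, which is why the selection step and the periodic dense refresh matter.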
What changed
- … [budget, n_head_kv] so sparse KV gather matches ggml row-gather requirements
- … last_tok correctly in the daemon fallback loop
- … compress handling to use the current protocol and caller-provided drafter path/arch
Why
Before this, the SFI benchmark path was not trustworthy:
- … compress protocol and a hardcoded drafter path
This branch fixes those issues and makes bench_sfi_ab.py exercise the real path end to end.
Benchmark
32K context, 1 sample:
vs baseline, full SFI:
How to test