feat: Slow-Fast inference support #163

Closed
howard0su wants to merge 9 commits into Luce-Org:main from howard0su:sfi

@howard0su howard0su commented May 13, 2026

Contributor

Summary

This branch adds the qwen35 SFI (Slow-Fast Inference) sparse full-attention path and fixes the benchmark/runtime plumbing needed to measure it correctly.

At a high level, single-token full-attention decode steps can now attend to a sparse set of cached KV positions instead of the full window, while periodic refresh steps still run dense attention to keep quality stable.

What changed

  • add SFI gather plumbing to the qwen35 step graph
  • enable sparse KV gather in qwen35 full-attention blocks during single-token/no-mask decode
  • represent SFI gather indices as [budget, n_head_kv] so sparse KV gather matches ggml row-gather requirements
  • initialize and refresh SFI selector state in the qwen35 runtime/test path
  • seed decode from prefill last_tok correctly in the daemon fallback loop
  • update qwen35 compress handling to use the current protocol and caller-provided drafter path/arch
  • update benchmark defaults to repo-local, compatible draft model paths
  • harden the Python benchmark client/harness so daemon exits or empty results fail loudly instead of being reported as valid runs
  • update README examples/documentation to match the current runtime protocol
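A toy illustration of the index-shape point above, in plain Python rather than ggml. It assumes the KV cache is laid out as one matrix of cached rows per KV head, so a per-head gather needs one row of positions per head; `gather_kv_rows` and the nested-list layout are hypothetical stand-ins, not code from this branch.

```python
from typing import List


def gather_kv_rows(cache: List[List[List[float]]],
                   idx: List[List[int]]) -> List[List[List[float]]]:
    """Gather cached rows per KV head.

    cache: [n_head_kv][n_ctx][head_dim]  (stand-in for a 3D KV cache)
    idx:   [n_head_kv][budget]           (one row of positions per KV head)
    returns [n_head_kv][budget][head_dim]
    """
    if len(idx) != len(cache):
        # A flat [budget] index list has no per-head dimension to pair with.
        raise ValueError("index tensor needs one row of positions per KV head")
    return [[head[pos] for pos in positions]
            for head, positions in zip(cache, idx)]


# Toy cache: 2 KV heads, 4 cached positions, head_dim 2.
cache = [[[float(h * 10 + p), float(h * 10 + p) + 0.5] for p in range(4)]
         for h in range(2)]
# The same selected positions replicated for each head (budget = 2).
idx = [[0, 3], [0, 3]]
gathered = gather_kv_rows(cache, idx)
```

Here `idx` with one row per head corresponds to the `[budget, n_head_kv]` shape in the list above, if dimensions are read fastest-first.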

Why

Before this, the SFI benchmark path was not trustworthy:

  • the qwen35 daemon still used an old compress protocol and a hardcoded drafter path
  • benchmark defaults pointed at stale/incompatible draft paths
  • broken runs could be reported as successful
  • full SFI crashed at runtime because sparse gather indices had the wrong shape for 3D KV cache gathers

This branch fixes those issues and makes bench_sfi_ab.py exercise the real path end to end.
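A minimal sketch of the "fail loudly" behaviour described above, assuming the harness launches each run as a subprocess. `run_checked` is a hypothetical helper name and is not taken from the benchmark scripts in this branch.

```python
import subprocess
import sys
from typing import List


def run_checked(cmd: List[str], timeout_s: float = 60.0) -> str:
    """Run a benchmark command and refuse to treat failures as results."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    if proc.returncode != 0:
        # A crashed or killed process must not be reported as a valid run.
        raise RuntimeError(
            f"process exited with code {proc.returncode}: {proc.stderr.strip()}"
        )
    out = proc.stdout.strip()
    if not out:
        # An empty result is treated as a failure, not as a zero-length answer.
        raise RuntimeError("process produced no output")
    return out


# Usage with a trivial stand-in command instead of the real benchmark.
result = run_checked([sys.executable, "-c", "print('ok')"])
```

Raising instead of recording the failure lets the calling script exit non-zero, so CI or a human cannot mistake a broken run for a measurement.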

Benchmark

32K context, 1 sample:

| Mode | Wall | Score | Gen | TTFT | Acc |
|---|---|---|---|---|---|
| baseline | 46.94s | 14.7s | 5.0s | 19.7s | 1/1 |
| refresh-only | 45.88s | 15.5s | 4.9s | 20.4s | 1/1 |
| sfi-full | 45.83s | 15.3s | 4.4s | 19.7s | 1/1 |

vs baseline, full SFI:

  • wall time: -2.4%
  • generation time: -12.0%
  • TTFT: no change
  • accuracy: 1/1
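The percentages above can be reproduced from the table as relative change against baseline. The sketch assumes the table columns map to wall, generation, and TTFT times as labelled; `pct_change` is a hypothetical helper.

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from the baseline and sfi-full rows of the table (seconds).
baseline = {"wall": 46.94, "gen": 5.0, "ttft": 19.7}
sfi_full = {"wall": 45.83, "gen": 4.4, "ttft": 19.7}

deltas = {k: round(pct_change(baseline[k], sfi_full[k]), 1) for k in baseline}
print(deltas)
```

With a single sample per mode, differences of this size are within what run-to-run noise could plausibly produce.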

How to test

 # baseline
 python3 scripts/bench_llm.py

 # full SFI, always sparse on eligible steps
 DFLASH27B_SFI_BUDGET=2048 python3 scripts/bench_llm.py

 # full SFI + periodic dense refresh
 DFLASH27B_SFI_BUDGET=2048 DFLASH27B_FA_REFRESH_INTERVAL=4096 \
 python3 scripts/bench_llm.py
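For reference, a Python analogue of how integer knobs such as `DFLASH27B_SFI_BUDGET` might be read. The real `parse_env_int()` helper lives in the C++ header and its exact behaviour is not shown in this PR, so the default of 0 and the fallback on malformed values are assumptions.

```python
import os
from typing import Mapping, Optional


def parse_env_int(name: str, default: int = 0,
                  env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default  # unset: feature stays off
    try:
        return int(raw.strip())
    except ValueError:
        return default  # assumed behaviour: malformed values are ignored


# Example with an explicit mapping instead of the process environment.
settings = {"DFLASH27B_SFI_BUDGET": "2048"}
budget = parse_env_int("DFLASH27B_SFI_BUDGET", env=settings)
refresh = parse_env_int("DFLASH27B_FA_REFRESH_INTERVAL", env=settings)
```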

@howard0su howard0su force-pushed the sfi branch 2 times, most recently from 730fba9 to 2b8e0d9 May 13, 2026 10:14
howard0su and others added 7 commits May 16, 2026 09:30
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Extract decode-side SFI helpers into sfi_decode_utils.h (shared header)
  - resolve_attn_window_slice(): fast/slow window resolution
  - merge_sparse_index_sets(): sink+recent+selected merge with dedup
- qwen35_target_graph.cpp now uses the shared header (no logic change)
- New test_sfi_kernels binary: 37 tests covering core + edge cases
  - Window slice: q8/tq3 padding, refresh gating, boundary conditions
  - Sparse merge: dedup, OOB filtering, sink/recent overlap, empty sets
  - --bench flag: micro-benchmark showing 19.7x fewer indices at 128K
- New bench_sfi_ab.py: automated A/B comparison script
  - Runs baseline (no refresh) vs SFI (periodic refresh) back-to-back
  - Parses and compares score_s, gen_s, ttft, accuracy
  - 32K NIAH on 2080 Ti: 6.3% TTFT improvement, 8.2% decode speedup
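A rough Python sketch of the sink + recent + selected merge described in this commit message. The actual `merge_sparse_index_sets()` is C++ and its signature is not shown here, so the parameters `n_ctx`, `n_sink`, and `n_recent` are assumptions made for illustration.

```python
from typing import Iterable, List


def merge_sparse_index_sets(n_ctx: int, n_sink: int, n_recent: int,
                            selected: Iterable[int]) -> List[int]:
    """Union of sink, recent, and selected positions: deduplicated and sorted."""
    sink = range(min(n_sink, n_ctx))                 # first tokens of the context
    recent = range(max(0, n_ctx - n_recent), n_ctx)  # trailing local window
    merged = set(sink) | set(recent)
    # Drop out-of-bounds selections; the set removes duplicates and overlaps.
    merged.update(i for i in selected if 0 <= i < n_ctx)
    return sorted(merged)


merged = merge_sparse_index_sets(10, n_sink=2, n_recent=3,
                                 selected=[1, 5, 5, 12, -1, 8])
print(merged)  # [0, 1, 5, 7, 8, 9]
```

Sorting the result keeps gathered positions in cache order, which is one reasonable choice and not necessarily what the header does.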

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- TargetCache: add sfi_selector (per-layer scores), sfi_selected (cached
  indices), sfi_budget (DFLASH27B_SFI_BUDGET env control)
- sfi_decode_utils.h: add update_selector_scores(), topk_from_scores(),
  compute_sfi_indices() — full paper-aligned selection pipeline
- build_full_attn_block: accept optional sfi_gather_idx/sfi_gather_len
  for sparse K/V gather via ggml_get_rows on fast (non-refresh) steps
- build_single_layer: plumb SFI gather params through
- test_sfi_kernels: 49 tests (12 new selector/topk/integration tests)
- Micro-bench @ 128K: selection pipeline takes ~1.4ms, yields 2048
  indices (1.6% of full context) — 62x fewer tokens than dense
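A small sketch of score-based top-k selection plus the arithmetic behind the figures above. The function is a Python stand-in for `topk_from_scores()` with an assumed signature, and the numbers assume "128K" means 128,000 tokens, which is what makes 1.6% and 62x line up.

```python
import heapq
from typing import List, Sequence


def topk_from_scores(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in ascending position order."""
    if k <= 0:
        return []
    top = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(top)


picked = topk_from_scores([0.1, 0.9, 0.3, 0.7], 2)
print(picked)  # [1, 3]

# Budget arithmetic for the micro-benchmark figures.
budget, n_ctx = 2048, 128_000
fraction_pct = round(budget / n_ctx * 100, 1)  # share of the context attended
reduction = n_ctx // budget                    # whole-number reduction factor
```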

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- QwenGraphInputs: add sfi_gather_idx/sfi_gather_len fields
- build_qwen35_graph: pass caller SFI tensor to all FA layers
- build_target_step: create sfi_gather_idx tensor when budget > 0,
  single-token decode, no mask
- sfi_fill_indices(): fill sparse index tensor from cache before compute
- sfi_decode_utils.h: add refresh_selector_heuristic() for bootstrap,
  parse_env_int() helper
- test_dflash.cpp: initialize SFI after prefill, fill indices before
  sequential-verify computes, refresh indices at refresh boundaries
- test_sfi_kernels: 54 tests (5 new heuristic coverage tests)

The SFI path now activates end-to-end when DFLASH27B_SFI_BUDGET > 0:
  1. After prefill: bootstrap selector with heuristic scores
  2. On fast steps: gather sparse K/V (budget tokens) instead of
     full windowed attention
  3. At refresh boundaries: re-score and recompute indices
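The three steps above can be sketched as a per-step schedule. This is a simplified model: it assumes refresh boundaries fall every `refresh_interval` decode steps and ignores the windowed refresh-only mode; `plan_decode_steps` and its parameters are hypothetical.

```python
from typing import List


def plan_decode_steps(n_steps: int, sfi_budget: int,
                      refresh_interval: int) -> List[str]:
    """Label each decode step as dense, sparse (fast), or refresh (slow)."""
    plan = []
    for step in range(n_steps):
        if sfi_budget <= 0:
            plan.append("dense")    # SFI disabled: normal attention
        elif refresh_interval > 0 and step > 0 and step % refresh_interval == 0:
            plan.append("refresh")  # dense step that also re-scores indices
        else:
            plan.append("sparse")   # gather only `sfi_budget` cached positions
    return plan


print(plan_decode_steps(6, sfi_budget=2048, refresh_interval=3))
# ['sparse', 'sparse', 'sparse', 'refresh', 'sparse', 'sparse']
```

With a budget set and no refresh interval, every eligible step is sparse, which matches the "always sparse" configuration in the test commands.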

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…full)

- Add --sfi-budget flag (sets DFLASH27B_SFI_BUDGET env for full SFI)
- Run 3 configs: no-refresh baseline, refresh-only, refresh + sparse gather
- Verified at 32K: all 3 produce correct NIAH, SFI path activates

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@howard0su howard0su changed the title from "Fast-Slow inference" to "feat: Slow-Fast inference support" May 16, 2026
@howard0su howard0su marked this pull request as ready for review May 16, 2026 04:39

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 21 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/bench_sfi_ab.py">

<violation number="1" location="dflash/scripts/bench_sfi_ab.py:111">
P1: Benchmark subprocess failures are recorded but not enforced, so the script can report results and exit successfully after failed runs.</violation>
</file>

<file name="pflash/README.md">

<violation number="1" location="pflash/README.md:91">
P2: README weight-download step does not produce the GGUF draft artifact now required by updated commands.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread dflash/scripts/bench_sfi_ab.py Outdated
Comment thread pflash/README.md Outdated
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 16 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pflash/pflash/dflash_client.py">

<violation number="1">
P1: compress and generate no longer detect daemon death and can silently return partial/empty outputs on subprocess failure.</violation>
</file>

<file name="pflash/tests/bench_niah_cpp.py">

<violation number="1">
P2: Hardcoded absolute defaults make the benchmark fail in a normal checkout unless every path is overridden.</violation>
</file>

@davide221 davide221 self-requested a review May 16, 2026 17:49
@davide221 davide221 self-assigned this May 16, 2026
@davide221 davide221 removed their request for review May 16, 2026 17:50
@howard0su

Contributor Author

After we fixed the RoPE bug, this PR no longer shows a gain. Closing it.

@howard0su howard0su closed this May 21, 2026