feat: Slow-Fast inference support by howard0su · Pull Request #163 · Luce-Org/lucebox-hub

howard0su · 2026-05-13T07:45:42Z

Summary

This branch adds the qwen35 SFI (Slow-Fast Inference) sparse full-attention path and fixes the benchmark/runtime plumbing needed to measure it correctly.

At a high level, single-token full-attention decode steps can now attend to a sparse set of cached KV positions instead of the full window, while periodic refresh steps still run dense attention to keep quality stable.
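The eligibility rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, its parameters, and the string return values are hypothetical and are not taken from the branch; the conditions (budget > 0, single-token decode, no mask, dense on refresh) are the ones this PR describes.

```python
def attention_mode(n_tokens: int, has_mask: bool, budget: int, is_refresh_step: bool) -> str:
    """Pick 'dense' or 'sparse' for one full-attention decode step.

    Hypothetical helper: sparse gather is only considered when an SFI budget
    is set, the step decodes a single token, and no attention mask is in use.
    Refresh steps always fall back to dense attention.
    """
    eligible = budget > 0 and n_tokens == 1 and not has_mask
    if not eligible or is_refresh_step:
        return "dense"
    return "sparse"


if __name__ == "__main__":
    print(attention_mode(n_tokens=1, has_mask=False, budget=2048, is_refresh_step=False))
    print(attention_mode(n_tokens=1, has_mask=False, budget=2048, is_refresh_step=True))
```

Keeping the refresh decision as an input, rather than computing it here, leaves the refresh schedule free to change without touching the eligibility check.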

What changed

- add SFI gather plumbing to the qwen35 step graph
- enable sparse KV gather in qwen35 full-attention blocks during single-token/no-mask decode
- represent SFI gather indices as [budget, n_head_kv] so sparse KV gather matches ggml row-gather requirements
- initialize and refresh SFI selector state in the qwen35 runtime/test path
- seed decode from prefill last_tok correctly in the daemon fallback loop
- update qwen35 compress handling to use the current protocol and caller-provided drafter path/arch
- update benchmark defaults to repo-local, compatible draft model paths
- harden the Python benchmark client/harness so daemon exits or empty results fail loudly instead of being reported as valid runs
- update README examples/documentation to match the current runtime protocol
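To make the index-shape point concrete, here is a pure-Python sketch of a per-head row gather. It is not the ggml code path: the nested-list layout, the function names, and the idea of repeating one position list for every KV head are assumptions chosen for illustration. It only shows why a gather over a 3D cache needs one row of indices per KV head rather than a single flat list.

```python
def broadcast_indices(positions, n_head_kv):
    """Repeat one list of selected positions for every KV head (assumed layout)."""
    return [list(positions) for _ in range(n_head_kv)]


def gather_kv_rows(cache, idx):
    """Gather rows from a cache laid out as cache[head][position] -> vector.

    idx[head] holds the positions to keep for that head, so idx must have
    exactly one entry per KV head.
    """
    if len(idx) != len(cache):
        raise ValueError("expected one index list per KV head")
    return [[rows[p] for p in head_idx] for rows, head_idx in zip(cache, idx)]


if __name__ == "__main__":
    # Two KV heads, three cached positions, head_dim of 1 for readability.
    cache = [[[0.0], [1.0], [2.0]], [[10.0], [11.0], [12.0]]]
    idx = broadcast_indices([0, 2], n_head_kv=2)
    print(gather_kv_rows(cache, idx))
```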

Why

Before this, the SFI benchmark path was not trustworthy:

- the qwen35 daemon still used an old compress protocol and a hardcoded drafter path
- benchmark defaults pointed at stale/incompatible draft paths
- broken runs could be reported as successful
- full SFI crashed at runtime because sparse gather indices had the wrong shape for 3D KV cache gathers

This branch fixes those issues and makes bench_sfi_ab.py exercise the real path end to end.
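As an illustration of the "fail loudly" behaviour, the following sketch wraps a benchmark subprocess so that a non-zero exit code or empty output raises instead of being recorded as a result. The helper name `run_checked` and its error messages are hypothetical; the actual harness in this branch may be structured differently.

```python
import subprocess
import sys


def run_checked(cmd, timeout_s=None):
    """Run a command and return its stdout, raising on failure or empty output."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    if proc.returncode != 0:
        # A dead or crashed process must not be reported as a valid run.
        raise RuntimeError(
            f"command exited with code {proc.returncode}: {proc.stderr.strip()}"
        )
    if not proc.stdout.strip():
        raise RuntimeError("command produced no output")
    return proc.stdout


if __name__ == "__main__":
    print(run_checked([sys.executable, "-c", "print('ok')"]).strip())
```

A caller that aggregates several runs can let the exception propagate, so the script as a whole exits with a non-zero status when any run fails.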

Benchmark

32K context, 1 sample:

| Mode | Wall | Score | Gen | TTFT | Acc |
|---|---|---|---|---|---|
| baseline | 46.94s | 14.7s | 5.0s | 19.7s | 1/1 |
| refresh-only | 45.88s | 15.5s | 4.9s | 20.4s | 1/1 |
| sfi-full | 45.83s | 15.3s | 4.4s | 19.7s | 1/1 |

vs baseline, full SFI:

- wall time: -2.4%
- generation time: -12.0%
- TTFT: no change
- accuracy: 1/1
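The percentages above follow from the baseline and sfi-full rows of the table. A short check, using a hypothetical `pct_change` helper:

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Values copied from the baseline and sfi-full rows of the benchmark table.
    print(f"wall: {pct_change(46.94, 45.83):+.1f}%")
    print(f"gen:  {pct_change(5.0, 4.4):+.1f}%")
    print(f"ttft: {pct_change(19.7, 19.7):+.1f}%")
```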

How to test

 # baseline
 python3 scripts/bench_llm.py

 # full SFI, always sparse on eligible steps
 DFLASH27B_SFI_BUDGET=2048 python3 scripts/bench_llm.py

 # full SFI + periodic dense refresh
 DFLASH27B_SFI_BUDGET=2048 DFLASH27B_FA_REFRESH_INTERVAL=4096 \
 python3 scripts/bench_llm.py
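
The commands above configure SFI through environment variables. A minimal Python sketch of reading such integer settings is shown here; the helper mirrors the idea of the `parse_env_int()` helper mentioned in the commits, but its signature and its fallback-to-default behaviour on malformed values are assumptions, not the branch's implementation.

```python
import os


def parse_env_int(name, default=0, env=None):
    """Read an integer setting from the environment, falling back to a default.

    Assumption: unset or malformed values fall back to `default`.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return int(raw.strip())
    except ValueError:
        return default


if __name__ == "__main__":
    budget = parse_env_int("DFLASH27B_SFI_BUDGET")
    refresh = parse_env_int("DFLASH27B_FA_REFRESH_INTERVAL")
    print(f"budget={budget} refresh_interval={refresh}")
```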

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Extract decode-side SFI helpers into sfi_decode_utils.h (shared header)
  - resolve_attn_window_slice(): fast/slow window resolution
  - merge_sparse_index_sets(): sink+recent+selected merge with dedup
- qwen35_target_graph.cpp now uses the shared header (no logic change)
- New test_sfi_kernels binary: 37 tests covering core + edge cases
  - Window slice: q8/tq3 padding, refresh gating, boundary conditions
  - Sparse merge: dedup, OOB filtering, sink/recent overlap, empty sets
  - --bench flag: micro-benchmark showing 19.7x fewer indices at 128K
- New bench_sfi_ab.py: automated A/B comparison script
  - Runs baseline (no refresh) vs SFI (periodic refresh) back-to-back
  - Parses and compares score_s, gen_s, ttft, accuracy
  - 32K NIAH on 2080 Ti: 6.3% TTFT improvement, 8.2% decode speedup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
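
The sink+recent+selected merge with dedup and out-of-bounds filtering can be sketched in a few lines of Python. This is not the C++ helper from sfi_decode_utils.h: the function name, the parameter names, and the choice to return sorted positions are assumptions made for illustration.

```python
def merge_index_sets_sketch(n_kv, n_sink, n_recent, selected):
    """Merge sink, recent, and selected KV positions into one sorted, unique list.

    n_kv:     number of cached positions (valid indices are 0 .. n_kv - 1)
    n_sink:   how many leading positions to always keep
    n_recent: how many trailing positions to always keep
    selected: arbitrary candidate positions; out-of-range entries are dropped
    """
    merged = set(range(min(n_sink, n_kv)))
    merged.update(range(max(0, n_kv - n_recent), n_kv))
    merged.update(i for i in selected if 0 <= i < n_kv)
    return sorted(merged)


if __name__ == "__main__":
    # Duplicates (5) and out-of-range entries (42, -1) are filtered out.
    print(merge_index_sets_sketch(10, n_sink=2, n_recent=3, selected=[1, 5, 5, 8, 42, -1]))
```

Using a set keeps the sink/recent overlap case trivial: when the cache is shorter than `n_sink + n_recent`, overlapping positions simply collapse.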

- TargetCache: add sfi_selector (per-layer scores), sfi_selected (cached indices), sfi_budget (DFLASH27B_SFI_BUDGET env control)
- sfi_decode_utils.h: add update_selector_scores(), topk_from_scores(), compute_sfi_indices() (full paper-aligned selection pipeline)
- build_full_attn_block: accept optional sfi_gather_idx/sfi_gather_len for sparse K/V gather via ggml_get_rows on fast (non-refresh) steps
- build_single_layer: plumb SFI gather params through
- test_sfi_kernels: 49 tests (12 new selector/topk/integration tests)
- Micro-bench @ 128K: selection pipeline takes ~1.4ms, yields 2048 indices (1.6% of full context), 62x fewer tokens than dense

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
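
The top-k step of the selection pipeline can be illustrated with the standard library. This sketch is an assumption about the general shape of such a step, not a port of `topk_from_scores()`: the name, the tie-breaking behaviour, and the sorted return order are all chosen here for clarity.

```python
import heapq


def topk_from_scores_sketch(scores, k):
    """Return the positions of the k highest scores, in ascending position order."""
    if k <= 0:
        return []
    # nlargest over positions, ranked by their score.
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.7]
    print(topk_from_scores_sketch(scores, 2))
```

Returning positions in ascending order keeps the gathered K/V rows in their original sequence order, which is convenient when the result is merged with sink and recent positions.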

- QwenGraphInputs: add sfi_gather_idx/sfi_gather_len fields
- build_qwen35_graph: pass caller SFI tensor to all FA layers
- build_target_step: create sfi_gather_idx tensor when budget > 0, single-token decode, no mask
- sfi_fill_indices(): fill sparse index tensor from cache before compute
- sfi_decode_utils.h: add refresh_selector_heuristic() for bootstrap, parse_env_int() helper
- test_dflash.cpp: initialize SFI after prefill, fill indices before sequential-verify computes, refresh indices at refresh boundaries
- test_sfi_kernels: 54 tests (5 new heuristic coverage tests)

The SFI path now activates end-to-end when DFLASH27B_SFI_BUDGET > 0:

1. After prefill: bootstrap selector with heuristic scores
2. On fast steps: gather sparse K/V (budget tokens) instead of full windowed attention
3. At refresh boundaries: re-score and recompute indices

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
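
The three-phase lifecycle (bootstrap, fast steps, refresh) can be sketched as a small schedule simulation. Everything here is hypothetical scaffolding: `run_decode_schedule`, the `compute_indices` callback, and the rule that a refresh fires whenever the step number is a multiple of the interval are assumptions used to show when the cached indices are reused versus recomputed.

```python
def run_decode_schedule(n_steps, refresh_interval, compute_indices):
    """Simulate which decode steps reuse cached sparse indices and which recompute them.

    compute_indices(step) stands in for the real selection pipeline.
    Returns the list of (event, step) pairs and the last cached index set.
    """
    selected = compute_indices(0)  # 1. bootstrap once after prefill
    events = [("bootstrap", 0)]
    for step in range(1, n_steps + 1):
        if refresh_interval > 0 and step % refresh_interval == 0:
            # 3. refresh boundary: dense step, then re-score and recompute indices
            selected = compute_indices(step)
            events.append(("refresh", step))
        else:
            # 2. fast step: attention would gather only the `selected` positions
            events.append(("fast", step))
    return events, selected


if __name__ == "__main__":
    events, selected = run_decode_schedule(5, 2, lambda step: [step])
    print(events)
    print(selected)
```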

…full)

- Add --sfi-budget flag (sets DFLASH27B_SFI_BUDGET env for full SFI)
- Run 3 configs: no-refresh baseline, refresh-only, refresh + sparse gather
- Verified at 32K: all 3 produce correct NIAH, SFI path activates

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


cubic-dev-ai

2 issues found across 21 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/bench_sfi_ab.py">

<violation number="1" location="dflash/scripts/bench_sfi_ab.py:111">
P1: Benchmark subprocess failures are recorded but not enforced, so the script can report results and exit successfully after failed runs.</violation>
</file>

<file name="pflash/README.md">

<violation number="1" location="pflash/README.md:91">
P2: README weight-download step does not produce the GGUF draft artifact now required by updated commands.</violation>
</file>



cubic-dev-ai

2 issues found across 16 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pflash/pflash/dflash_client.py">

<violation number="1">
P1: compress and generate no longer detect daemon death and can silently return partial/empty outputs on subprocess failure.</violation>
</file>

<file name="pflash/tests/bench_niah_cpp.py">

<violation number="1">
P2: Hardcoded absolute defaults make the benchmark fail in a normal checkout unless every path is overridden.</violation>
</file>


howard0su · 2026-05-21T11:29:47Z

After we fixed the RoPE bug, this PR no longer shows a gain. Closing it.

howard0su force-pushed the sfi branch 2 times, most recently from 730fba9 to 2b8e0d9 Compare May 13, 2026 10:14

howard0su and others added 7 commits May 16, 2026 09:30

Implement fast-slow prefill controls

36d6ff8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Tighten fast-slow review fixes

455094d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use upstream ci.yml

f65dc22

howard0su force-pushed the sfi branch from 41d9040 to f65dc22 Compare May 16, 2026 01:59

Fix qwen35 SFI benchmark runtime

daa0849

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

howard0su changed the title ~~Fast-Slow inference~~ feat: Slow-Fast inference support May 16, 2026

howard0su marked this pull request as ready for review May 16, 2026 04:39

cubic-dev-ai Bot reviewed May 16, 2026


Comment thread dflash/scripts/bench_sfi_ab.py Outdated

Comment thread pflash/README.md Outdated

Trim unrelated SFI branch changes

1851617

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cubic-dev-ai Bot reviewed May 16, 2026


davide221 self-requested a review May 16, 2026 17:49

davide221 self-assigned this May 16, 2026

davide221 removed their request for review May 16, 2026 17:50

howard0su closed this May 21, 2026

howard0su wants to merge 9 commits into Luce-Org:main from howard0su:sfi

2 participants
