Skip to content

Latest commit

 

History

History
785 lines (629 loc) · 36.4 KB

File metadata and controls

785 lines (629 loc) · 36.4 KB

Luce DFlash benchmark results

Single RTX 3090 24 GB, CUDA 12, driver 535. Target: unsloth/Qwen3.5-27B-GGUF (Q4_K_M, ~16 GB). Draft: z-lab/Qwen3.5-27B-DFlash (BF16, 3.46 GB). Concurrency = 1, greedy decoding, n_gen=256. Reproduce with python3 scripts/bench_llm.py (samples 10 prompts/dataset, seed=42).

Headline — AR vs Luce DFlash at concurrency 1

Task AR tok/s DFlash tok/s AL Speedup
HumanEval 37.78 129.52 8.31 3.43×
Math500 37.71 110.51 7.04 2.93×
GSM8K 37.65 96.15 6.14 2.55×

AR = autoregressive target-only decode via test_generate. DFlash = block-diffusion draft + DDTree budget 22 verify + fast rollback. AL = mean committed tokens per draft/verify step (acceptance length).

Datasets pulled live via HuggingFace datasets:

  • HumanEval — openai_humaneval, prompt field
  • GSM8K — gsm8k main split, Question: … Answer: format
  • Math500 — HuggingFaceH4/MATH-500, Problem: … Solution: format

Per-prompt numbers (seed 42)

HumanEval (10 samples)

# n_tok AR DFlash AL
01 84 37.98 137.91 8.83
02 138 37.90 143.38 9.14
03 134 37.88 137.49 8.83
04 120 37.84 153.77 9.85
05 172 37.76 131.74 8.53
06 118 37.59 113.97 7.31
07 51 37.78 103.27 6.56
08 141 37.68 158.40 10.24
09 125 37.71 128.22 8.26
10 95 37.65 87.04 5.57
mean 37.78 129.52 8.31

Peak per-prompt: 158.40 tok/s at AL 10.24 (4.20× over AR on the same prompt).

GSM8K (10 samples)

# n_tok AR DFlash AL
01 45 37.62 93.87 5.95
02 111 37.53 90.59 5.82
03 49 37.73 87.79 5.57
04 70 37.67 82.11 5.22
05 102 37.62 127.83 8.26
06 118 37.61 88.67 5.69
07 113 37.62 86.86 5.57
08 50 37.72 102.98 6.56
09 43 37.69 109.66 6.92
10 96 37.72 91.12 5.82
mean 37.65 96.15 6.14

Math500 (10 samples)

# n_tok AR DFlash AL
01 257 37.60 100.97 6.56
02 53 37.73 115.62 7.31
03 40 37.76 126.47 8.00
04 50 37.76 118.20 7.53
05 117 37.69 114.55 7.31
06 76 37.70 108.63 6.92
07 43 37.72 90.41 5.69
08 79 37.73 100.10 6.40
09 52 37.69 91.69 5.82
10 57 37.74 138.45 8.83
mean 37.71 110.51 7.04

Why the speedup varies by task

Acceptance length is the dominant factor — tok/s is roughly linear in AL when per-step overhead is fixed:

Task AL Speedup vs AR
HumanEval 8.31 3.43×
Math500 7.04 2.93×
GSM8K 6.14 2.55×

HumanEval prompts are highly regular (function signatures + docstrings), the draft nails consecutive tokens. GSM8K is natural-language arithmetic reasoning, the draft is less confident, tree verify rescues less.

128K context configuration

max_ctx = 131072 + DFLASH27B_KV_Q4=1 (Q4_0 K+V cache, 8× compression vs F16). Sliding target_feat ring (4096 slots) keeps captured features at 0.2 GB regardless of context length. --ddtree-budget=16 keeps per-layer ssm_intermediate under 1.3 GB.

Prompt length KV size Prefill Decode tok/s
520 (HE) ~35 MB 0.06 s 130
13K ~860 MB 15 s 99
32K ~2.1 GB 106 s 35
128K ~8.4 GB ~10 min ~15-20 (est)

Q4_0 KV costs ~3% mean tok/s vs F16 at short contexts and is the only thing that lets 128K allocate at all.

DDTree budget sweep (HumanEval, n_gen=256, f16 intermediate)

Historical tuning run from commit f1cb9bf (2026-04-16). Used to pick the default budget=22. Fresh run at budget=22 on commit 5bb7f8c is the 129.5 tok/s / AL 8.31 reported in the headline above; the ~5 tok/s delta vs the 135.8 row here comes from sample variance across the 10 prompts and from minor build-flag drift between the two commits.

Budget Mean AL Mean tok/s
15 7.64 125.3
16 7.81 128.7
18 8.22 131.2
20 8.64 133.9
22 8.88 135.8
24 8.91 133.0
30 8.86 120.5
40 8.90 105.1

AL plateaus at ~8.9, past budget 22 each extra node costs more in verify time than it buys in accept. Memory ceiling at budget 26 on 24 GB (per-token SSM intermediate cache is hybrid-only overhead).

Kernel-level wins (cumulative, chain mode → DDTree budget 22 + f16)

Starting point: Chain DFlash at 112.8 tok/s mean on HumanEval, AL 7.67.

Optimization Δ tok/s Δ AL Note
DDTree budget 20, f32 intermediate +15.1 +0.77 Heap-based best-first tree, 20 nodes
Chain pre-seed in build_ddtree +~5 Fixes top-1 chain coverage under Q4 noise (prior AL ~4)
Tree-aware ggml_ssm_conv_tree kernel +~1 Sibling conv window gathers via parent chain, not DFS
target_feat compaction after sibling-accept +~0.8 Stale feature pruning
OpenMP-parallel CPU top-K, K reduced 32→8 +2.1 Shaves 7% off draft step
Fast K=1 path for budget=15 +1.5 Skips 11 ms CPU top-K when no siblings needed
D2D cudaMemcpyAsync for target_feat (GPU→GPU) +3.7 Replaces GPU→CPU→GPU round trip
ggml_gated_delta_net_tree_persist kernel +12.4 Direct-writes SSM intermediates, skips 9 ms ggml_cpy per step
Budget 20 → 22, f16 intermediate +5.5 +0.24 f16 cuts intermediate bandwidth in half
Total +16.7 +0.64 129.5 tok/s, AL 8.31 (HumanEval mean, fresh run)

Reproducibility

  • Deterministic: greedy decode + greedy verify. Same prompts + same weights + same binary = same numbers ±1 tok/s.
  • Full bench (10×3 = 30 prompts): ~15 min.
  • All numbers above reproduced on 2026-04-20 from commit 5bb7f8c with:
    python3 scripts/bench_llm.py
    

Hardware ceiling notes

  • Published DFlash paper on Qwen3-4B/8B/30B-MoE (pure attention, BF16, B200) reports 4-5× over AR on HumanEval/Math500 at concurrency 1. Ours: 3.43× on 27B hybrid Q4_K_M on RTX 3090.
  • Memory ceiling: per-token SSM intermediate cache (hybrid-only cost) caps tree budget at ~26 on 24 GB. The paper uses budgets up to 1024 on pure-attention models with zero per-node memory tax.
  • Per-token verify cost drops from 25 ms at N=1 to 0.97 ms at N=128 (ggml-cuda Q4_K matmul amortises well with batch size).

RTX 2080 Ti (Turing, sm_75, 22 GB)

Single RTX 2080 Ti 22 GB, CUDA 12.4. Same target/draft as above. BF16 draft weights auto-converted to FP16 at load time (cuBLAS BF16 GEMM has no tensor core acceleration on SM 7.5; FP16 conversion gives 3.9× faster draft compute via Turing tensor cores).

Build: cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=75

RTX 2080 Ti headline

Task AR tok/s DFlash tok/s AL Speedup
HumanEval 19.88 53.42 8.14 2.69×
Math500 19.67 49.01 7.30 2.49×
GSM8K 19.49 43.55 6.53 2.23×

RTX 2080 Ti per-prompt — HumanEval (10 samples)

# n_tok AR DFlash AL
01 84 19.88 58.69 8.83
02 138 19.43 45.47 9.14
03 134 19.60 62.67 9.14
04 120 20.16 63.42 9.14
05 172 19.74 56.89 8.53
06 118 20.20 44.32 6.40
07 51 20.26 54.14 8.00
08 141 19.70 40.34 5.95
09 125 19.91 70.70 10.67
10 95 19.88 37.56 5.57
mean 19.88 53.42 8.14

Peak per-prompt: 70.70 tok/s at AL 10.67 (3.55× over AR on the same prompt).

RTX 2080 Ti per-prompt — GSM8K (10 samples)

# n_tok AR DFlash AL
01 45 19.24 39.54 5.82
02 111 19.70 39.49 5.82
03 49 19.33 57.01 8.53
04 70 19.70 38.35 5.69
05 102 19.67 36.77 5.45
06 118 19.39 40.45 5.95
07 113 19.55 54.02 8.46
08 50 18.92 42.16 6.51
09 43 19.68 48.07 7.11
10 96 19.72 39.63 5.95
mean 19.49 43.55 6.53

RTX 2080 Ti per-prompt — Math500 (10 samples)

# n_tok AR DFlash AL
01 257 19.70 42.64 6.40
02 53 19.80 49.53 7.31
03 40 19.96 52.76 8.00
04 50 19.49 62.08 9.48
05 117 17.85 43.69 6.56
06 76 19.87 45.42 6.74
07 43 20.05 42.57 6.40
08 79 19.42 51.86 7.76
09 52 20.02 39.34 5.82
10 57 20.53 60.18 8.53
mean 19.67 49.01 7.30

RTX 2080 Ti vs RTX 3090 comparison

Metric RTX 3090 RTX 2080 Ti Ratio
AR tok/s (HE) 37.78 19.88 0.53×
DFlash tok/s (HE) 129.52 53.42 0.41×
Mem BW 936 GB/s 616 GB/s 0.66×
SMs 82 68 0.83×
VRAM 24 GB 22 GB 0.92×

AR scaling (~0.53×) tracks bandwidth × SM count. DFlash scaling (~0.41×) is lower because the draft compute bottleneck is proportionally larger on a slower GPU, even after the BF16→FP16 fix. Acceptance length is identical (same draft model, same tokens), confirming the FP16 conversion is numerically faithful.

RTX 5090 (Blackwell, sm_120/sm_120a, 32 GB)

Single RTX 5090 32 GB, CUDA 13.0.88, driver 595.58.03. Target: unsloth/Qwen3.6-27B-GGUF (Qwen3.6-27B-UD-Q5_K_XL.gguf, ~19 GB). Draft: local Qwen3.6-27B DFlash safetensors (model.safetensors, ~3.3 GB). Concurrency = 1, greedy decoding, n_gen=256.

Build: cmake -B build-luce-sm120 -S . -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120 -DDFLASH27B_USER_CUDA_ARCHITECTURES=120 -DDFLASH27B_ENABLE_BSA=ON Runtime: FP16/FP16 KV, FA window 4096, DDTree budget 22.

These numbers use a newer Qwen3.6 Q5_K_XL target, so they are not an apples-to-apples hardware comparison with the RTX 3090 Qwen3.5 Q4_K_M run above.

RTX 5090 headline

Task AR tok/s DFlash tok/s AL Speedup
HumanEval 58.25 218.23 7.12 3.75×
Math500 57.57 219.06 7.31 3.80×
GSM8K 58.39 179.07 5.88 3.07×

RTX 5090 per-prompt — HumanEval (10 samples)

# n_tok AR DFlash AL
01 84 59.13 225.95 7.31
02 138 57.28 207.92 6.74
03 134 56.82 238.37 7.76
04 120 58.92 332.94 11.13
05 172 58.93 237.91 7.76
06 118 58.25 147.64 4.74
07 51 58.24 197.95 6.40
08 141 58.21 196.54 6.40
09 125 58.39 243.31 8.00
10 95 58.31 153.74 4.92
mean 58.25 218.23 7.12

Peak per-prompt: 332.94 tok/s at AL 11.13 (5.65× over AR on the same prompt).

RTX 5090 per-prompt — GSM8K (10 samples)

# n_tok AR DFlash AL
01 45 58.62 188.94 6.10
02 111 58.94 202.09 6.56
03 49 58.88 211.92 7.06
04 70 58.14 153.83 4.92
05 102 58.12 160.01 5.12
06 118 58.19 187.03 6.10
07 113 58.26 166.77 5.80
08 50 58.37 217.12 7.11
09 43 58.20 86.08 2.93
10 96 58.13 216.96 7.11
mean 58.39 179.07 5.88

RTX 5090 per-prompt — Math500 (10 samples)

# n_tok AR DFlash AL
01 257 58.15 214.23 7.11
02 53 58.23 197.71 6.40
03 40 58.80 232.61 7.53
04 50 59.00 191.89 6.24
05 117 58.27 273.97 9.14
06 76 55.89 195.37 6.74
07 43 56.89 250.16 8.26
08 79 57.26 224.60 7.53
09 52 56.56 170.18 6.12
10 57 56.68 239.89 8.00
mean 57.57 219.06 7.31

RTX 5090 DDTree budget sweep

Fast HumanEval sweep, 10 prompts, n_gen=128, same target/draft, FP16/FP16 KV, FA window 4096.

Budget Mean AL Mean tok/s
15 4.99 174.45
16 5.76 176.98
18 6.93 206.62
20 6.94 204.03
22 7.25 211.20
24 7.19 203.08
26 7.09 199.96
30 7.44 206.19
32 6.87 183.34
40 6.97 174.52
48 7.07 165.24
64 7.14 148.12

Budget 12 failed all prompts with a ggml shape assertion. Budget 22 remains the best short-context throughput default on this 5090 build. Budget 30 produced the highest mean AL but lower throughput, so it is a quality-biased experiment rather than the base setting.

RTX 5090 — Q4_K_M, DDTree budget 40, no-thinking (community)

Single RTX 5090 32 GB GDDR7 (sm_120, 1792 GB/s), CUDA 12.8, Windows 11. Target: unsloth/Qwen3.6-27B-GGUF (Qwen3.6-27B-Q4_K_M.gguf, ~15.7 GB). Draft: z-lab/Qwen3.6-27B-DFlash safetensors (~3.2 GB). Concurrency = 1, greedy decoding, n_gen=256 (HumanEval/GSM8K), n_gen=2048 (Math500).

Build: MSVC 14.41, cmake -DCMAKE_CUDA_ARCHITECTURES=120 -DBUILD_SHARED_LIBS=OFF, BSA enabled. Runtime: default KV (auto), DDTree budget 40, --no-thinking (chat template with enable_thinking=False).

Q4_K_M is a direct quant match with the RTX 3090 headline numbers above. Budget 40 was swept as optimal for the 5090's 170 SMs + 1792 GB/s bandwidth (vs budget 22 on the 3090's 82 SMs + 936 GB/s). Thinking is disabled because the DFlash drafter was trained on Qwen3.5 output with thinking disabled; <think> tokens tank acceptance rate.

RTX 5090 Q4_K_M headline

Task AR tok/s DFlash tok/s AL Speedup
HumanEval 42.34 205.02 12.93 4.84×
Math500 42.50 182.68 9.70 4.30×
GSM8K 42.65 153.08 8.20 3.59×

Math500 score: 10/10 (using \boxed{} extraction).

RTX 5090 Q4_K_M per-prompt — HumanEval (10 samples)

# n_tok AR DFlash AL
01 94 41.74 211.11 15.17
02 148 42.09 177.62 9.48
03 144 42.13 247.98 15.80
04 130 41.42 209.79 14.00
05 182 42.64 197.49 10.67
06 128 42.44 226.57 14.40
07 61 42.93 208.07 14.33
08 151 43.05 195.22 10.24
09 135 42.59 167.76 9.16
10 105 42.39 208.57 16.00
mean 42.34 205.02 12.93

Peak per-prompt: 247.98 tok/s at AL 15.80 (5.89× over AR).

RTX 5090 Q4_K_M per-prompt — GSM8K (10 samples)

# n_tok AR DFlash AL
01 56 42.58 176.79 9.14
02 122 42.43 139.01 7.11
03 60 42.87 150.98 8.43
04 81 42.76 187.87 10.24
05 113 42.83 116.41 6.16
06 129 42.55 132.95 6.92
07 124 42.39 176.73 10.11
08 61 42.77 151.45 8.00
09 54 42.76 143.34 7.90
10 107 42.54 155.26 8.00
mean 42.65 153.08 8.20

RTX 5090 Q4_K_M per-prompt — Math500 (10 samples)

# n_tok AR DFlash AL
01 276 42.42 141.11 7.64
02 72 42.56 183.87 9.41
03 59 42.83 194.73 10.66
04 69 41.75 178.42 9.89
05 136 42.86 206.66 11.00
06 95 42.29 150.41 8.13
07 62 42.39 168.75 8.71
08 98 42.74 203.99 10.75
09 71 42.37 203.77 10.64
10 76 42.79 195.07 10.12
mean 42.50 182.68 9.70

RTX 5090 Q4_K_M — 3090 vs 5090 comparison

Both runs use Q4_K_M target, same bench_llm.py methodology (10 samples per task, seed=42 shuffle, greedy decoding). Budget tuned per-GPU.

Metric 3090 (budget=22) 5090 (budget=40) Ratio
AR tok/s (HE) 37.78 42.34 1.12×
DFlash tok/s (HE) 129.52 205.02 1.58×
AL (HE) 8.31 12.93 1.56×
AR tok/s (Math500) 37.71 42.50 1.13×
DFlash tok/s (Math500) 110.51 182.68 1.65×
AR tok/s (GSM8K) 37.65 42.65 1.13×
DFlash tok/s (GSM8K) 96.15 153.08 1.59×
Mem BW 936 GB/s 1792 GB/s 1.91×
SMs 82 170 2.07×
VRAM 24 GB 32 GB 1.33×

AR scaling (~1.13×) is modest — the larger model already saturates bandwidth on the 3090. DFlash scaling (~1.6×) is super-linear relative to AR because higher bandwidth lets more speculative tokens survive per verify step.

The AL difference (12.93 vs 8.31 on HumanEval) reflects two factors: budget 40 vs 22 (wider tree captures more tokens per step), and --no-thinking on the 5090 run (drafter predicts non-thinking output much better, since it was trained on Qwen3.5 output with thinking disabled). The 3090 run also used Qwen3.5 with thinking disabled (server default), so the comparison is fair: both runs generate non-thinking output.

RTX 5090 Q4_K_M DDTree budget sweep

5 HumanEval prompts, n_gen=256, --no-thinking, same Q4_K_M target/draft as the headline numbers above.

Budget Mean AL Mean tok/s
22 9.01 189.18
28 9.20 186.87
32 9.75 194.95
36 9.75 193.70
40 10.07 197.10
48 10.07 183.99

Budget 40 peaks at 197.10 tok/s. AL saturates at 10.07 between budget 40 and 48, but 48 regresses on throughput (verify cost outpaces accept gain). Budget 28 also regresses vs 22 — the useful jump is 22→32. The 32–36–40 plateau is within ~2% (run-to-run noise); 40 is chosen because it has the highest mean and AL is at saturation.

Laguna-XS.2 target on RTX 3090 (Poolside MoE, Q4_K_M)

Hand-rolled CUDA forward (Path A, ggml-only) for the 40-layer / 256-expert Laguna-XS.2 target. Loader pins 678 tensors at 18.77 GiB on GPU + 110 MiB tok_embd CPU-only, leaving room for KV cache and PFlash drafter activations in 24 GB. Drafter for compression: Qwen3-0.6B BF16 GGUF (Qwen tokenizer cross-mapped to Laguna BPE).

Dense TTFT (no PFlash compression, full chunked prefill)

Measured with bench_laguna_ttft, DFLASH_KV_TYPE=q4_0 for ctx > 32K, default chunk=4096 except where noted (smaller chunks needed at long ctx to keep the activation alloc inside 24 GB):

Context KV chunk TTFT (s) tok/s
4 096 Q8_0 4096 1.04 3 932
16 384 Q8_0 4096 5.71 2 867
32 768 Q4_0 2048 19.26 1 701
65 536 Q4_0 1024 53.17 1 233
65 536 Q4_0 2048 51.33 1 277

65K @ chunk=4096 OOMs on a 24 GB 3090 because the F32 mask alone needs ~1 GB; use chunk=1024 or 2048 for ctx > 32K. PFlash compression (below) bypasses the problem by feeding the target a much smaller compressed prompt.

NIAH single-needle retrieval with PFlash compression (depth=0.5)

scripts/laguna_pflash_niah.py orchestrates haystack → drafter compress → cross-tokenizer round-trip with word-boundary recovery → Laguna prefill → decode → grep needle. The drafter scores Qwen3 token chunks; the word-boundary helper expands each kept run outward to whitespace before re-tokenizing as Laguna IDs, so multi-token needles like BLUEHORIZON-7421 survive aggressive keep ratios.

Context KV keep drafter (s) target prefill (s) end-to-end TTFT NIAH
4 096 Q8_0 0.10 1.54 0.39 1.92 s
16 384 Q8_0 0.10 1.27 0.51 1.78 s
16 384 Q8_0 0.20 1.20 0.91 2.11 s
32 768 Q4_0 0.10 2.08 0.91 2.99 s
32 768 Q4_0 0.20 2.06 1.97 4.03 s ❌ (synthetic-NIAH variance; keep=0.10 PASS)
65 536 Q4_0 0.10 ~5 ~6 ~11 s
65 536 Q4_0 0.20 ~5 ~8 ~13 s
65 536 Q4_0 0.30 ~5 ~10 ~15 s
65 536 Q4_0 0.50 ~5 ~17 ~22 s
131 072 Q4_0 0.10 11.11 4.79 15.91 s
131 072 Q4_0 0.20 11.20 13.55 24.75 s
131 072 Q4_0 0.30 11.41 26.43 37.84 s

Decode is autoregressive (~96 tok/s @ ctx=4K, ~27 tok/s @ ctx=131K) until a matched Laguna spec-decode draft model is published; the dflash daemon's draft-loaded path is reserved for that future drop-in.

Sampler smoke (test_laguna_daemon, prompt = "Tell me a one-line haiku about clouds.")

samp= tail first 90 chars of decode
(none, greedy) Fluffy white giants / Sail through the sky on gentle / Wings of summer breeze
2.0,1.0,0,1.0,42 requires_blog_proxygps … setUser dirs feedbackUse thin covsyl Banks/mythtv MITMially beac
2.0,1.0,0,1.0,43 Phantom ships sail cre ways—.permissions['agrant\ paramount Then never Streaming Home>s`
1.0,0.5,0,1.0,99 (top_p) Clouds drift like cotton dreams floating through the sky.

Four distinct outputs from the same prompt confirms the rep_penalty → top_k → softmax(temp) → top_p → draw chain is wired correctly end to end.

Speedup at 128K context (RTX 3090, Q4_K_M target, Q4_0 KV, FA on)

Path Time @ 131072 tok/s Notes
llama.cpp pp131072 (vendored fork) 86.60 s 1513.4 llama-bench -p 131072 -n 0 -ctk q4_0 -ctv q4_0 -fa 1 -t 8 -r 1 -ngl 99
dflash + PFlash keep=0.10 (end-to-end) 15.91 s 8 240 drafter compress 11.11s + target prefill 4.79s
dflash target prefill only 4.79 s 27 364 effective on the 131 072-token original prompt (target processes 13 120 compressed)

Headline: dflash PFlash gives 5.4× faster TTFT than llama.cpp at 131K context end-to-end on a 24 GB RTX 3090. The target prefill alone runs 18.1× faster because PFlash compression has reduced the effective input length from 131 072 tokens to 13 120 (10× token-count drop). The drafter adds 11 s of fixed overhead which dominates at long context but folds below 1× of the target prefill cost as the haystack shrinks (4K @ keep=0.10 spends 1.5 s drafter + 0.4 s target, net 1.92 s vs llama.cpp 1.7 s).

Reproducing the 11.11 s drafter number requires Block-Sparse Attention on the Qwen3-0.6B drafter forward. The PflashDaemon Python wrapper sets these env vars by default; the dflash daemon honours them at runtime but does not force them, so any caller (including dflash_server) is free to opt out:

export DFLASH_FP_USE_BSA=1     # mit-han-lab BSA, FA-2 derived (sm_80+)
export DFLASH_FP_ALPHA=0.85    # importance-score temperature

Without BSA the drafter falls back to dense attention and the 131 K drafter forward becomes multi-minute (O(N²) work). At ctx ≤ 8 K the fallback is fast enough that BSA is optional.

Parity vs HF reference (deferred)

scripts/parity_laguna.py runs identical token IDs through the dflash daemon and a Hugging Face LagunaForCausalLM reference loaded from poolside/Laguna-XS.2, then reports last-position argmax agreement at 4K—128K context. Cannot be run on a single 24 GB GPU because the BF16 reference weighs ~37 GiB; pin to an A6000 / H100 (or use CPU offload) to produce the cos-sim numbers. The dflash forward itself was originally verified to match llama.cpp build_laguna for 30+ tokens during the scaffold-time bring-up (see Lucebox/Laguna-XS.2-GGUF README). The NIAH table above functions as the in-PR functional sanity check at 4K–131K.

RTX 5090 (Blackwell, sm_120/sm_120a, 32 GB) — long-context NIAH

Companion to the short-context RTX 5090 section (HumanEval / Math500 / GSM8K, added in #86). That section validates speculative decoding on short prompts where PFlash compression is not engaged; this one validates the full PFlash drafter scoring + ~20× compression + DFlash decode pipeline at 117K tokens.

Single RTX 5090 32 GB, CUDA 13.2, driver 595.58. Target: unsloth/Qwen3.6-27B-GGUF (Qwen3.6-27B-Q4_K_M.gguf, ~16.8 GB). Q4_K_M (vs Q5_K_XL in the short-context section above) leaves more VRAM headroom for the FP16 KV cache at 117K context. Draft: local Qwen3.6-27B DFlash safetensors + Qwen3-0.6B-BF16 PFlash drafter.

Build: cmake -B build-luce-sm120 -S . -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120 -DDFLASH27B_USER_CUDA_ARCHITECTURES=120 -DDFLASH27B_ENABLE_BSA=ON

Final ("V4") runtime config — driven via the optimizations/pflash/tests/bench_niah_cpp.py CLI flags added in #90 plus daemon env vars (each bullet leads with the exact interface):

  • --keep-ratio=0.05 (PFlash compression target ratio)
  • DFLASH_FP_USE_BSA=1 and DFLASH_FP_ALPHA=0.70 (BSA enabled, block-selection threshold; both are daemon env vars)
  • --ddtree-budget=22
  • --fa-window=4096 (also settable via DFLASH27B_FA_WINDOW=4096)
  • --kv-tq3=0 (Q8_0 KV cache — the daemon default when TQ3_0 is disabled and no other KV type is set; 5090 has VRAM headroom so TQ3_0 isn't needed)
  • --n-gen=1024

Test set: 10 NIAH prompts at 117K tokens (margin under Qwen3.6-27B's 131K native RoPE limit, generated with optimizations/pflash/tests/niah_gen.py at calibrated char_per_tok).

RTX 5090 long-ctx headline

Metric Value
NIAH accuracy 20/20 across 2 runs of n=10
Decode throughput 210.7 tok/s avg (range 179–230)
TTFT 10.0 s
Compression 20.2× (117064 → 5800 tokens)
Prefill (compressed) 3.9 s for ~5800 tokens
Drafter score+migrate ~5.8 s

These headline numbers are the Phase 4 reliability run at the V4 config above (n=20 across 2 independent runs of 10 prompts each). The three exploratory sweeps below — alpha, then budget, then keep — are what selected the V4 config; each table holds the non-swept parameters at the values discovered in the prior phase, so the swept-axis throughput numbers are not directly comparable to the headline (different keep ratios produce different per-step decode rates, see the keep-ratio table below).

Phase 1 — DFLASH_FP_ALPHA sweep (held: --keep-ratio=0.08, --ddtree-budget=28)

DFLASH_FP_ALPHA NIAH Decode tok/s
0.60 10/10 213.7
0.70 10/10 210.6
0.85 8/10 204.6

The docs default of DFLASH_FP_ALPHA=0.85 fails 2/10 prompts at this setup. This may be specific to long context, Qwen3.6, or Blackwell — I have not isolated which. Validating alpha per setup is recommended. I chose 0.70 over 0.60 for reliability margin: 0.60 wins decode by only 1.5%, below the run-to-run variance, on an n=10 sample.

Phase 2 — budget sweep (held: DFLASH_FP_ALPHA=0.70, --keep-ratio=0.08)

--ddtree-budget NIAH Decode tok/s
22 10/10 217.4
28 10/10 210.7
30 10/10 211.1

#86's short-context budget sweep above on the same 5090 build also lands on budget=22 as throughput-optimal (211.20 mean tok/s at AL 7.25). So budget=22 is a stable default for Qwen3.6-27B on Blackwell across context regimes, not a knob that needs per-context-length tuning. This is the most useful cross-reference between the two sections.

Phase 3 — keep-ratio sweep (held: DFLASH_FP_ALPHA=0.70, --ddtree-budget=22)

--keep-ratio NIAH Decode tok/s TTFT Compression
0.05 10/10 210.4 10.0 s 20.2×
0.06 10/10 212.1 10.5 s 16.8×
0.08 10/10 216.5 13.0 s 12.6×

--keep-ratio=0.08 wins per-token throughput by ~3% but pays 30% more TTFT and gives up 38% of the compression. For the 117K NIAH workload I chose 0.05 to optimize end-to-end response latency; 0.08 is preferable when sustained throughput on already-compressed prompts dominates.

Note on --kv-tq3

I set --kv-tq3=0, which leaves the daemon at its Q8_0 KV-cache default (no DFLASH27B_KV_K/DFLASH27B_KV_V overrides). The 3-bit TQ3_0 cache trades VRAM for memory bandwidth; on a 5090 with 32 GB and ~22 GB peak usage at 117K, that trade isn't worth taking. Users on 4090 or 3090 (24 GB) at this context length should likely keep --kv-tq3=1. To go further than Q8_0 in either direction set the K/V types explicitly via DFLASH27B_KV_K=<type> DFLASH27B_KV_V=<type>.

RTX 4090 (Ada, sm_89, 24 GB) — WSL2 (community)

Single RTX 4090 24 GB, CUDA 13.2, driver 596.21, WSL2 (Ubuntu) on Windows 11. i7-13700K, 64 GB host RAM (32 GB WSL allocation). Build: cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=89 -DDFLASH27B_ENABLE_BSA=ON Models on native ext4 (/home/), not NTFS /mnt/ (9P filesystem bottlenecks model loading).

Qwen3.5-27B Q4_K_M — RTX 4090 headline

Target: unsloth/Qwen3.5-27B-GGUF (Qwen3.5-27B-Q4_K_M.gguf, ~16 GB). Draft: spiritbuun/Qwen3.5-27B-DFlash-GGUF (dflash-draft-q4_k_m.gguf, 986 MB). Concurrency = 1, greedy decoding, n_gen=256, bench_he.py.

Task AR tok/s DFlash tok/s AL Speedup
HumanEval 32.20 125.37 7.77 3.89×

AR = test_generate, DFlash = DDTree budget 28 + fast rollback.

Qwen3.5-27B DDTree budget sweep (HumanEval, n_gen=256)

Budget Mean AL Mean tok/s
22 7.45 122.95
26 7.60 123.68
28 7.77 125.37
30 7.62 123.12
34 7.28 111.76

Budget 28 is optimal on 4090 (vs 22 on 3090). The 4090's 72 MB L2 cache (vs 3090's 6 MB) lets the DDTree verification working set stay in cache, enabling larger trees without DRAM bandwidth penalties.

Qwen3.6-27B Q4_K_M — RTX 4090 headline

Target: unsloth/Qwen3.6-27B-GGUF (Qwen3.6-27B-Q4_K_M.gguf, ~16 GB). Draft: Lucebox/Qwen3.6-27B-DFlash-GGUF (dflash-draft-3.6-q8_0.gguf, 1.8 GB). Concurrency = 1, greedy decoding, n_gen=256, bench_he.py.

Task AR tok/s DFlash tok/s AL Speedup
HumanEval 33.74 84.55 5.32 2.51×

AR = test_generate, DFlash = DDTree budget 36 + fast rollback.

Qwen3.6-27B DDTree budget sweep (HumanEval, n_gen=256)

Budget Mean AL Mean tok/s
22 4.85 79.65
26 5.00 82.37
28 4.99 82.46
30 5.06 79.82
34 5.17 83.00
35 5.20 83.12
36 5.32 84.55
37 5.32 84.30
38 5.32 82.80

Budget 36 is optimal for Qwen3.6 on 4090. AL saturates at 5.32 from budget 36+. The lower AL vs Qwen3.5 (5.32 vs 7.77) reflects the Qwen3.6 drafter still being under training per the HuggingFace model card.

Qwen3.6-27B per-prompt breakdown (budget=36)

# prompt steps AL tok/s
01 has_close_elements 31 8.26 132.36
02 separate_paren_groups 46 5.57 91.10
03 truncate_number 0 0.00 0.00
04 below_zero 46 5.57 90.15
05 mean_absolute_deviation 37 6.92 109.82
06 intersperse 40 6.40 101.67
07 parse_nested_parens 32 8.00 127.26
08 filter_by_substring 43 5.95 91.74
09 sum_product 0 0.00 0.00
10 rolling_max 39 6.56 105.53
mean 5.32 84.96

Prompts 03 (truncate_number) and 09 (sum_product) emit EOS immediately (0 generated tokens). Excluding those: 106.2 tok/s mean across 8 active prompts.

RTX 4090 vs RTX 3090 comparison (Qwen3.5-27B Q4_K_M)

Metric 3090 (budget=22) 4090 WSL2 (budget=28) Ratio
AR tok/s (HE) 37.78 32.20 0.85×
DFlash tok/s (HE) 129.52 125.37 0.97×
AL (HE) 8.31 7.77 0.94×
Mem BW 936 GB/s 1008 GB/s 1.08×
SMs 82 128 1.56×
L2 Cache 6 MB 72 MB 12×
VRAM 24 GB 24 GB 1.0×

AR is 15% slower on 4090 WSL2 — likely WSL2 overhead (pin_memory=False, 9P bridge latency). DFlash is within 3% because the larger L2 cache compensates. Bare metal Linux 4090 should match or exceed the 3090 numbers.

WSL2 notes

  • Always use native ext4 (/home/) for model files. NTFS mounts (/mnt/) via WSL2's 9P filesystem showed 9% CPU utilization and 337K voluntary context switches during model loading (vs 85%+ on ext4).
  • WSL2 forces pin_memory=False, adding ~3–5% decode overhead vs bare metal.
  • Building CUDA binaries under WSL2 works correctly with -DCMAKE_CUDA_ARCHITECTURES=89. CUDA toolkit path must be in $PATH (/usr/local/cuda-13.2/bin).

Server-mode reference (not apples-to-apples with bench_he.py)

Running via dflash_server (OpenAI-compatible HTTP) with TQ3 KV cache and 128K context:

DFLASH27B_KV_TQ3=1 ./build/dflash_server Qwen3.6-27B-Q4_K_M.gguf \
  --draft dflash-draft-3.6-q8_0.gguf \
  --port 8082 --ddtree --ddtree-budget 28 --max-ctx 131072

54.7 tok/s at C=1 with 150 mixed chat prompts (short/medium/long, SSE streaming). Server-mode is ~35% slower than direct binary due to HTTP/JSON/SSE overhead, Python FastAPI GIL, and mixed prompt distribution (vs uniform HumanEval code stubs). Quality: 7/7 on 10 complex queries (math, code, reasoning, knowledge, creative).