Historical per-wave module log. "Wave" numbers and the old
vNN.0.0scheme are the internal development cadence; the current public release is v9.34.2 (see CHANGELOG.md). For waves 1–28, see the historical record below.
| Wave | Version | Theme | Key Files |
|---|---|---|---|
| 85 | 58.0.0 | CLI color dedup + README accuracy | cli.py, server.py, api/v1_router.py |
| 86 | 59.0.0 | Observability: ProductionProfiler + squish trace |
hardware/production_profiler.py, serving/obs_report.py, cli.py |
| 87 | 60.0.0 | VSCode/Web UI agent tool execution fix | serving/tool_calling.py, agent/tool_name_map.py, squishClient.ts |
| 88 | 61.0.0 | Ollama gaps + LocalAI + squish compat |
serving/ollama_compat.py, serving/localai_compat.py, serving/backend_router.py |
| 89 | 62.0.0 | Local model scanner + squish pull URI schemes |
serving/local_model_scanner.py, cli.py |
| 90 | 63.0.0 | Lean startup profiler + server.py decomposition | serving/startup_profiler.py, serving/feature_state.py, serving/blazing.py |
| 91 | 64.0.0 | Sub-3s TTFT (blazing default) + 70B loader | server.py, cli.py, catalog.py, serving/blazing.py |
| 92 | 65.0.0 | Pre-compress pipeline + HF batch upload | catalog.py, dev/scripts/upload_to_hub.py, .github/workflows/model_upload.yml |
| 93 | 66.0.0 | macOS SquishBar: model picker, progress, hotkey | apps/macos/SquishBar/Sources/SquishBar/SquishEngine.swift, SquishMenuView.swift, Makefile |
| 94 | 67.0.0 | Cross-platform support review | platform/detector.py, platform/platform_router.py, cli.py, README.md |
| 95 | 68.0.0 | README final audit + public release | README.md, MODULES.md, cli.py, squish/__init__.py |
These waves implemented the EU AI Act, NIST AI RMF, and enterprise compliance features. That code now lives in the standalone konjoai/squash repository (pip install squash-ai). It is no longer part of the squish package.
Consolidated three duplicate terminal palette implementations into a single
squish/_term.py source of truth. cli.py and server.py now import from
_term instead of carrying their own copies. Fixed hardcoded localhost:11434
port in api/v1_router.py default URL.
Key changes:
squish/cli.py: removed local_C/_CTerminalclasses; import fromsquish._termsquish/server.py: removed duplicate_gradient(),_LOGO_GRAD, local_Csquish/api/v1_router.py: defaultserver_urlreadsSQUISH_SERVER_URLenv orlocalhost:11435
Wired trace_span into hot paths and instantiated ProductionProfiler at
server start. Added GET /v1/obs-report endpoint and squish trace CLI
command with remediation hints.
New file: squish/serving/obs_report.py — detect_bottlenecks(),
generate_report(), _REMEDIATION_HINTS dict.
Fixed truncated <tool_call> tag parsing (Strategy 0.5 added before existing
strategies), normalized VSCode tool names via agent/tool_name_map.py, fixed
30-second timeout in _toolRunTerminal, and added agent mode toggle to Web UI.
New file: squish/agent/tool_name_map.py — VSCODE_TO_BACKEND dict,
normalize_for_backend(), normalize_for_client().
Implemented /api/pull streaming, /api/ps, /api/version (dynamic), and
other previously-stubbed Ollama endpoints. Added LocalAI compatibility routes
(GET /, GET /v1/version, GET /readyz). Added squish compat command
printing client configuration snippets.
New files: squish/serving/localai_compat.py, squish/serving/backend_router.py
LocalModelScanner scans Squish, Ollama, and LM Studio model directories.
squish models shows an "External models detected" section. squish pull
accepts ollama: and hf: URI prefixes. squish import added as new command.
New file: squish/serving/local_model_scanner.py — LocalModel dataclass,
scan_squish(), scan_ollama(), scan_lm_studio(), find_all().
StartupTimer context manager + StartupReport with slowest() / to_dict(),
enabled by SQUISH_TRACE_STARTUP=1. FeatureState dataclass centralises all
_xxx = None server globals. BlazingPreset / auto_blazing_eligible moved
to serving/blazing.py.
New files: squish/serving/startup_profiler.py, squish/serving/feature_state.py,
squish/serving/blazing.py
Blazing mode auto-activates on M3/M4/M5 with ≥16 GB RAM (pass --no-blazing
to disable). cmd_run auto-selects INT2/INT3 based on available RAM vs
model size. _recommend_model() priority order fixed (was recommending llama3.3:70b
on 64+ GB machines). llama3.3:70b catalog entry added with squish_repo.
dev/scripts/upload_to_hub.py gained --all-missing, --batch-file, --int2,
--force, --org flags. catalog.py squish_repo backfilled for 5 models.
GitHub Actions model_upload.yml workflow added for CI-triggered uploads.
SquishBar gained: model picker with active-model checkmark (switchModel()),
"Pull Model…" button with live compression progress bar, global hotkey (⌘⌥S
default, configurable in Settings…), and Makefile release + dmg targets.
New docs/squishbar.md reference page.
README title, badge, and Requirements section updated for multi-platform.
cmd_setup() no longer calls sys.exit(1) on non-Apple platforms; instead
detects backend via get_inference_backend(detect_platform()) and prints
guidance. platform/ module verified with is_apple_silicon, is_cuda,
name, platform_name, and get_inference_backend() all confirmed present.
_CURRENT_WAVE = 95 constant added to cli.py. cmd_version / squish version
subcommand prints version + wave from importlib.metadata. README model count
updated to 40. MODULES.md backfilled with Waves 85–95 summary.
CHANGELOG fully populated through v68.0.0.
All five changes are in squish/server.py. They wire pre-existing modules into
the live request path with minimal overhead.
File: squish/streaming/chunked_prefill.py
Flag: --chunk-prefill (off by default; --chunk-prefill-threshold N)
Change: Removed the _on_compress_path gate so chunked prefill works on
every request path, not just compressed-weight paths.
Impact: TTFT −40–60% on prompts > threshold (default 512 tokens).
File: squish/hardware/fused_sampler.py
Flag: enabled by default; disable with --no-fused-sampler
Change: FusedSampler (fused temperature/top-k/top-p/min-p/rep-penalty) is
now the default decode-step sampler, replacing the 4-pass manual chain.
Impact: Sampling latency ~0.35 ms → ~0.08 ms (~4× faster).
File: squish/kv/cache_warmup.py
Flag: enabled by default; disable with --no-cache-warmup
Change: record_access(input_ids[:256], timestamp) is called after
tokenization on every request, enabling predictive pre-warming for repeat
system prompts and frequent prefixes.
Impact: TTFT −20–40% on repeated prefixes (system prompt reuse, chat turns).
File: squish/token/token_merging.py
Flag: --token-merge (off by default)
Change: patch_model_tome() / unpatch_model_tome() are called around the
standard prefill model call for sequences ≥ 64 tokens (layers 4–11).
Impact: Prefill FLOP −18–34% depending on sequence length; PPL delta < 2%.
File: squish/token/layer_skip.py
Flag: --layer-skip (off by default)
Change: ConfidenceEstimator is initialised once per request; each decode
step estimates logit entropy and attempts model(x, layer_limit=exit_layer)
when confidence exceeds threshold. Fallback to full model on TypeError.
Impact: Decode TPS +15–22% on high-confidence generation tasks.
Path: squish/speculative/cascade_spec.py
Flag: --cascade-spec
Purpose: Two-stage speculative decoding combining an EAGLE-3 tree draft
with n-gram lookahead extension.
Key classes:
| Class | Role |
|---|---|
CascadeSpecConfig |
Dataclass holding eagle_depth, ngram_extend, ngram_order, temperature |
CascadeSpecDecoder |
Main decoder; .generate(prompt_ids, max_new_tokens, eos_id) |
CascadeSpecStats |
Latency / acceptance-rate counters |
Algorithm:
- EAGLE-3 tree draft builds candidate tokens from a heuristic head (or loaded
EAGLE-3 head via
set_eagle_head()). - N-gram lookahead extends each tree leaf by
ngram_extendpositions. - Full model verifies the tree; greedy-accept prefix up to first mismatch.
- Stats track
mean_accept_lenanddraft_callsper generation.
Expected throughput: 2.5–3× vs greedy decode on typical prompts.
Path: squish/streaming/adaptive_prefill_fusion.py
Flag: --adaptive-prefill
Purpose: Classifies prompt complexity from token-frequency entropy and
returns a PrefillPlan describing which prefill optimisations to enable.
Key classes:
| Class | Role |
|---|---|
PrefillComplexity |
HIGH / MEDIUM / LOW enum |
PrefillFusionConfig |
Entropy thresholds + per-complexity settings |
PrefillPlan |
Output: use_chunked, use_tome, use_layer_skip, use_ngram |
PrefillFusionController |
.plan(token_ids) → PrefillPlan |
Complexity routing:
- HIGH (diverse/creative): chunked prefill only; no ToMe (entropy too high)
- MEDIUM (chat/QA): ToMe (layers 4–11) + chunked prefill
- LOW (code/templates): ToMe + LayerSkip + n-gram lookahead
Overhead: single entropy estimation pass ~0.01 ms on 2048-token prompts.
Path: squish/speculative/draft_multiplexer.py
Flag: --draft-multiplex
Purpose: Selects the best available draft strategy at runtime using
per-task EMA acceptance rates and throughput scores.
Key classes:
| Class | Role |
|---|---|
DraftStrategy |
NGRAM / EAGLE / MEDUSA / HYDRA / CASCADE enum |
DraftTaskType |
CODING / MATH / RAG / CONVERSATION / UNKNOWN |
DraftMultiplexerConfig |
EMA alpha, cost weight, min samples before EMA |
StrategyStats |
Per-strategy acceptance_rate, tps, n_samples |
DraftMultiplexer |
.select(prompt) → DraftStrategy; .update(strategy, task_type, rate, tps) |
Selection logic:
- Round-robin during init phase (<
min_samplesper strategy) - Regex task classifier: coding/math/RAG/conversation patterns
- EMA score =
acceptance_rate + cost_weight × normalised_tps - Highest score among available strategies wins
Expected gain: +5–7 pp acceptance rate vs fixed strategy selection.
Path: squish/kernels/async_decode_overlap.py
Flag: --async-decode-overlap
Purpose: Pipelines CPU sampling computation for step N with the GPU
(Metal) kernel for step N+1 using a background thread and queue.
Key classes:
| Class | Role |
|---|---|
OverlapConfig |
timeout_ms, max_queue_depth, fallback_sync |
AsyncDecodeOverlap |
.decode_loop(model_forward, first_token_id, max_tokens, eos_id) → Generator[int] |
OverlapStats |
overlap_steps, fallback_steps, timeout_steps |
Algorithm:
- Step N logits sent to background thread for
_sample_np(numpy argmax/top-k) - GPU launches step N+1 kernel while background thread samples step N
queue.SimpleQueuepasses sampled tokens back; timeout forces sync fallback- Overlap rate typically 80–90%; throughput gain +5–10% decoded TPS
Path: squish/attention/per_layer_sparse_attn.py
Flag: --per-layer-sparse
Purpose: Profiles attention head entropy during prefill, then applies a
per-head sparse attention mask during decode for low-entropy (predictable) heads.
Key classes:
| Class | Role |
|---|---|
PerLayerSparseConfig |
entropy_threshold, warmup_steps, ema_alpha, n_layers, n_heads |
HeadProfile |
Per-head EMA entropy + is_sparse flag |
PerLayerSparseAttn |
.profile_prefill(attn_weights_4d) → .sparse_mask(layer) → bool[n_heads] |
Algorithm:
- During prefill: compute entropy of
mean_over_queries(attn_weights)per head - EMA-smooth across requests:
ema = alpha * new + (1-alpha) * old - After
warmup_steps: heads withema_entropy < entropy_threshold→is_sparse = True - Decode:
sparse_mask(layer)returns bitmask for caller to skip compute
Expected reduction: 15–25% attention FLOP in decode on typical prompts; quality impact < 0.5% PPL increase.
Path: squish/speculative/speculative_prefill.py
Flag: --spec-prefill (requires --draft-model)
Purpose: Reduces TTFT by running a draft model over the full prompt to
produce KV states, then having the target model only recompute layers where
the KV diverges (cosine similarity below threshold).
Key classes:
| Class | Role |
|---|---|
SpecPrefillConfig |
similarity_threshold, max_skip_rate, chunk_size |
SpecPrefillStats |
skip_rate, speedup_estimate, recompute_layers |
SpeculativePrefiller |
.prefill(token_ids) → (kv_states, stats) |
Algorithm:
- Draft model forward pass produces KV for all layers
- Consecutive-layer cosine similarity of K matrices used as KV-agreement proxy
- Layers with similarity ≥ threshold are marked for skipping
recompute_maskpassed to target forward; target only runs unmasked layersspeedup_estimate = 1 / (1 − skip_rate)
Expected TTFT reduction: 10% (256 tok) → 22% (4096 tok) when draft and target share architecture.
| Test file | Tests | Status |
|---|---|---|
tests/test_wave27_server_wiring.py |
33 | ✅ passing |
tests/test_wave28_server_wiring.py |
77 | ✅ passing |
| Full suite | 7,672 | ✅ passing |
python dev/benchmarks/bench_wave27_28.py [--runs N] [--vocab N] [--output path]Results saved to dev/results/wave27_28_bench.json.
Reference table: see the per-wave entries below.
| Wave | Version | Theme | New Files |
|---|---|---|---|
| 85 | 58.0.0 | CLI color dedup + README accuracy | — |
| 86 | 59.0.0 | Observability: profiler wiring + squish trace |
squish/serving/obs_report.py |
| 87 | 60.0.0 | Agent tool execution fix | squish/agent/tool_name_map.py |
| 88 | 61.0.0 | Ollama/LocalAI compat gaps | squish/serving/localai_compat.py, squish/serving/backend_router.py |
| 89 | 62.0.0 | Local model scanner + squish pull URI schemes |
squish/serving/local_model_scanner.py |
| 90 | 63.0.0 | Startup profiler + core module extraction | squish/serving/startup_profiler.py, squish/serving/feature_state.py, squish/serving/blazing.py, dev/scripts/import_scan.py |
| 91 | 64.0.0 | Sub-3s TTFT + 70B INT2 loader | — |
| 92 | 65.0.0 | Pre-compress pipeline + HF batch upload | — |
| 93 | 66.0.0 | macOS SquishBar polish | docs/squishbar.md |
| 94 | 67.0.0 | Cross-platform support review | — |
| 95 | 68.0.0 | README final audit + public release | — |
Phase-level startup timing via StartupTimer context manager and StartupReport.
SQUISH_TRACE_STARTUP=1 enables tracing; result accessible at GET /v1/startup-profile.
FeatureState dataclass centralises ~90 previously scattered _xxx = None globals
from server.py into a typed, importable structure.
M3/M4/M5 auto-blazing eligibility (auto_blazing_eligible), BlazingPreset dataclass,
and get_preset(chip, ram_gb) which selects INT4 for ≥ 24 GB RAM configs.
LocalModelScanner discovers Squish, Ollama, and LM Studio models from standard
local directories and exposes them through /api/tags for OpenWebUI compatibility.
AST-based import dependency analyzer. Report A: orphan modules (zero inbound imports).
Report B: server.py globals assigned only None (dead feature flags).
--no-blazingflag disables auto-activation on M3+ for users preferring full context window over sub-3s TTFT.- RAM-aware quant auto-selection: INT2 when model > 75% RAM, INT3 when > 55%.
llama3.3:70bwired with INT2 catalog entry and"impossible"tag.
PlatformInfo (frozen dataclass in squish/platform/detector.py) now exposes:
.is_apple_silicon— True whenkind == MACOS_APPLE_SILICON.is_cuda— alias forhas_cuda.name— lower-case kind string (e.g."macos_apple_silicon").platform_name— human-readable (e.g."Apple Silicon (M3 Pro)")
detect_platform() module-level convenience function added.
get_inference_backend(platform) in platform_router.py returns
"mlx" | "torch_cuda" | "torch_rocm" | "torch_cpu".
| Test file | Tests |
|---|---|
tests/test_wave89_local_model_scan.py |
36 |
tests/test_wave90_startup_lean.py |
33 |
tests/test_wave91_performance.py |
32 |
tests/test_wave92_presquish.py |
25 |
tests/test_wave93_squishbar.py |
37 |
tests/test_wave94_cross_platform.py |
29 |
tests/test_wave95_release.py |
TBD |