EdgeRunner is a from-scratch LLM inference engine in Swift/Metal for Apple Silicon. Built in ~13,000 lines of code across 26 optimization experiments, it produces correct, coherent text at competitive speed.
Model: Qwen3-0.6B Q8_0 | Device: Apple M3 Max | 128-token decode
MLX (Python): 277.8 tok/s ████████████████████████████ target
EdgeRunner (Swift): 234.8 tok/s ███████████████████████▋ 84.5%
llama.cpp: 200.3 tok/s ████████████████████▏ 72.1%
Time to First Token: EdgeRunner 3.4ms vs MLX 77.2ms (22x faster)
Peak Memory: EdgeRunner 4,980MB vs MLX 638MB (7.8x worse)
- Full Llama/Qwen transformer inference on Metal GPU
- Q8_0 quantized weight support (GGUF format)
- KV cache with incremental decode (only processes new token)
- Fused Metal kernels: RMSNorm+GEMV, RoPE+GQA, SwiGLU, dequant+GEMV+residual
- Single-simdgroup GQA mega-kernel (zero threadgroup barriers)
- Correct, coherent output: "The capital of France is Paris", "2 + 2 = 4"
- 22x faster time-to-first-token than MLX
- No BPE tokenizer (can't accept text input, only pre-tokenized IDs)
- Only 1 model supported (Qwen3-0.6B)
- Only Q8_0 quantization
- 8x memory bloat (stores redundant float32 weight copies)
- No streaming output, no sampling (temperature/top-p), no chat interface
Goal: 280+ tok/s decode, <1GB memory
Three kernel-level changes close the gap. No new features, no architecture redesign. Pure autoresearch: modify, benchmark, keep or roll back.
- What: Remove redundant dequantized float32 weight buffers (2.4GB wasted)
- Why: Decode path uses only raw Q8_0 buffers. Float32 copies exist for a prefill fallback that can use Q8_0 fused GEMV instead
- Impact: 4,980MB -> ~2,500MB memory. Better DRAM bandwidth (less page table pressure)
- Risk: Low. Dummy buffer approach avoids changing any call sites
- Effort: 1 day
- What: Merge the final RMSNorm dispatch into the LM Head GEMV kernel
- Why: Eliminates 1 GPU dispatch per decode, saving its ~3us scheduling gap. Each GEMV threadgroup computes the norm cooperatively, the same pattern already used in the fused QKV and Gate+Up+SiLU kernels
- Impact: 141 -> 140 dispatches. ~3us saved. Marginal but architecturally clean
- Risk: Low. Same cooperative RMSNorm pattern proven in Experiments 10-12
- Effort: 1 day
- What: Cache per-block scale factors in registers instead of re-reading from DRAM
- Why: Q8_0 reads 2-byte scale per 32-weight block. Scale accounts for 6% of weight bandwidth. Caching scale across the simdgroup reduces DRAM reads
- Impact: ~5-10% GEMV kernel improvement. 234.8 -> ~250 tok/s
- Risk: Medium. Requires careful register pressure management
- Effort: 2-3 days
- What: Warm GPU pipeline at diverse kvSeqLen values (1, 4, 16, 32, 64, 128) instead of just 5 consecutive positions
- Why: GPU profiling showed first-call JIT penalty at new kvSeqLen values
- Impact: Eliminates residual cold-start penalties across the 128-token benchmark
- Risk: Very low. One-time cost at first prefill
- Effort: 0.5 day
- 128-token decode throughput > 278 tok/s (beat MLX)
- Peak memory < 2,500 MB
- Correctness: greedy tokens [1, 1479, 35, 5371, 1] unchanged
- Coherence: "The capital of France is Paris" still works
Goal: Accept text input, produce text output. No Python dependency.
- What: Load vocabulary + merge rules from GGUF metadata, implement BPE encode/decode in Swift
- Why: Currently uses raw UTF-8 bytes as "tokens" which is wrong. GGUF already contains the full Qwen3 BPE vocabulary (151,936 tokens) and merge table (151,387 rules)
- Implementation: Port the GPT-2 BPE algorithm (byte-level BPE with merge priorities)
- Effort: 3-5 days
- What: Parse and apply Jinja2-style chat templates from GGUF metadata
- Why: Qwen3 requires `<|im_start|>user\n...<|im_end|>` formatting. The template is already in the GGUF file
- Implementation: Minimal Jinja2 subset (variable substitution, for loops, conditionals). No full Jinja2 needed
- Effort: 2-3 days
- What: Temperature, top-p, top-k, repetition penalty, min-p
- Why: Greedy decode produces repetitive output. Real generation needs sampling
- Implementation: The `SamplingConfiguration` struct already exists, and `GreedySampler` and `MinPSampler` are in EdgeRunnerCore. Wire them into the Llama decode path
- Effort: 1-2 days
- `model.generate(prompt: "What is the meaning of life?")` returns a coherent multi-sentence response
- Chat mode with proper turn-taking
- No Python dependency for tokenization
- Temperature and top-p sampling produce varied, high-quality output
Goal: Run Llama 3, Mistral, Phi-3, Gemma on the same engine
- What: Factor model-specific config (RoPE type, norm type, activation, head arrangement) out of LlamaLanguageModel into a config-driven architecture
- Why: Currently hardcoded for Qwen3's NeoX RoPE + per-head Q/K norm. Llama 3 uses standard RoPE without Q/K norm. Mistral uses sliding window attention
- Effort: 1 week
- What: Add fused dequant+GEMV kernel for Q4_K_M (the most popular GGUF quantization)
- Why: Q4_K_M uses ~4.5 bits/weight vs Q8_0's 8.5. Halves memory and bandwidth. Most models are distributed in Q4_K_M
- Impact: ~2x faster decode (half the weight data). 7B models fit in 4GB RAM
- Effort: 1 week (kernel + Swift integration)
- What: Test and validate each architecture variant
- Why: Coverage. These are the most widely used open models
- Effort: 2-3 days per model family (config + integration tests)
- `LlamaLanguageModel.load(from: anyGGUF)` works for 5+ model families
- Q4_K_M decode throughput > 400 tok/s for 0.6B models
- Llama-3.1-8B-Q4_K_M runs at > 30 tok/s in < 6GB memory
- All models produce coherent output verified by coherence tests
Goal: An API that app developers can ship
```swift
let model = try await EdgeRunner.load("Qwen3-0.6B-Q8_0.gguf")
for try await token in model.stream("Tell me a story") {
    print(token, terminator: "")
}
```
- What: `AsyncSequence`-based token streaming with backpressure
- Why: Users expect to see tokens appear one at a time
- Implementation: `GenerationSession` and `TokenStream` already exist in the codebase. Wire them to LlamaLanguageModel
- What: mmap the GGUF file instead of reading into memory
- Why: Load time drops from ~2 seconds to < 100ms. OS manages page faults
- Impact: Instant model loading, shared memory across processes
- What: Memory pressure monitoring, context length reduction, quantization fallback
- Why: Currently crashes on OOM. Must degrade gracefully for production use
- Clean Swift Package Manager integration
- Streaming output with < 5ms time-to-first-token
- Model loading < 200ms
- No crashes under memory pressure
Goal: A macOS/iOS app or framework that regular people use
- What: Lightweight always-available LLM chat in the menu bar
- Why: Prove the technology works in a real product
- Implementation: SwiftUI, local model management, conversation persistence
- What: Same engine running on iPhone/iPad
- Why: Apple Silicon is the same architecture. Metal kernels work unchanged
- Challenges: Memory constraints (iPhone has 6-8GB shared). Need Q4_K_M for 7B models
- What: Implement Apple's Foundation Models protocol for system-level integration
- Why: Enables EdgeRunner models to be used by any app via the system ML framework
- Implementation: `FoundationModelsBackend.swift` already exists as a stub
- App Store submission
- Real users running local LLMs on their devices
- The 10/10
| Metric | Today | Phase 1 | Phase 3 | Phase 5 |
|---|---|---|---|---|
| Decode (0.6B, 128tok) | 235 tok/s | 280+ tok/s | 400+ tok/s (Q4) | 400+ tok/s |
| Decode (7B, 128tok) | N/A | N/A | 30+ tok/s | 40+ tok/s |
| Memory (0.6B) | 4,980 MB | <2,500 MB | <800 MB | <700 MB |
| Memory (7B) | N/A | N/A | <6 GB | <5 GB |
| TTFT | 3.4 ms | 3.4 ms | <10 ms | <5 ms |
| Model load | ~2s | ~2s | <200ms | <100ms |
| Models supported | 1 | 1 | 5+ | 10+ |
| Quantization formats | Q8_0 | Q8_0 | Q4_K_M, Q8_0 | Q4_0-Q8_0, F16 |
| Issue | Severity | Phase to Fix |
|---|---|---|
| 4.9GB peak memory (redundant float32 caches) | High | Phase 1 |
| No BPE tokenizer (UTF-8 byte hack) | High | Phase 2 |
| Only Qwen3 architecture supported | High | Phase 3 |
| Prefill seqLen>1 uses separate (slower) code path | Medium | Phase 1 |
| Metal 4 path not tested on real macOS 26 hardware | Medium | Phase 3 |
| LlamaLanguageModel.swift is 2,337 lines | Medium | Phase 3 |
| No error recovery on GPU OOM | Medium | Phase 4 |
| Decode warmup adds ~18ms to first generation | Low | Acceptable |
| Non-deterministic output across benchmark runs | Low | Phase 3 |
| GPT-2 module (Module/, Transformer/) is unused dead code | Low | Phase 3 |
- Fused kernels — merging RMSNorm into GEMV, RoPE into GQA: +67% (Exp 10)
- Single command buffer — 1 encoder for entire forward pass: +191% (Exp 4)
- Q8_0 fused dequant+GEMV — read quantized weights directly: +41% (Exp 6)
- Single-simdgroup GQA — zero barriers, 4 dims/thread: +13% (Exp 21)
- Decode warmup — pre-warm GPU pipeline states: +16% (Exp 16)
- f16 accumulation — NaN from precision loss (Exp 22)
- NR=4 GEMV — register pressure kills occupancy (Exp 15a)
- 2-simdgroup GEMV — occupancy loss exceeds dispatch savings (Exp 19)
- Flash-decode (extra dispatches) — dispatch overhead > parallelism gain at 128 tokens (Exp 26)
- Swift COW reusable arrays — COW semantics fight in-place mutation (Exp 25)
Apple Silicon GEMV is occupancy-limited: more small threadgroups with 1 simdgroup (32 threads) each outperform fewer large threadgroups. Bandwidth utilization = f(outstanding memory requests) = f(active threadgroups). Any optimization that reduces threadgroup count—even if it reduces total work—hurts effective bandwidth.
The remaining gap to MLX is structural, not algorithmic: Q8_0 reads 20% more data per weight than MLX's 8-bit format, and 142 dispatches create 0.31ms of GPU scheduling overhead that MLX's graph-level lazy evaluation avoids. Closing this gap requires format-level and dispatch-level changes, not kernel-level tuning.
March 2026
0.058 tok/s ▏ Experiment 0: baseline (broken)
3.57 tok/s █ Exp 1: GPU LM head + buffer cache
16.1 tok/s ████ Exp 4: single command buffer
24.0 tok/s ██████ Exp 6: fused Q8_0 dequant+GEMV
64.0 tok/s ████████████████ Exp 9: KV cache decode
120 tok/s ██████████████████████████████ Exp 10: kernel fusion
150 tok/s █████████████████████████████████████ Exp 12: fused prefill
210 tok/s ████████████████████████████████████████████████████ Exp 13: mega-kernel
235 tok/s ███████████████████████████████████████████████████████████ Exp 21: 1-SG GQA
Total improvement: 4,052x from baseline
vs llama.cpp: +17%
vs MLX: -15% (closing)