Just a Byte

Hi! I'm Patrick Toulme, a compiler and performance engineer. I write Just a Byte — a blog about AI compilers, silicon, and systems.

This repo contains companion code, IR dumps, and reproduction scripts for the blog posts.

Posts

Post 1: From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack
Traces 8 lines of JAX through the full TPU compiler pipeline, from HLO through optimization passes to 250 VLIW bundles across five fused kernels.

Post 2: When XLA Isn't Enough: From Pallas to VLIW with Splash Attention on TPU
Explores the limits of XLA's automatic optimization for attention and shows how Pallas custom kernels achieve 6x fewer VLIW bundles and 37x less HBM traffic.

Post 3: CuTile on Blackwell: NVIDIA's Compiler Moat Is Already Built
Traces a Mixture of Experts kernel through NVIDIA's CuTile stack: 86 lines of Python compiled into 1,900 lines of optimized PTX with tcgen05 instructions.

Post 4: Frontier Pretraining Infrastructure Is Already Open Source: GPT-OSS on TPU with MaxText
Shows how MaxText and XLA compress 11,207 HLO instructions into 887 fused kernels, arguing that frontier training infra is already available in open source.
