
feat(cpu): optional BLAS for matmul forward/backward (2.6x end-to-end speedup)#840

Open
BigJai wants to merge 1 commit into karpathy:master from BigJai:feat/cpu-blas-matmul

Conversation


@BigJai BigJai commented Mar 3, 2026

Summary

Adds opt-in BLAS integration (USE_BLAS=1) for CPU matmul operations in train_gpt2.c, replacing hand-rolled loops with cblas_sgemm calls. This closes the largest performance gap between llm.c's CPU path and PyTorch (which uses BLAS internally).

  • End-to-end training speedup: 2.63x (6454 → 2453 ms/step)
  • Matmul-only speedup: 11.6x vs the tiled kernel (kernel 1)
  • Works with OpenBLAS (Linux), Apple Accelerate (macOS), or any CBLAS-compatible library
  • All changes guarded by #ifdef USE_BLAS; zero behavior change when the flag is off
  • Numerically correct: validation losses match within float32 tolerance

Ref: Discussion #253, where @karpathy greenlit BLAS/SIMD optimizations for the CPU path.

Usage

# Install OpenBLAS (Linux)
sudo apt-get install libopenblas-dev

# Build with BLAS enabled
USE_BLAS=1 make train_gpt2

# Run (BLAS uses its own threading, adjust as needed)
OMP_NUM_THREADS=6 ./train_gpt2

On macOS, USE_BLAS=1 automatically links Apple's Accelerate framework (no install needed).
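The Makefile auto-detection might look roughly like this, modeled on the existing USE_CUDNN pattern (a hypothetical sketch; the PR's actual variable names and flags may differ):

```makefile
ifeq ($(USE_BLAS), 1)
  ifeq ($(shell uname), Darwin)
    # macOS: Accelerate ships with the OS, nothing to install
    CFLAGS += -DUSE_BLAS
    LDFLAGS += -framework Accelerate
  else
    # Linux: link OpenBLAS (or any CBLAS-compatible library)
    CFLAGS += -DUSE_BLAS
    LDLIBS += -lopenblas
  endif
endif
```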

Benchmarks

Hardware: AMD EPYC (Zen 1), 6 cores, 12GB RAM
Model: GPT-2 124M, B=4, T=64, 40 training steps

End-to-end training (train_gpt2)

| Build              | Median ms/step (steps 1-9) | Median ms/step (steps 21-29) |
|--------------------|----------------------------|------------------------------|
| Baseline (no BLAS) | 6454                       | 6528                         |
| USE_BLAS=1         | 2453                       | 2484                         |
| Speedup            | 2.63x                      | 2.63x                        |

Validation loss comparison (numerically equivalent):

Baseline: 5.325532 → 4.416570 → 4.329112 → 4.299987 → 4.291465
BLAS:     5.325531 → 4.416318 → 4.329331 → 4.300208 → 4.291763

Matmul-only (dev/cpu/matmul_forward.c)

B=8, T=1024, C=768, OC=3072 (4 runs each):

| Kernel              | Time (ms) | vs Naive | vs Tiled |
|---------------------|-----------|----------|----------|
| 0: Naive            | 34,312    | 1.0x     |          |
| 1: Tiled (existing) | 8,521     | 4.0x     | 1.0x     |
| 2: BLAS (new)       | 737       | 46.5x    | 11.6x    |

Files Changed

| File                     | Change                                                       | Lines |
|--------------------------|--------------------------------------------------------------|-------|
| Makefile                 | BLAS auto-detection block (follows existing USE_CUDNN pattern) | +25 |
| train_gpt2.c             | #ifdef USE_BLAS in matmul_forward + matmul_backward          | +50   |
| dev/cpu/matmul_forward.c | Kernel 2 (BLAS) for the CPU benchmark harness                | +29   |

Total: +104 lines, clean #ifdef guards, no changes to existing code paths.

Notes

  • macOS Accelerate path compiles but was not benchmarked (Linux test only). macOS testing welcome.
  • The bias addition after SGEMM uses a simple loop; this is a negligible fraction of total matmul time.
  • BLAS libraries handle their own threading internally, so this plays well alongside OpenMP for non-matmul ops.

🤖 Generated with Claude Code

Add opt-in BLAS integration (USE_BLAS=1) for CPU matmul operations,
replacing the hand-rolled triple-loop with cblas_sgemm calls.

Benchmarked on AMD EPYC (Zen 1), 6 cores, GPT-2 124M, B=4 T=64:
- End-to-end training: 2.63x speedup (6454 ms → 2453 ms per step)
- Matmul-only (dev/cpu benchmark): 11.6x faster than tiled kernel

Supports OpenBLAS (Linux), Apple Accelerate (macOS), or any
CBLAS-compatible library. All changes are guarded by #ifdef USE_BLAS;
zero behavior change when the flag is not set.

Ref: Discussion karpathy#253 (Karpathy greenlit BLAS/SIMD optimization)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>