
feat(cpu): optional BLAS for matmul forward/backward (2.6x end-to-end speedup)#840

Open
BigJai wants to merge 1 commit into karpathy:master from BigJai:feat/cpu-blas-matmul

Conversation


@BigJai BigJai commented Mar 3, 2026

Summary

Adds opt-in BLAS integration (USE_BLAS=1) for CPU matmul operations in train_gpt2.c, replacing hand-rolled loops with cblas_sgemm calls. This closes the largest performance gap between llm.c's CPU path and PyTorch (which uses BLAS internally).

  • End-to-end training speedup: 2.63x (6454 → 2453 ms/step)
  • Matmul-only speedup: 11.6x vs the tiled kernel (kernel 1)
  • Works with OpenBLAS (Linux), Apple Accelerate (macOS), or any CBLAS-compatible library
  • All changes guarded by #ifdef USE_BLAS; zero behavior change when the flag is off
  • Numerically correct: validation losses match within float32 tolerance

Ref: Discussion #253, where @karpathy greenlit BLAS/SIMD optimizations for the CPU path.

Usage

# Install OpenBLAS (Linux)
sudo apt-get install libopenblas-dev

# Build with BLAS enabled
USE_BLAS=1 make train_gpt2

# Run (BLAS uses its own threading, adjust as needed)
OMP_NUM_THREADS=6 ./train_gpt2

On macOS, USE_BLAS=1 automatically links Apple's Accelerate framework (no install needed).
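The Makefile auto-detection might look roughly like this, modeled on the existing USE_CUDNN pattern (a hypothetical sketch; the PR's actual variable names and flags may differ):

```makefile
ifeq ($(USE_BLAS), 1)
  ifeq ($(shell uname), Darwin)
    # macOS: Accelerate ships with the OS, nothing to install
    CFLAGS += -DUSE_BLAS
    LDFLAGS += -framework Accelerate
  else
    # Linux: link OpenBLAS (or any CBLAS-compatible library)
    CFLAGS += -DUSE_BLAS
    LDLIBS += -lopenblas
  endif
endif
```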

Benchmarks

Hardware: AMD EPYC (Zen 1), 6 cores, 12GB RAM
Model: GPT-2 124M, B=4, T=64, 40 training steps

End-to-end training (train_gpt2)

| Build              | Median ms/step (steps 1-9) | Median ms/step (steps 21-29) |
|--------------------|----------------------------|------------------------------|
| Baseline (no BLAS) | 6454                       | 6528                         |
| USE_BLAS=1         | 2453                       | 2484                         |
| Speedup            | 2.63x                      | 2.63x                        |

Validation loss comparison (numerically equivalent):

Baseline: 5.325532 → 4.416570 → 4.329112 → 4.299987 → 4.291465
BLAS:     5.325531 → 4.416318 → 4.329331 → 4.300208 → 4.291763

Matmul-only (dev/cpu/matmul_forward.c)

B=8, T=1024, C=768, OC=3072 (4 runs each):

| Kernel              | Time (ms) | vs Naive | vs Tiled |
|---------------------|-----------|----------|----------|
| 0: Naive            | 34,312    | 1.0x     |          |
| 1: Tiled (existing) | 8,521     | 4.0x     | 1.0x     |
| 2: BLAS (new)       | 737       | 46.5x    | 11.6x    |

Files Changed

| File                     | Change                                                       | Lines |
|--------------------------|--------------------------------------------------------------|-------|
| Makefile                 | BLAS auto-detection block (follows existing USE_CUDNN pattern) | +25 |
| train_gpt2.c             | #ifdef USE_BLAS in matmul_forward + matmul_backward          | +50   |
| dev/cpu/matmul_forward.c | Kernel 2 (BLAS) for the CPU benchmark harness                | +29   |

Total: +104 lines, clean #ifdef guards, no changes to existing code paths.

Notes

  • macOS Accelerate path compiles but was not benchmarked (Linux test only). macOS testing welcome.
  • The bias addition after SGEMM uses a simple loop; this is a negligible fraction of total matmul time.
  • BLAS libraries handle their own threading internally, so this plays well alongside OpenMP for non-matmul ops.

🤖 Generated with Claude Code

Add opt-in BLAS integration (USE_BLAS=1) for CPU matmul operations,
replacing the hand-rolled triple-loop with cblas_sgemm calls.

Benchmarked on AMD EPYC (Zen 1), 6 cores, GPT-2 124M, B=4 T=64:
- End-to-end training: 2.63x speedup (6454 ms → 2453 ms per step)
- Matmul-only (dev/cpu benchmark): 11.6x faster than tiled kernel

Supports OpenBLAS (Linux), Apple Accelerate (macOS), or any
CBLAS-compatible library. All changes are guarded by #ifdef USE_BLAS;
zero behavior change when the flag is not set.

Ref: Discussion karpathy#253 (Karpathy greenlit BLAS/SIMD optimization)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>