You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Optimize loops through transformations like unrolling, fusion, tiling, and vectorization.
version
1.0.0
tags
compilation
optimization
parallelism
pldi
difficulty
advanced
languages
c++
rust
python
dependencies
dataflow-analysis-framework
ssa-constructor
Loop Optimizer
Loop optimizations transform loops to improve performance by reducing overhead, increasing parallelism, and improving cache locality. Since programs spend most time in loops, these optimizations are critical for performance.
When to Use This Skill
Building optimizing compilers
Improving numerical code performance
Enabling vectorization
Optimizing for cache hierarchy
Parallelizing loop nests
What This Skill Does
Loop Unrolling: Replicate loop body to reduce branch overhead
Loop Fusion: Combine multiple loops into one
Loop Tiling: Reorganize iteration for cache locality
Loop Vectorization: Transform for SIMD execution
Loop Interchange: Reorder nested loops for memory access
Key Concepts
Concept
Description
Loop Unrolling
Replicate body to reduce branch overhead
Loop Fusion
Combine loops with same bounds
Loop Tiling
Reorganize for cache blocks
Loop Interchange
Swap nested loop order
Vectorization
Execute multiple iterations in parallel
Tips
Profile before optimizing (identify hot loops)
Consider unrolling factor carefully (code size vs. speed)
Tile for L1 cache size (typically 32KB)
Align data for vectorization
Combine transformations synergistically
Common Use Cases
Matrix operations (multiply, transpose)
Image processing kernels
Scientific computing
Machine learning operators
Compression/decompression
Related Skills
dataflow-analysis-framework - Dependency analysis
constant-propagation-pass - Combine with loop optimizations
llvm-backend-generator - LLVM loop passes
ssa-constructor - Loop in SSA form
Canonical References
Reference
Why It Matters
Lam, Rothberg & Wolf, "The Cache Performance and Optimizations of Blocked Algorithms" (PLDI 1991)
Classic paper on tiling for cache locality
Wolf & Lam, "A Data Locality Optimizing Algorithm" (PLDI 1991)
Unified loop transformation framework
Allen & Kennedy, "Optimizing Compilers for Modern Architectures: A Dependence-based Approach" (Morgan Kaufmann, 2001)
Comprehensive textbook on loop optimization
Feautrier, "Dataflow Analysis of Array and Scalar References" (IJPP 1991)
Polyhedral model foundations
Tradeoffs and Limitations
Approach Tradeoffs
Approach
Pros
Cons
Unrolling
Reduces overhead
Code bloat
Tiling
Cache friendly
Complex bounds
Vectorization
SIMD speedup
Alignment requirements
When NOT to Use This Skill
Non-critical loops
When code size matters more than speed
When loops have complex control flow
Limitations
Requires dependency analysis
May increase code size
Hardware-specific tuning needed
Assessment Criteria
A high-quality implementation should have:
Criterion
What to Look For
Correctness
Preserves loop semantics
Performance
Measurable speedup
Generality
Handles various loop patterns
Safety
Checks for dependencies
Quality Indicators
✅ Good: Correct transformation, significant speedup, handles dependencies
⚠️Warning: Works for simple cases, fails on complex dependencies
❌ Bad: Breaks semantics, incorrect loop bounds