Curated kernels from KernelBench Level 2 and the Intel XPU Triton benchmarks, organized by optimization pattern. Each example includes a Triton kernel (.py) and a spec (.yaml). Source files live in test_kernels/; the examples/ directory provides a categorized view via symlinks.

    xe-forge -i examples/gemm/14_Gemm_Divide_Sum_Scaling.py \
        -s examples/gemm/14_Gemm_Divide_Sum_Scaling.yaml \
        -o optimized.py

GEMM with post-matmul elementwise or reduction operations.
| Kernel | Operations |
|---|---|
| 14_Gemm_Divide_Sum_Scaling | GEMM + divide + column sum + scaling |
| 39_Gemm_Scale_BatchNorm | GEMM + scaling + batch normalization |
| 45_Gemm_Sigmoid_LogSumExp | GEMM + sigmoid + log-sum-exp reduction |

Long activation chains fused into a single kernel.
| Kernel | Operations |
|---|---|
| 81_Gemm_Swish_Divide_Clamp_Tanh_Clamp | GEMM + swish + divide + clamp + tanh + clamp |
| 95_Matmul_Add_Swish_Tanh_GELU_Hardtanh | Matmul + add + swish + tanh + GELU + hardtanh |
| 99_Matmul_GELU_Softmax | Matmul + GELU + softmax |
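
To make "fused into a single kernel" concrete, the sketch below applies the chain listed for 81_Gemm_Swish_Divide_Clamp_Tanh_Clamp to each GEMM output element in one pass. It is plain Python rather than Triton, and it is only an illustration of the idea: the function names are hypothetical, and the divisor and clamp bounds are placeholder assumptions, not values taken from the kernel's spec.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function: never exponentiates a large positive value.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x: float) -> float:
    # swish(x) = x * sigmoid(x)
    return x * sigmoid(x)


def fused_epilogue(x: float, divisor: float = 2.0, lo: float = -1.0, hi: float = 1.0) -> float:
    """Apply swish -> divide -> clamp -> tanh -> clamp to one GEMM output element.

    divisor, lo, and hi are assumed placeholder values for illustration.
    """
    x = swish(x)
    x = x / divisor
    x = min(max(x, lo), hi)     # first clamp
    x = math.tanh(x)
    return min(max(x, lo), hi)  # second clamp


def apply_epilogue(tile):
    # One pass over the tile: every element goes through the whole chain
    # before the result is stored, so no intermediate tensor is materialized.
    return [[fused_epilogue(v) for v in row] for row in tile]
```

In an unfused implementation each of the five operations would read and write the full output tensor; here each element is read once and written once.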

Kernels with reduction passes (batch norm, softmax).
| Kernel | Operations |
|---|---|
| 84_Gemm_BatchNorm_Scaling_Softmax | GEMM + batch norm + scaling + softmax |

Attention kernels.
| Kernel | Operations |
|---|---|
| 1_FlashAttention_Fwd | Flash Attention forward (Q @ K^T, softmax, @ V) |
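
The Flash Attention forward pass avoids materializing the full score matrix by processing keys and values in blocks while keeping a running softmax. The sketch below shows that online-softmax recurrence for a single query row in plain Python and compares it against a naive reference. It is an illustration under assumptions, not the kernel in test_kernels/: the function names and block size are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_row_naive(q, keys, values):
    # Reference: compute all scores, then softmax, then the weighted sum of values.
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / z for c in range(dim)]


def attention_row_online(q, keys, values, block=2):
    # Online softmax: visit keys/values block by block, keeping only
    # a running max (m), a running normalizer (l), and an accumulator (acc).
    m = -math.inf
    l = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) for k in k_blk]
        m_new = max(m, max(scores))
        # Rescale earlier partial results to the new max.
        # On the first block m is -inf, so the scale is exp(-inf) = 0.0.
        scale = math.exp(m - m_new)
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

Both functions return the same row up to floating-point rounding, regardless of the block size; only the online version can run without holding every score at once.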

Matmul combined with pooling, min/max, or other non-standard operations.
| Kernel | Operations |
|---|---|
| 55_Matmul_MaxPool_Sum_Scale | Matmul + max pool + sum + scaling |
| 68_Matmul_Min_Subtract | Matmul + row min + subtract |

- Add the kernel `.py` and spec `.yaml` to `test_kernels/`
- Symlink into the appropriate `examples/` category:

      cd examples/gemm
      ln -s ../../test_kernels/MyKernel.py .
      ln -s ../../test_kernels/MyKernel.yaml .

- Update this file
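
The symlink step can also be scripted. The helper below is a minimal sketch that assumes the layout described above (`test_kernels/` and `examples/<category>/` under one repository root); `link_kernel` is a hypothetical name and is not part of the repository.

```python
import os
from pathlib import Path


def link_kernel(repo_root, category, kernel_name):
    """Symlink test_kernels/<kernel_name>.{py,yaml} into examples/<category>/.

    Hypothetical helper: assumes both files already exist in test_kernels/.
    """
    root = Path(repo_root)
    category_dir = root / "examples" / category
    category_dir.mkdir(parents=True, exist_ok=True)

    links = []
    for suffix in (".py", ".yaml"):
        source = root / "test_kernels" / f"{kernel_name}{suffix}"
        if not source.is_file():
            raise FileNotFoundError(source)
        link = category_dir / source.name
        # Relative target (../../test_kernels/<file>), matching the manual
        # ln -s commands, so the links survive moving the checkout.
        target = os.path.relpath(source, start=category_dir)
        if not link.is_symlink():
            link.symlink_to(target)
        links.append(link)
    return links
```

For example, `link_kernel(".", "gemm", "MyKernel")` run from the repository root creates the same two links as the commands in the list. Updating this file remains a manual step.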