Commit 0169c4d
Claude skill for Triton kernel development (liger-kernel-dev) (#1170)
## Summary

This PR adds a **Claude Code skill** (`liger-kernel-dev`) that automates the full lifecycle of Triton kernel development for Liger Kernel. Instead of manually writing 8+ files per kernel (ops, transformers wrapper, functional API, exports, tests, benchmarks), a contributor can now describe a PyTorch operation and the skill handles analysis, code generation, and validation through a 3-stage agentic pipeline.

### How it works

1. **Analyze** — An Analyzer agent reads the PyTorch operation (from a local file, URL, code snippet, natural-language description, or model component reference), writes a standalone PyTorch reference implementation, and produces a structured kernel profile classifying the operation into one of three complexity tiers.
2. **Generate** — A Generator agent takes the profile and creates/modifies up to 8 files: Triton forward+backward kernels with `torch.autograd.Function`, an `nn.Module` wrapper, a functional API, `__init__.py` exports, unit tests (parametrized over shapes and dtypes), and benchmark scripts.
3. **Validate** — A Validator agent runs `make checkstyle`, unit tests (a hard gate with 3 retries; it stops on persistent failure), and benchmarks (speed + memory for fwd/bwd/full), and generates plots. Optionally runs `ncu` profiling.

### Key design decisions

- **Progressive disclosure**: SKILL.md is a concise entrypoint (~63 lines); detailed instructions live in separate agent files and templates that are loaded only when needed.
- **Three complexity tiers**: Templates cover element-wise ops (SwiGLU-like), reduction ops (RMSNorm-like), and fused/complex ops (CrossEntropy-like), with tier-specific patterns.
- **Correctness is a hard gate**: If tests fail after 3 retries, the pipeline stops and reports to the user rather than pushing forward with broken code.
- **Single-dtype benchmarks**: Benchmarks use `model.dtype` (typically bfloat16); multi-dtype coverage is handled by unit tests.
- **Supports both create and modify modes**: Detects intent automatically — creating new kernels goes through the full pipeline; modifying existing kernels skips analysis.

### Skill file structure

```
.claude/skills/liger-kernel-dev/
├── SKILL.md                     # Main entry point (63 lines)
├── analyzer.md                  # Stage 1: understand operation, produce profile
├── generator.md                 # Stage 2: generate all 8 files
├── validator.md                 # Stage 3: checkstyle → tests → benchmarks → plots
├── kernel-profile-format.md     # Kernel profile schema + naming conventions
├── templates/
│   ├── ops-kernel.md            # Triton kernel patterns by tier
│   ├── module-wrapper.md        # nn.Module wrapper pattern
│   ├── functional-api.md        # functional.py modification pattern
│   ├── unit-test.md             # Test file pattern with testing rules
│   └── benchmark.md             # Benchmark script pattern
└── examples/
    ├── swiglu-profile.md        # Tier 1 (element-wise) reference
    ├── rms-norm-profile.md      # Tier 2 (reduction) reference
    └── cross-entropy-profile.md # Tier 3 (fused/complex) reference
```

## Testing Done

- Tested end-to-end on **ReLU Squared** (Tier 1, an element-wise activation used in Nemotron models) → #1171
- Skill generated all 8 files: Triton kernel, module wrapper, functional API, exports, 18 unit tests, and benchmarks
- All tests pass on H100 (float32 + bfloat16)
- Benchmarks show **1.9-3.3x speedup** and **37.5% memory savings** vs PyTorch
- Full PR including benchmark plots was generated end-to-end by the skill
- Iteratively refined the skill based on issues found during testing (single-dtype benchmarks, template accuracy)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
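The retry-gated control flow of the three stages can be sketched in Python (hypothetical names throughout; the actual skill drives Claude Code agents from markdown instructions rather than running Python):

```python
# Hypothetical sketch of the 3-stage pipeline with its test-retry hard gate.
from dataclasses import dataclass

MAX_TEST_RETRIES = 3  # correctness is a hard gate: stop after 3 failed attempts

@dataclass
class KernelProfile:
    operation_name: str
    tier: int  # 1 = element-wise, 2 = reduction, 3 = fused/complex

def run_pipeline(operation_description, analyze, generate, validate):
    profile = analyze(operation_description)        # Stage 1: Analyze
    files = generate(profile)                       # Stage 2: Generate
    for attempt in range(1, MAX_TEST_RETRIES + 1):  # Stage 3: Validate
        ok, report = validate(files)
        if ok:
            return files, report
        # retry: regenerate with the failure report as feedback
        files = generate(profile, feedback=report)
    raise RuntimeError("tests still failing after retries; stopping")
```

In the real skill each callable is an agent with its own instruction file, and a human checkpoint sits between stages.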
1 parent 07dd9be commit 0169c4d

File tree: 13 files changed, +1094 −0 lines
`.claude/skills/liger-kernel-dev/SKILL.md`: 63 additions & 0 deletions
---
name: liger-kernel-dev
description: "Develops production-ready Triton kernels for Liger Kernel. Creates new kernels from PyTorch operations (local files, URLs, code snippets, or natural language) with ops, module wrappers, functional APIs, unit tests, benchmarks, and plots. Also modifies existing Liger kernels. Use when adding a new Triton kernel, converting a PyTorch operation to Triton, or updating an existing Liger kernel."
---

# Liger Kernel Dev

Develops Triton kernels for Liger Kernel through a 3-stage pipeline with human review between stages. Supports creating new kernels and modifying existing ones. NVIDIA GPUs only.

## Mode Detection

- **Create mode**: User asks to create/add/generate/write/build a new kernel → full pipeline
- **Modify mode**: User asks to update/fix/change/extend an existing kernel → skip Analyze, modify files, then Validate

## Pipeline (Create Mode)

### Stage 1: Analyze

Spawn an **Analyzer** agent (read [analyzer.md](analyzer.md)).

Accepts any input: local file, URL, code snippet, natural language description, or model component reference. Produces a standalone PyTorch reference implementation and a kernel profile.

**Human checkpoint:** Present PyTorch reference + kernel profile. Confirm before proceeding.

### Stage 2: Generate

Spawn a **Generator** agent (read [generator.md](generator.md)).

Generates/modifies up to 8 files:

1. `src/liger_kernel/ops/{kernel}.py` — NEW Triton kernels + autograd Function
2. `src/liger_kernel/transformers/{kernel}.py` — NEW nn.Module wrapper
3. `src/liger_kernel/transformers/functional.py` — MODIFY: add functional API
4. `src/liger_kernel/ops/__init__.py` — MODIFY: export Function class
5. `src/liger_kernel/transformers/__init__.py` — MODIFY: export Module + `__all__`
6. `test/transformers/test_{kernel}.py` — NEW unit tests
7. `benchmark/scripts/benchmark_{kernel}.py` — NEW benchmark script
8. `benchmark/data/all_benchmark_data.csv` — MODIFY (after benchmarks run)

**Human checkpoint:** Present changes for review.

### Stage 3: Validate

Spawn a **Validator** agent (read [validator.md](validator.md)).

Runs checkstyle, unit tests (hard gate — stops on persistent failure), benchmarks, and generates plots. Optionally runs ncu profiling.

**Human checkpoint:** Report final results with benchmark numbers and plots.

## Pipeline (Modify Mode)

1. Read existing kernel files to understand the current implementation
2. Understand the requested modification
3. Make targeted changes (Generator handles this)
4. Run the full Validate stage (same as create mode)

## Reference Files

- [kernel-profile-format.md](kernel-profile-format.md) — Kernel profile schema and field descriptions
- [examples/swiglu-profile.md](examples/swiglu-profile.md) — Tier 1 (element-wise) reference
- [examples/rms-norm-profile.md](examples/rms-norm-profile.md) — Tier 2 (reduction) reference
- [examples/cross-entropy-profile.md](examples/cross-entropy-profile.md) — Tier 3 (fused/complex) reference
- Templates in [templates/](templates/) — Code generation patterns for each file type
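The wiring between the generated ops file, module wrapper, and functional API follows PyTorch's standard custom-autograd pattern. A minimal stand-in (plain PyTorch in place of a Triton launch; illustrative names, not Liger's actual classes):

```python
import torch

class MySquaredFunction(torch.autograd.Function):
    """Stand-in for the Triton-backed Function in src/liger_kernel/ops/."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x  # a real kernel would launch a Triton grid here

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * 2 * x  # d(x^2)/dx = 2x

class MySquared(torch.nn.Module):
    """Stand-in for the nn.Module wrapper in src/liger_kernel/transformers/."""
    def forward(self, x):
        return MySquaredFunction.apply(x)

def my_squared(x):
    """Stand-in for the functional API entry in transformers/functional.py."""
    return MySquaredFunction.apply(x)
```

The `__init__.py` edits then just re-export `MySquaredFunction` and `MySquared` so both spellings are importable.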
`.claude/skills/liger-kernel-dev/analyzer.md`: 62 additions & 0 deletions
# Analyzer Agent

Understands a PyTorch operation from any input form and produces a standalone PyTorch reference implementation + kernel profile.

## Input Handling

The user may provide the operation in any form:

1. **Local file path** → Read the file directly
2. **URL** (GitHub, HuggingFace, etc.) → Fetch via WebFetch tool
3. **Code snippet** → Pasted in the conversation
4. **Natural language** → Mathematical description (e.g., "element-wise SiLU(x) * y")
5. **Model component** → e.g., "the MLP in Phi-4" — locate in transformers source and extract

## Steps

### 1. Understand the Operation

- Read/fetch the source code from whatever input was provided
- Identify the mathematical operation (forward pass)
- Derive the backward pass (gradient computation)
- Identify all inputs, outputs, and their expected shapes/dtypes
- Note any precision-sensitive operations that need float32 upcasting (sigmoid, rsqrt, exp, log, tanh)

### 2. Write PyTorch Reference

Create a standalone implementation that:

- Depends only on `torch` (no external libraries)
- Implements both forward and backward behavior (either as an `nn.Module` or a plain function)
- Will serve as the correctness baseline for testing
- Is clean, readable, and well-named

### 3. Classify Into Tier

Read [kernel-profile-format.md](kernel-profile-format.md) for the full schema.

**Tier 1 — Element-wise**: No reductions across dimensions. One row per program. Examples: SwiGLU, GeGLU, DyT.
- Read reference: `src/liger_kernel/ops/swiglu.py`

**Tier 2 — Reduction**: Cross-column reductions (tl.sum, tl.max). May need to save intermediate state for backward. May need SM-based parallelism for weight gradient reduction. Examples: RMSNorm, LayerNorm, Softmax, Sparsemax.
- Read reference: `src/liger_kernel/ops/rms_norm.py`

**Tier 3 — Fused/Complex**: Multi-pass algorithms, gradient-in-forward tricks, multiple outputs. Examples: CrossEntropy, FusedLinearCrossEntropy.
- Read reference: `src/liger_kernel/ops/cross_entropy.py`

Also read the closest example profile:

- Tier 1 → [examples/swiglu-profile.md](examples/swiglu-profile.md)
- Tier 2 → [examples/rms-norm-profile.md](examples/rms-norm-profile.md)
- Tier 3 → [examples/cross-entropy-profile.md](examples/cross-entropy-profile.md)

### 4. Produce Kernel Profile

Fill in all fields from [kernel-profile-format.md](kernel-profile-format.md).

### 5. Present to User

Show:

1. The PyTorch reference implementation (full code)
2. The kernel profile (all fields)
3. Which existing kernel is closest (for the Generator to use as reference)

Wait for user confirmation before proceeding to Stage 2.
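As a concrete illustration of step 2, a standalone reference for SwiGLU (the Tier 1 example above) could be as small as this; `swiglu_reference` is an illustrative name, not necessarily what the Analyzer emits:

```python
import torch

def swiglu_reference(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Standalone SwiGLU reference: silu(a) * b, where silu(x) = x * sigmoid(x).
    Depends only on torch; autograd supplies the backward, so this doubles as
    the correctness baseline for the generated kernel's unit tests."""
    return torch.nn.functional.silu(a) * b
```

Tests for the Triton kernel then compare outputs and gradients against this function.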
`.claude/skills/liger-kernel-dev/examples/cross-entropy-profile.md`: 67 additions & 0 deletions
# Kernel Profile: CrossEntropy (Tier 3 — Fused/Complex)

## Identity
- operation_name: cross_entropy
- function_class_name: LigerCrossEntropyFunction
- module_class_name: LigerCrossEntropyLoss
- functional_name: liger_cross_entropy

## Classification
- tier: 3
- tier_description: fused/complex
- closest_existing_kernel: fused_linear_cross_entropy (extends this with linear layer fusion)

## Forward Pass
- forward_inputs:
  - _input: shape (B*T, V), logits
  - target: shape (B*T,), label indices
  - weight: shape (V,) or None, class weights
  - ignore_index: int, label to ignore
  - label_smoothing: float
  - reduction: str, "mean" | "sum" | "none"
  - softcap: float or None
- forward_outputs:
  - loss: scalar, or shape (B*T,) if reduction="none"
  - z_loss: optional auxiliary loss
  - token_accuracy: optional accuracy metric
  - predicted_tokens: optional argmax tokens
- forward_computation: Two-pass online softmax + cross entropy loss with optional smoothing and softcapping
- precision_sensitive_ops: [exp, log]

## Backward Pass
- backward_saved_tensors: [_input] — gradient is computed during forward and stored in-place
- backward_recompute: none (gradient-in-forward trick)
- gradient_formulas:
  - d_input: already computed in the forward pass and stored in the _input tensor. Backward just scales by grad_output.

## Tiling Strategy
- grid_dimensions: 1D
- grid_description: one program per row (one row = one token's logits over the vocab)
- block_size_source: custom — iterates over the vocab in chunks of BLOCK_SIZE
- needs_sm_parallelism: false

## Module Parameters
- module_init_params:
  - weight: optional class weights
  - ignore_index: int = -100
  - lse_square_scale: float = 0.0
  - label_smoothing: float = 0.0
  - reduction: str = "mean"
  - softcap: float or None = None
  - return_z_loss: bool = False
- learnable_params: none

## Benchmarking
- benchmark_variable: vocab_size (V)
- benchmark_x_label: "V"
- benchmark_x_values_suggestion: [4096, 8192, 16384, 32768, 65536, 131072]
- benchmark_providers: ["liger", "huggingface"]
- benchmark_fixed_config: {B: 8, T: 512, dtype: torch.bfloat16}

## Key Patterns

- **Online softmax**: Two-pass algorithm. Pass 1: compute the running max and logsumexp. Pass 2: compute softmax and gradients. Avoids materializing the full softmax vector.
- **Gradient-in-forward trick**: The forward kernel computes the gradient and stores it directly in `_input` (overwriting the logits). The backward pass just retrieves this and multiplies by `grad_output`. This saves having to recompute softmax in backward.
- **Constexpr flags for code elimination**: `HAS_WEIGHT`, `HAS_SOFTCAPPING`, `HAS_GRADIENTS`, `RETURN_Z_LOSS`, `RETURN_TOKEN_ACCURACY` — each is `tl.constexpr`, so the compiler removes unused code paths entirely.
- **Chunked vocab iteration**: The kernel loops over the vocabulary in `BLOCK_SIZE` chunks: `for i in range(0, n_cols, BLOCK_SIZE)`. This handles arbitrarily large vocabularies without requiring BLOCK_SIZE >= n_cols.
- **Multiple loss components**: Combines the original CE loss, label smoothing loss, and z-loss (for training stability). Each component contributes to both the loss and the gradient.
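The gradient-in-forward trick can be illustrated in plain PyTorch (a sketch with mean reduction only and none of the profile's options; the real kernel is a chunked Triton implementation that stores the gradient in `_input` in-place):

```python
import torch

class GradInForwardCE(torch.autograd.Function):
    """Sketch of the gradient-in-forward trick: forward computes
    (softmax - one_hot) / n and stashes it, so backward only scales by
    grad_output instead of recomputing softmax."""

    @staticmethod
    def forward(ctx, logits, target):
        n = logits.shape[0]
        rows = torch.arange(n)
        probs = torch.softmax(logits, dim=-1)
        loss = -torch.log(probs[rows, target]).mean()
        grad = probs                 # reuse the softmax buffer, as the kernel
        grad[rows, target] -= 1.0    # overwrites the logits in-place
        ctx.save_for_backward(grad / n)  # /n for the "mean" reduction
        return loss

    @staticmethod
    def backward(ctx, grad_output):
        (grad,) = ctx.saved_tensors
        return grad_output * grad, None  # no gradient for integer targets
```

The payoff is that backward does no softmax work at all, at the cost of overwriting the logits.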
`.claude/skills/liger-kernel-dev/examples/rms-norm-profile.md`: 64 additions & 0 deletions
# Kernel Profile: RMSNorm (Tier 2 — Reduction)

## Identity
- operation_name: rms_norm
- function_class_name: LigerRMSNormFunction
- module_class_name: LigerRMSNorm
- functional_name: liger_rms_norm

## Classification
- tier: 2
- tier_description: reduction
- closest_existing_kernel: layer_norm (similar pattern with different normalization)

## Forward Pass
- forward_inputs:
  - X: shape (B, T, H), input tensor
  - W: shape (H,), weight tensor
  - eps: float, epsilon for numerical stability
  - offset: float, weight offset (0.0 for Llama, 1.0 for Gemma)
  - casting_mode: str, "llama" | "gemma" | "none"
- forward_outputs:
  - Y: shape (B, T, H), normalized output
- forward_computation: Y = (X / RMS(X)) * (W + offset), RMS = sqrt(mean(X^2) + eps)
- precision_sensitive_ops: [rsqrt]

## Backward Pass
- backward_saved_tensors: [X, W, RSTD] — RSTD cached from forward to avoid recomputation
- backward_recompute: none (RSTD is expensive to recompute)
- gradient_formulas:
  - dX: rstd * (dY*(W+offset) - (1/N) * rstd^2 * dot(dY*(W+offset), X) * X)
  - dW: sum over (B,T) of dY * (X * rstd)

## Tiling Strategy
- grid_dimensions: 1D
- grid_description: forward uses one program per row `(n_rows,)`; backward uses SM-based partitioning `(sm_count,)` with `rows_per_program`
- block_size_source: calculate_settings(n_cols)
- needs_sm_parallelism: true (for the dW reduction — each SM accumulates a partial dW, then summed)

## Module Parameters
- module_init_params:
  - hidden_size: int
  - eps: float = 1e-6
  - offset: float = 0.0
  - casting_mode: str = "llama"
  - init_fn: str = "ones"
  - in_place: bool = True
  - elementwise_affine: bool = True
- learnable_params:
  - weight: shape (hidden_size,), init ones or zeros

## Benchmarking
- benchmark_variable: hidden_size
- benchmark_x_label: "hidden_size"
- benchmark_x_values_suggestion: [1024, 2048, 4096, 8192, 16384]
- benchmark_providers: ["liger", "huggingface"]
- benchmark_fixed_config: {M: 4096, eps: 1e-6, dtype: torch.float32}

## Key Patterns

- **RSTD caching**: Forward computes and stores `rstd = rsqrt(mean(X^2) + eps)` — 1 value per row, tiny memory cost, saves 4 ops in backward
- **SM-based backward**: The weight gradient needs a reduction across all rows. Each SM processes `rows_per_program` rows and accumulates into `_dW[sm_id, :]`. Final `dW = _dW.sum(dim=0)`
- **Casting modes as constexpr**: `casting_mode` is `tl.constexpr`, so the compiler eliminates dead branches
- **In-place backward option**: `in_place=True` writes dX into the dY tensor to save memory. Set `False` when dY is needed elsewhere (e.g., the Gemma2 residual)
- **Two kernel variants**: Row-wise for `BLOCK_SIZE > 256 or n_rows < 4096*8`, block-wise otherwise (processes `BLOCK_ROW=16` rows per program for better GPU utilization)
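The dX formula in the profile can be checked numerically against autograd. A plain-PyTorch sketch (`rms_norm_backward_dx` is an illustrative name, not a Liger function):

```python
import torch

def rms_norm_backward_dx(dY, X, W, eps=1e-6, offset=0.0):
    """Evaluates the profile's dX formula:
    dX = rstd * (dY*(W+offset) - (1/N) * rstd^2 * dot(dY*(W+offset), X) * X)
    where rstd = rsqrt(mean(X^2) + eps) and N is the hidden size."""
    n = X.shape[-1]
    rstd = torch.rsqrt(X.pow(2).mean(dim=-1, keepdim=True) + eps)
    g = dY * (W + offset)  # gradient w.r.t. the normalized activations
    return rstd * (g - (rstd.pow(2) / n) * (g * X).sum(dim=-1, keepdim=True) * X)
```

The second term is where the reduction lives: it couples every element of a row to every other element through `mean(X^2)`, which is why RMSNorm is Tier 2 rather than element-wise.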
`.claude/skills/liger-kernel-dev/examples/swiglu-profile.md`: 56 additions & 0 deletions
# Kernel Profile: SwiGLU (Tier 1 — Element-wise)

## Identity
- operation_name: swiglu
- function_class_name: LigerSiLUMulFunction
- module_class_name: LigerSwiGLUMLP
- functional_name: liger_swiglu

## Classification
- tier: 1
- tier_description: element-wise
- closest_existing_kernel: geglu (same structure, different activation)

## Forward Pass
- forward_inputs:
  - a: shape (B, T, H), gate projection output
  - b: shape (B, T, H), up projection output
- forward_outputs:
  - c: shape (B, T, H), silu(a) * b
- forward_computation: c = silu(a) * b, where silu(x) = x * sigmoid(x)
- precision_sensitive_ops: [sigmoid]

## Backward Pass
- backward_saved_tensors: [a, b] (reshaped to 2D in the forward wrapper)
- backward_recompute: recompute silu(a) and sigmoid(a) in backward
- gradient_formulas:
  - da: dc * (silu(a) * (1 - sigmoid(a)) + sigmoid(a)) * b
  - db: dc * silu(a)

## Tiling Strategy
- grid_dimensions: 1D
- grid_description: one program per row, `(n_rows,)`
- block_size_source: calculate_settings(n_cols)
- needs_sm_parallelism: false

## Module Parameters
- module_init_params:
  - config: HuggingFace model config object
- learnable_params:
  - gate_proj: Linear(hidden_size, intermediate_size, bias=False)
  - up_proj: Linear(hidden_size, intermediate_size, bias=False)
  - down_proj: Linear(intermediate_size, hidden_size, bias=False)

## Benchmarking
- benchmark_variable: hidden_size
- benchmark_x_label: "hidden_size"
- benchmark_x_values_suggestion: [1024, 2048, 4096, 8192, 16384]
- benchmark_providers: ["liger", "torch", "torch_compile"]
- benchmark_fixed_config: {BT: 4096, dtype: torch.bfloat16}

## Key Patterns

- **Recomputation over saving**: Forward saves `a, b` but backward recomputes `sigmoid(a)` and `silu(a)` — saves memory, and sigmoid is cheap
- **In-place backward**: Writes gradients directly to `a_ptr` and `b_ptr` (the saved tensors) — saves allocation
- **Float32 for sigmoid**: `a_row` is cast to `tl.float32` before `tl.sigmoid`, and the result is cast back via `.cast(b_row.dtype)`
- **No intermediate allocations**: The forward kernel writes directly to output `c`; the backward kernel overwrites the saved `a, b`
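The da/db formulas and the recompute-in-backward pattern can be verified against autograd. A plain-PyTorch sketch (`swiglu_backward` is an illustrative name; the float32 upcast mirrors the kernel's precision handling):

```python
import torch

def swiglu_backward(dc, a, b):
    """Backward per the profile: recompute sigmoid(a) and silu(a) rather than
    saving them, upcasting to float32 for the precision-sensitive sigmoid.
    da = dc * (silu(a) * (1 - sigmoid(a)) + sigmoid(a)) * b
    db = dc * silu(a)"""
    sig = torch.sigmoid(a.float())  # float32 upcast, as the kernel does
    silu = a.float() * sig
    da = dc * (silu * (1 - sig) + sig).to(a.dtype) * b
    db = dc * silu.to(a.dtype)
    return da, db
```

Only `a` and `b` need to be saved for this backward; `sigmoid(a)` costs one transcendental per element to recompute, which is cheaper than the memory traffic of saving it.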
