Commit f1b7e47
Claude skill for automatically monkey patching HF transformers models (liger-autopatch) (#1167)

## Summary

This PR adds a **Claude Code skill** (`liger-autopatch`) that automates adding Liger Kernel support for new HuggingFace Transformers models. Instead of manually writing 10+ files per model, a contributor can now say *"Add Liger Kernel support for nemotron"* and the skill handles analysis, code generation, testing, and validation through a 3-stage agentic pipeline.

### How it works

1. **Analyze** — A Model Analyzer agent reads the HF `modeling_*.py` source and produces a structured model profile, resolving 12 architectural decisions (norm type, MLP activation, MoE structure, RoPE variant, casting mode, etc.)
2. **Generate** — A Code Generator agent takes the profile and creates/modifies up to 13 files: `lce_forward`, monkey-patch function, `__init__.py` exports, output classes, instance patching tests, convergence tests (bf16 + fp32, FLCE + with_logits + multimodal), revert utilities, and README entry
3. **Validate** — A Validator agent runs the instance patching test, all applicable convergence tests, and `make checkstyle`, with up to 3 retry attempts per step

### Key design decisions

- **Progressive disclosure**: SKILL.md is a concise entrypoint (~55 lines); detailed instructions live in separate agent files and templates
- **Version compatibility**: Uses `try/except ImportError` for availability guards — CI on transformers 4.52.0 naturally skips newer models, no explicit version strings needed
- **Placement enforcement**: Templates explicitly instruct alphabetical ordering for all insertions into existing files (imports, availability checkers, test functions, MINI_MODEL_SETUPS entries)
- **Full convergence coverage**: Generates test entries in all 6 convergence test files (bf16/fp32 x FLCE/with_logits/multimodal) as appropriate for the model type

## Testing Done

- Tested end-to-end on **nemotron** (dense, RMSNorm + SwiGLU + RoPE) → #1165
- Tested on **ministral** (dense, Mistral-family) → #1166
- Both PRs generated fully working code, including the monkey-patch, lce_forward, instance patching tests, and convergence tests across bf16/fp32 FLCE and with_logits variants
- Iteratively refined the skill based on issues found during testing (alphabetical placement, convergence test coverage, checkstyle integration, version compatibility)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
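The version-compatibility guard described above can be sketched as a plain `try/except ImportError` (hedged sketch; `nemotron` is just the example model from this PR, and the flag name is illustrative):

```python
# Availability guard sketch: on a transformers release that lacks the module,
# the flag is simply False and guarded tests are skipped rather than failing.
try:
    from transformers.models.nemotron import modeling_nemotron  # noqa: F401

    NEMOTRON_AVAILABLE = True
except ImportError:
    # e.g. transformers 4.52.0 in CI predates this model
    NEMOTRON_AVAILABLE = False
```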
1 parent 39a1f45 commit f1b7e47

File tree

12 files changed (+934, -0 lines)

Lines changed: 55 additions & 0 deletions
---
name: liger-autopatch
description: "Adds Liger Kernel support for a new HuggingFace Transformers model. Generates lce_forward, monkey-patch function, tests, and README entry. Use when adding a new model to Liger Kernel, when a user asks to patch an unsupported model, or when extending MODEL_TYPE_TO_APPLY_LIGER_FN."
---

# Liger Auto-Patch

Adds Liger Kernel optimization support for a new HuggingFace model through a 3-stage pipeline with human review between stages.

## Pipeline

### Stage 1: Analyze

Spawn a **Model Analyzer** agent (read [model-analyzer.md](model-analyzer.md)).

The agent reads the HF `modeling_*.py` source and produces a **model profile** answering 12 architectural questions from [decision-matrix.md](decision-matrix.md).

**Human checkpoint:** Present the profile. Confirm before proceeding.

### Stage 2: Generate

Spawn a **Code Generator** agent (read [code-generator.md](code-generator.md)).

Generates/modifies up to 13 files:

1. `src/liger_kernel/transformers/model/{model}.py` — NEW lce_forward
2. `src/liger_kernel/transformers/monkey_patch.py` — MODIFY
3. `src/liger_kernel/transformers/__init__.py` — MODIFY
4. `src/liger_kernel/transformers/model/output_classes.py` — MODIFY if needed
5. `test/transformers/test_monkey_patch.py` — MODIFY
6. `test/convergence/bf16/test_mini_models.py` — MODIFY (FLCE path)
7. `test/convergence/bf16/test_mini_models_with_logits.py` — MODIFY (non-FLCE path)
8. `test/convergence/fp32/test_mini_models.py` — MODIFY (FLCE path)
9. `test/convergence/fp32/test_mini_models_with_logits.py` — MODIFY (non-FLCE path)
10. `test/convergence/bf16/test_mini_models_multimodal.py` — MODIFY if VL model
11. `test/convergence/fp32/test_mini_models_multimodal.py` — MODIFY if VL model
12. `test/utils.py` — MODIFY
13. `README.md` — MODIFY

**Human checkpoint:** Present changes for review.

### Stage 3: Validate

Spawn a **Validator** agent (read [validator.md](validator.md)).

Runs the instance patching test, convergence tests, and lint check. Retries up to 3 times on failure.

**Human checkpoint:** Report final test results.

## Reference Files

- [decision-matrix.md](decision-matrix.md) — 12 architectural decisions to resolve per model
- [examples/llama-profile.md](examples/llama-profile.md) — Reference profile for a standard dense model
- [examples/gemma-profile.md](examples/gemma-profile.md) — Reference profile showing the GeGLU + offset variant
- Templates in [templates/](templates/) — Code generation patterns for each file type
Lines changed: 86 additions & 0 deletions
# Code Generator Agent

Takes a confirmed model profile and generates all files to add Liger Kernel support.

## Pre-Requisites

Before generating, read the reference implementation closest to this model:

- Dense → `src/liger_kernel/transformers/model/llama.py`
- MoE → `src/liger_kernel/transformers/model/mixtral.py`
- Vision-Language → `src/liger_kernel/transformers/model/qwen2_vl.py`
- Gemma-family → `src/liger_kernel/transformers/model/gemma.py`

Also read the corresponding patching function in `monkey_patch.py` and the templates in [templates/](templates/).

## Files to Generate

### 1. `src/liger_kernel/transformers/model/{model_type}.py` (NEW)

The `lce_forward` function. See [templates/lce-forward-dense.md](templates/lce-forward-dense.md) or [templates/lce-forward-moe.md](templates/lce-forward-moe.md).

Key rules:

- Match the exact forward signature from HF's `ForCausalLM.forward`
- Use `lce_maybe_trainable_lm_head` from `llama.py` (shared PEFT/FSDP utility)
- If the model needs custom loss args (e.g., softcapping), write a local helper instead

### 2. `src/liger_kernel/transformers/monkey_patch.py` (MODIFY)

Three changes — see [templates/monkey-patch-fn.md](templates/monkey-patch-fn.md):

**A.** Add the lce_forward import (~line 18-28):

```python
from liger_kernel.transformers.model.{model_type} import lce_forward as {model_type}_lce_forward
```

**B.** Add an `apply_liger_kernel_to_{model_type}` function with both class-level and instance-level patching paths.

**C.** Add an entry to the `MODEL_TYPE_TO_APPLY_LIGER_FN` dict (~line 3067).
### 3. `src/liger_kernel/transformers/__init__.py` (MODIFY)

Add the function in three locations (maintain alphabetical order):

- `TYPE_CHECKING` block
- `__getattr__` monkey_patch_symbols set
- `__all__` list extension

### 4. `src/liger_kernel/transformers/model/output_classes.py` (MODIFY if needed)

Only for models needing custom output (MoE with `aux_loss`, VL with `rope_deltas`). Follow the existing guarded-import pattern in the file.

### 5. `test/transformers/test_monkey_patch.py` (MODIFY)

See [templates/test-instance-patch.md](templates/test-instance-patch.md). Add an availability checker plus a skipif-decorated test function using `inspect.getsource()` assertions.
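The idea behind the instance patching assertion can be sketched with toy stand-ins (the class and function names here are illustrative, not the real ones; identity checks replace `inspect.getsource()` so the sketch runs anywhere):

```python
import types

class DummyForCausalLM:
    def forward(self):
        return "original forward"

def lce_forward(self):
    return "liger lce forward"

model = DummyForCausalLM()
# Instance-level patching: bind the replacement to this object only,
# leaving the class (and other instances) untouched.
model.forward = types.MethodType(lce_forward, model)

# The real tests assert via inspect.getsource(model.forward); checking the
# bound function's identity is an equivalent, file-independent variant.
assert model.forward.__func__ is lce_forward
assert model.forward() == "liger lce forward"
```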
### 6. Convergence tests (MODIFY multiple files)

See [templates/test-convergence.md](templates/test-convergence.md). Every model needs entries in multiple convergence test files.

**All text models (dense + MoE)** — add to these 4 files:

- `test/convergence/bf16/test_mini_models.py` — FLCE path, bf16
- `test/convergence/bf16/test_mini_models_with_logits.py` — non-FLCE path (tests RMSNorm/SwiGLU/RoPE only), bf16
- `test/convergence/fp32/test_mini_models.py` — FLCE path, fp32
- `test/convergence/fp32/test_mini_models_with_logits.py` — non-FLCE path, fp32

**Vision-language models** — also add to these 2:

- `test/convergence/bf16/test_mini_models_multimodal.py`
- `test/convergence/fp32/test_mini_models_multimodal.py`

Each file needs: imports, an availability guard, a `MiniModelConfig` entry in the `MINI_MODEL_SETUPS` dict, and a `pytest.param` entry in the parametrize block. The `MiniModelConfig` entry is identical across all files for the same model. The `pytest.param` tolerances differ — use bf16 tolerances (looser) for bf16 files and fp32 tolerances (tighter) for fp32 files. Copy tolerance values from a similar existing model (e.g., Llama for dense, Mixtral for MoE).

### 7. `test/utils.py` (MODIFY)

Add a `revert_liger_kernel_to_{model_type}` function that reloads the modeling module.
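The reload-based revert pattern can be sketched like this (hypothetical helper name; the stdlib `json` module stands in for the HF modeling module so the sketch is self-contained):

```python
import importlib
import json
import types

def revert_liger_kernel_to_my_model(modeling_module: types.ModuleType) -> types.ModuleType:
    """Reload the module so any monkey-patched attributes are restored."""
    return importlib.reload(modeling_module)

# Simulate a monkey patch on the stand-in module, then revert it.
json.dumps = lambda *args, **kwargs: "patched"
assert json.dumps({}) == "patched"

reverted = revert_liger_kernel_to_my_model(json)
assert reverted.dumps({"a": 1}) == '{"a": 1}'  # original behavior restored
```

Reloading re-executes the module body, which is why it cleanly undoes attribute-level patches without tracking what was changed.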
### 8. `README.md` (MODIFY)

Add a row to the Patching table under "### Patching":

```
| {ModelName} | `liger_kernel.transformers.apply_liger_kernel_to_{model_type}` | {Supported Operations} |
```

## Code Style

- Line length 120, double quotes, single imports sorted with isort
- Follow exact patterns from existing code — do not innovate on style
- When modifying existing files, insert new entries in **alphabetical order** alongside similar existing entries. Never append to the end of a section — find the correct alphabetical position.
- After generating all files, run `make checkstyle` to verify formatting. If it fails, run `ruff check . --fix && ruff format .` to auto-fix, then verify with `make checkstyle` again.
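The alphabetical-placement rule amounts to computing an insertion index rather than appending; a minimal sketch with `bisect` (export names are illustrative):

```python
import bisect

# Existing alphabetically sorted section, e.g. part of an __all__ list.
exports = [
    "apply_liger_kernel_to_gemma",
    "apply_liger_kernel_to_llama",
    "apply_liger_kernel_to_qwen2",
]
new_export = "apply_liger_kernel_to_mixtral"

# Insert at the correct alphabetical position, never at the end.
exports.insert(bisect.bisect_left(exports, new_export), new_export)

assert exports.index("apply_liger_kernel_to_mixtral") == 2  # between llama and qwen2
```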
Lines changed: 130 additions & 0 deletions
# Decision Matrix

When analyzing a HuggingFace model for Liger Kernel support, you must resolve these 12 architectural decisions by reading the model's `modeling_*.py` source code.

## 1. Norm Type

**Question:** Does the model use RMSNorm, LayerNorm, or both?

**How to detect:**

- Search for `class *RMSNorm` in the modeling file → RMSNorm
- Search for `nn.LayerNorm` usage → LayerNorm
- Multimodal models often use both (RMSNorm for text, LayerNorm for vision)

**Liger mapping:** `LigerRMSNorm` or `LigerLayerNorm`

## 2. RMSNorm Casting Mode

**Question:** How does the model handle dtype casting during normalization?

**How to detect:** Read the RMSNorm forward method:

- Casts input to fp32, computes variance, casts back → `"gemma"`
- Computes variance in fp32 only (input stays in its original dtype) → `"llama"`
- No casting at all → `"none"`

**Default:** `"llama"` (most common)

## 3. RMSNorm Offset

**Question:** Does the weight have a +1.0 offset?

**How to detect:** In the RMSNorm forward, look for `(1 + self.weight)` or `self.weight + 1`:

- Present → `offset=1.0` (Gemma family)
- Absent → `offset=0.0` (most models)
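Decisions 2 and 3 both come down to how the reference RMSNorm computes its output; a dtype-free pure-Python sketch (lists stand in for tensors, and the fp32 casting decision is omitted) makes the offset visible:

```python
import math

def rms_norm(xs, weight, eps=1e-6, offset=0.0):
    # Reference RMSNorm on plain lists: scale by the reciprocal RMS,
    # then by (offset + weight). offset=1.0 reproduces Gemma's
    # `1 + self.weight` pattern; offset=0.0 is the Llama-style default.
    variance = sum(x * x for x in xs) / len(xs)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [(offset + w) * x * inv_rms for x, w in zip(xs, weight)]

# A zero-initialized weight with offset=1.0 behaves like a ones weight with
# offset=0.0 — which is why the offset must be detected, not guessed.
assert rms_norm([3.0, 4.0], [0.0, 0.0], offset=1.0) == rms_norm([3.0, 4.0], [1.0, 1.0], offset=0.0)
```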
## 4. RMSNorm In-Place

**Question:** Can the backward pass modify dY in-place?

**How to detect:** Check if the model has two sequential norm layers with a residual connection between them (like Gemma2's `pre_feedforward_layernorm` + `post_feedforward_layernorm`):

- Sequential norms with residual → `in_place=False`
- Otherwise → `in_place=True`

## 5. MLP Activation Type

**Question:** What activation function does the gated MLP use?

**How to detect:** Read the MLP class forward method:

- `silu` or `F.silu` → SwiGLU → `LigerSwiGLUMLP`
- `gelu`, `gelu_new`, or `gelu_fast` → GeGLU → `LigerGEGLUMLP`
- Phi3-style (single gate+up projection split) → `LigerPhi3SwiGLUMLP`

**Also check:** The config's `hidden_act` field.
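The gated-MLP shape the detector looks for can be sketched without tensors (scalars stand in for the three projections; helper names are illustrative):

```python
import math

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))

def gelu_tanh(x: float) -> float:
    # tanh approximation of GELU, as in hidden_act="gelu_pytorch_tanh"
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def gated_mlp(x, gate_w, up_w, down_w, act):
    # The pattern to spot in the MLP forward:
    #   down_proj(act(gate_proj(x)) * up_proj(x))
    # SwiGLU when act is silu, GeGLU when act is a gelu variant.
    return down_w * (act(gate_w * x) * (up_w * x))
```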
## 6. Dense vs MoE

**Question:** Is the model dense, MoE, or hybrid (some layers dense, some MoE)?

**How to detect:**

- Search for `Expert`, `MoE`, `SparseMoe`, `TopK` routing classes
- Check if decoder layers have a `block_sparse_moe` or `experts` attribute
- Hybrid: check for `is_moe_layer` or conditional MoE per layer

**Liger mapping:**

- Dense → standard patching
- MoE (transformers v5) → `LigerExperts`
- MoE (transformers v4) → `LigerBlockSparseTop2MLP`
- Qwen3-style MoE → `LigerQwen3MoeSwiGLUMLP`

## 7. Vision Components

**Question:** Does the model have a vision encoder?

**How to detect:**

- Check for `pixel_values` in the `forward` signature
- Look for a separate vision model class (e.g., `*VisionModel`)
- Check the config for `vision_config` or `text_config` sub-configs

**If yes:**

- Vision encoder norms are usually `nn.LayerNorm` → patch with `LigerLayerNorm`
- Text and vision must be patched separately

## 8. RoPE Variant

**Question:** What type of positional embedding does the model use?

**How to detect:** Search for the `apply_rotary_pos_emb` function:

- Standard (q, k, cos, sin) → `liger_rotary_pos_emb` (rope=True)
- Llama4-style → `liger_llama4_text_rotary_pos_emb`
- Qwen2VL MRoPE → `liger_multimodal_rotary_pos_emb`
- No rotary embedding or a custom variant → `rope=False`
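A dtype-free sketch of the standard rotate-half signature the detector matches (lists as single-head stand-ins; illustrative only, not the HF implementation):

```python
def rotate_half(x):
    # [x1, x2] -> [-x2, x1]: the rotate-half trick used by standard RoPE
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

def apply_rotary_pos_emb(q, k, cos, sin):
    # The standard (q, k, cos, sin) signature: q*cos + rotate_half(q)*sin
    q_rot = [qi * c + ri * s for qi, ri, c, s in zip(q, rotate_half(q), cos, sin)]
    k_rot = [ki * c + ri * s for ki, ri, c, s in zip(k, rotate_half(k), cos, sin)]
    return q_rot, k_rot

# Zero angle (cos=1, sin=0) leaves q and k unchanged.
assert apply_rotary_pos_emb([1.0, 2.0], [3.0, 4.0], [1.0, 1.0], [0.0, 0.0]) == ([1.0, 2.0], [3.0, 4.0])
```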
## 9. Output Class
93+
94+
**Question:** What return type does the model's ForCausalLM.forward use?
95+
96+
**How to detect:** Read the return statement and type annotation:
97+
- Standard → `LigerCausalLMOutputWithPast`
98+
- MoE (has `aux_loss`) → `LigerMoeCausalLMOutputWithPast`
99+
- Custom VL output → create model-specific output class in `output_classes.py`
100+
101+
## 10. Hidden State Access
102+
103+
**Question:** How does the model access hidden states from base model output?
104+
105+
**How to detect:** In the ForCausalLM.forward, after calling `self.model(...)`:
106+
- `outputs[0]` → most models (Llama, Mistral, Gemma, etc.)
107+
- `outputs.last_hidden_state` → Phi3, Qwen3.5 MoE, some newer models
108+
109+
## 11. Logit Softcapping
110+
111+
**Question:** Does the model apply softcapping to logits before loss?
112+
113+
**How to detect:** Check config for `final_logit_softcapping`:
114+
- Present → pass `final_logit_softcapping=self.config.final_logit_softcapping` to `LigerForCausalLMLoss`
115+
- Absent → no softcapping (most models)
116+
- **VL models:** Config path may be `self.config.text_config.final_logit_softcapping` instead of `self.config.final_logit_softcapping`. Check whether the model uses a composite config with `text_config` sub-config.
117+
118+
**Models with softcapping:** Gemma2, Gemma3
119+
120+
## 12. Decoder Layer Norm Names
121+
122+
**Question:** What are the attribute names of norm layers in each decoder layer?
123+
124+
**How to detect:** Read the decoder layer class `__init__`:
125+
- Standard: `input_layernorm`, `post_attention_layernorm`
126+
- Gemma2 extra: `pre_feedforward_layernorm`, `post_feedforward_layernorm`
127+
- GLM4: `post_self_attn_layernorm`, `post_mlp_layernorm`
128+
- Some models: `q_norm`, `k_norm` on self_attn
129+
130+
Also check the final norm on the base model (usually `model.norm` or `model.final_layernorm`).
Lines changed: 80 additions & 0 deletions
# Model Profile: Gemma

This profile demonstrates the key differences from Llama: GeGLU activation, RMSNorm offset, and Gemma casting mode.

## Identity

- model_type: gemma
- causal_lm_class: GemmaForCausalLM
- base_model_class: GemmaModel
- base_model_prefix: "model"
- modeling_module: transformers.models.gemma.modeling_gemma
- config_module: transformers.models.gemma.configuration_gemma

## Normalization

- norm_class: GemmaRMSNorm
- norm_type: RMSNorm
- casting_mode: gemma (everything cast to fp32, then computed, then cast back)
- offset: 1.0 (weight uses the `1 + self.weight` pattern)
- in_place: true
- final_norm_attr: model.norm
- decoder_norm_attrs:
  - input_layernorm
  - post_attention_layernorm
- attn_norm_attrs: none

## MLP

- mlp_class: GemmaMLP
- activation: gelu (uses GELU activation, not SiLU)
- liger_mlp_class: LigerGEGLUMLP
- gate_proj_attr: gate_proj
- up_proj_attr: up_proj
- down_proj_attr: down_proj

## Structure

- type: dense
- moe_expert_class: n/a
- moe_router_class: n/a
- shared_expert: false

## Positional Embedding

- rope_type: standard
- rope_function: apply_rotary_pos_emb

## Output

- output_class: LigerCausalLMOutputWithPast
- hidden_state_access: outputs[0]
- has_logit_softcapping: false
- softcapping_config_attr: none

## Vision

- has_vision: false

## Forward Signature

Same as Llama — no extra parameters.

## Mini Model Config

```python
GemmaConfig(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=2,
    vocab_size=1024,
    rms_norm_eps=1e-6,
    hidden_activation="gelu_pytorch_tanh",
)
```

## Key Differences from Llama

1. **Activation**: Uses the `geglu` parameter (not `swiglu`) in the patch function
2. **RMSNorm**: Requires `offset=1.0` and `casting_mode="gemma"`
3. **MLP class**: `LigerGEGLUMLP` instead of `LigerSwiGLUMLP`
4. **Patching uses partial**: `_patch_rms_norm_module_for_gemma = partial(_patch_rms_norm_module, casting_mode="gemma", offset=1.0)`
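The `partial` specialization in item 4 can be sketched with a toy stand-in for `_patch_rms_norm_module` (a dict records the settings instead of a real module being patched):

```python
from functools import partial

def _patch_rms_norm_module(module, casting_mode="llama", offset=0.0, in_place=True):
    # Toy stand-in: record the settings rather than swapping the forward method.
    module.update(casting_mode=casting_mode, offset=offset, in_place=in_place)
    return module

# Pre-bind the Gemma-specific settings once, reuse everywhere:
_patch_rms_norm_module_for_gemma = partial(_patch_rms_norm_module, casting_mode="gemma", offset=1.0)

patched = _patch_rms_norm_module_for_gemma({})
assert patched == {"casting_mode": "gemma", "offset": 1.0, "in_place": True}
```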
## Gemma2 Additional Differences
77+
- `in_place: false` (residual between sequential norms)
78+
- Extra norm layers: `pre_feedforward_layernorm`, `post_feedforward_layernorm`
79+
- Has `final_logit_softcapping` in config
80+
- Uses `LigerRMSNormForGemma2` variant
