Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
219 changes: 219 additions & 0 deletions .github/skills/new-model/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,219 @@
---
name: new-model
description: >-
Add support for a new model architecture to onnxruntime-genai's Python model
builder, and debug numerical parity / quality issues in the exported ONNX
model. Use when asked to "add a new model", "support <arch> in the model
builder", export a HuggingFace model to ONNX GenAI format, or when an exported
model produces garbage/incoherent output, fails ShapeInferenceError, or has
low logits correlation vs the reference PyTorch model. Covers builder
dispatch, the State/builder pattern, MoE/QMoE quantization encoding, and a
systematic parity-debugging workflow.
---

# Adding a New Model to onnxruntime-genai

This skill explains how to (1) add a new model architecture to the Python model
builder and (2) debug numerical parity problems in the exported model.

## 1. Orientation — key files

- `src/python/py/models/builder.py` — top-level entry. `create_model()` dispatches
on `config.architectures[0]` (the HF architecture string) to a concrete builder
class. New architectures are wired in here.
- `src/python/py/models/builders/base.py` — the `Model` base class. Contains all
shared graph-construction helpers (`make_*` methods), attention, MLP, MoE/QMoE,
rotary embedding, KV-cache, and quantization logic. **Read this first** — most
new models reuse 90%+ of it.
- `src/python/py/models/builders/<family>.py` — per-family subclasses (e.g.
`qwen.py`, `llama.py`, `gptoss.py`). A new model usually subclasses an existing
family builder or `Model` directly and overrides only what differs.
- `src/python/py/models/README.md` and `DESIGN.md` — supported models, extra
options, and design principles. Honor the `.github/instructions/python-model-builder.instructions.md`
rules (prefer `self.make_<op>` wrappers; reduce duplication via base-class reuse).

The C++ runtime side (only needed for new model *types*, not new arches that reuse
an existing type):
- `src/models/model_type.h` — canonical list of LLM/VLM/ALM type strings.
- `src/config.cpp` / `src/config.h` — `genai_config.json` schema.
- `src/models/` — `State` subclasses, processors, position inputs, caches.

## 2. Decide: is this really a *new* architecture?

Before writing any code, compare the new model's `config.json` to existing ones:

```bash
python -c "from transformers import AutoConfig; c=AutoConfig.from_pretrained('<hf_id>'); print(c.architectures, c.model_type)"
```

- If `architectures[0]` already matches an existing dispatch branch in
`builder.py`, the model may build with **zero code changes** — just try it.
(Example: a point-release that keeps the same arch class string as its
predecessor. Always test this hypothesis first; it saves days.)
- If only a few hyperparameters differ (rope theta, layer counts, head dims,
MoE expert counts), the existing builder usually flows them through `config`
unchanged. Add a new dispatch branch only when graph *structure* differs.

## 3. Adding the builder

1. **Dispatch**: add an `elif config.architectures[0] == "<ArchString>":` branch
in `create_model()` that instantiates your builder class.
2. **Builder class**: subclass the closest existing family builder. Override
`__init__` (to set arch-specific attrs) and the specific `make_*` methods that
differ (e.g. `make_attention`, `make_mlp`, `make_moe`, rotary cache builders).
3. **Reuse**: route every emitted node through the `self.make_<op>` wrappers so
shapes/values are registered. Do not hand-roll `make_node` + `make_value`.
4. **Weights**: the base loader iterates HF weights by name. If the new model
packs/repacks weights (e.g. interleaving gate/up for fused SwiGLU, or splitting
QKV), do the repack in the builder and keep names aligned with the op inputs.

### MoE / QMoE specifics (high-bug-risk area)

- `make_moe_op` emits `MoE` (fp16) or `QMoE` (int4/int8). `make_qmoe_weights`
quantizes and packs each expert weight `[N, K]`.
- **CUDA QMoE weight packing (requires `onnxruntime-gpu >= 1.27`):** the kernel is a
CUTLASS fpA_intB mixed GEMM that consumes offline-prepacked weights. The preferred
recipe (no TensorRT-LLM dependency):
1. quantize block-wise with `_symmetric_blockwise_quantize(weights, block_size)`;
2. call `pack_weights_for_cuda_mixed_gemm` (available in `onnxruntime-gpu >= 1.27`)
to produce the CUTLASS-interleaved layout the kernel expects;
3. stack experts → weights `[E, K, N/pack]`, scales `[E, N, K/block]`.
- **`qmoe_block_size` supports values 16, 32, 64, 128, or 256** (default 128 for
CUDA/TRT-RTX, 32 otherwise). `_symmetric_blockwise_quantize` pads the last
dimension to a multiple of `block_size` automatically.
- Emit the activation attributes that match the model's activation: for standard
SwiGLU `silu(gate)*up`, use `activation_alpha=1.0, activation_beta=0.0` and no
`swiglu_limit`. GPT-OSS-style clamped SwiGLU uses `alpha=1.702, beta=1.0,
swiglu_limit=7.0`. Wrong activation attrs silently degrade parity.
- `swiglu_fusion=1` expects gate/up **interleaved** `[g0,u0,g1,u1,...]`; repack HF
concatenated `[gate|up]` accordingly before quantizing.
- The router (gate) MatMul is **external** to QMoE and runs in fp16. To keep
routing exact under int4, add the router and any shared-expert-gate MatMul node
names to `quant_attrs["nodes_to_exclude"]`.

## 4. Exporting

Models are usually exported via an Olive recipe (`olive run --config <text.json>`)
or directly:

```bash
python -m onnxruntime_genai.models.builder -m <hf_id> -o <out_dir> -p int4 -e cuda \
--extra_options int4_block_size=128
```

After editing the builder in `src/python/py/models/`, the **installed** package is
what runs. Copy your edits to every install location, or reinstall:

```bash
SRC=src/python/py/models/builders/base.py
for d in $(python -c "import onnxruntime_genai,os;print(os.path.join(os.path.dirname(onnxruntime_genai.__file__),'models','builders'))"); do cp "$SRC" "$d/"; done
```

When re-exporting with Olive, clear its cache so the modelbuilder pass actually
re-runs: `rm -rf .olive-cache/<workflow>`.

## 5. Debugging parity / garbage output (systematic workflow)

When the model loads but generates garbage or has low logits correlation, isolate
the error layer-by-layer and component-by-component. **Do not guess** — bisect.

### Metrics: how `corr` and `relL2` are computed

Throughout this workflow two scalar metrics quantify how close a candidate tensor
`a` is to a reference tensor `b` (same shape, flattened):

- **`corr`** — Pearson correlation of the flattened tensors. Measures *shape /
direction* agreement and is scale-invariant, so it stays high even if `a` is a
uniformly scaled version of `b`. `corr ≈ 1.0` is good; `corr ≈ 0` means
uncorrelated (random) output.
- **`relL2`** — relative L2 (Euclidean) error, `‖a − b‖₂ / ‖b‖₂`. Measures
*magnitude* error and is NOT scale-invariant. `relL2 ≈ 0` is perfect; `relL2 ≈
√2 (~1.41)` for zero-mean tensors means `a` is effectively random w.r.t. `b`.

Use them together: high `corr` with high `relL2` points at a **scale/sign bug**
(right direction, wrong magnitude — e.g. dropped scale sign); low `corr` means the
values are scrambled (a layout/encoding bug). Example helper:

```python
import numpy as np
import torch

def corr(a, b):
a = np.asarray(a, dtype=np.float64).ravel()
b = np.asarray(b, dtype=np.float64).ravel()
return float(np.corrcoef(a, b)[0, 1])

def relL2(a, b):
a = torch.as_tensor(a, dtype=torch.float32)
b = torch.as_tensor(b, dtype=torch.float32)
return float(torch.norm(a - b) / (torch.norm(b) + 1e-9))

# Example: compare an ONNX op output against a PyTorch reference
print(f"corr={corr(onnx_out, ref):.4f} relL2={relL2(onnx_out, ref):.4f}")
```

When comparing next-token logits, additionally check that the **argmax matches**
(`int(a.argmax()) == int(b.argmax())`) — argmax agreement is what actually
determines greedy-decoding correctness.

### Step 0 — Establish ground truth
Run the HF model in PyTorch and capture next-token logits for a fixed prompt.
Good parity: ONNX-vs-HF next-token correlation ≥ 0.99 and matching argmax.

### Step 1 — fp16 first (separate structure from quantization)
Export an **fp16 (non-quantized)** variant and compare to HF.
- fp16 corr ≈ 1.0 ⟹ the graph, kernels, rotary embeddings, caches, and runtime
are all correct; any remaining problem is **purely quantization**.
- fp16 corr still low ⟹ a **structural** bug (wrong rotary, wrong cache wiring,
wrong attention/MLP graph, transposed weights). Fix this before touching quant.

### Step 2 — Layer-by-layer residual probe
Add the per-layer residual tensors (e.g. `/model/layers.{n}/input_layernorm/output`)
as extra graph outputs in **both** the int4 and fp16 models, feed identical
`inputs_embeds`, and compare each layer's `corr`/`relL2`. Find the first layer
where correlation collapses and which **layer type** (attention vs MoE vs linear-
attention) degrades fastest.

### Step 3 — Sub-component isolation
Within the worst layer, probe sub-tensors (attn output, post-attn residual, MoE
output, layer residual) int4-vs-fp16 with the **same inputs**. This pinpoints the
offending op (e.g. `moe_out relL2 0.4` ⟹ the QMoE path).

### Step 4 — Op-level standalone test
Reproduce the suspect op in isolation with small synthetic weights and compare:
- the ORT op output **vs a pure-PyTorch reference built from the ORIGINAL
(un-quantized) weights**. **Crucial:** do NOT validate a quantized op only
against a torch reference that uses the *same* (de)quantization — a sign/scale
bug can be self-consistent and pass while still corrupting real weights. Always
anchor against the original full-precision weights.
- Sweep encoding conventions (transpose y/n, offset 8/zp, signed vs abs scale,
unpack order) and pick the one that reconstructs the original weights with high
corr and low relL2.

### Step 5 — Weight-fidelity sanity check
Quantize→dequantize a real weight matrix and measure `relL2` vs the original.
`relL2 ≈ √2` (~1.41) means the dequant is **random** (encoding bug), not merely
lossy. Healthy int4 RTN is ~0.05–0.15. This quickly distinguishes an encoding bug
from genuine quantization quality loss.

### Common real bugs (seen in practice)
- **abs() on signed quant scales** → garbage that looks like "int4 quality loss".
- **Wrong CUTLASS `force_arch`** in prepacking → finite but wrong output.
- **Duplicate node/value names** (e.g. two `.../Mul` outputs in a shared-expert
subgraph) → `ShapeInferenceError` / wrong tensor picked downstream. Give each
emitted node a unique name.
- **Missing/incorrect activation attrs** (`activation_alpha/beta`, `swiglu_limit`)
→ degraded but not random.
- **Quantizing routing-critical tiny matmuls** (router, gate) → wrong expert
selection; exclude them from int4.
- **Running Python from a directory that shadows the installed package** (e.g. a
local `onnxruntime/` folder) → stale code. Run from a neutral dir.

## 6. Validation checklist
- [ ] fp16 export: next-token corr ≥ 0.99 vs HF, coherent greedy text.
- [ ] int4 export: per-layer corr stays high; final next-token corr ≥ ~0.95.
- [ ] Coherent end-to-end generation (text and, for VLMs, image prompts).
- [ ] Memory footprint fits the target GPU.
- [ ] TTFT / tokens-per-second benchmarked.
- [ ] Builder edits copied to all install locations; Olive cache cleared before
re-export.
Loading