-
Notifications
You must be signed in to change notification settings - Fork 312
Add agent skill for creating new model #2206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,229 @@ | ||
| --- | ||
| name: new-model | ||
| description: >- | ||
| Add support for a new model architecture to onnxruntime-genai's Python model | ||
| builder, and debug numerical parity / quality issues in the exported ONNX | ||
| model. Use when asked to "add a new model", "support <arch> in the model | ||
| builder", export a HuggingFace model to ONNX GenAI format, or when an exported | ||
| model produces garbage/incoherent output, fails ShapeInferenceError, or has | ||
| low logits correlation vs the reference PyTorch model. Covers builder | ||
| dispatch, the State/builder pattern, MoE/QMoE quantization encoding, and a | ||
| systematic parity-debugging workflow. | ||
| --- | ||
|
|
||
| # Adding a New Model to onnxruntime-genai | ||
|
|
||
| This skill explains how to (1) add a new model architecture to the Python model | ||
| builder and (2) debug numerical parity problems in the exported model. | ||
|
|
||
| ## 1. Orientation — key files | ||
|
|
||
| - `src/python/py/models/builder.py` — top-level entry. `create_model()` dispatches | ||
| on `config.architectures[0]` (the HF architecture string) to a concrete builder | ||
| class. New architectures are wired in here. | ||
| - `src/python/py/models/builders/base.py` — the `Model` base class. Contains all | ||
| shared graph-construction helpers (`make_*` methods), attention, MLP, MoE/QMoE, | ||
| rotary embedding, KV-cache, and quantization logic. **Read this first** — most | ||
| new models reuse 90%+ of it. | ||
| - `src/python/py/models/builders/<family>.py` — per-family subclasses (e.g. | ||
| `qwen.py`, `llama.py`, `gptoss.py`). A new model usually subclasses an existing | ||
| family builder or `Model` directly and overrides only what differs. | ||
| - `src/python/py/models/README.md` and `DESIGN.md` — supported models, extra | ||
| options, and design principles. Honor the `.github/instructions/python-model-builder.instructions.md` | ||
| rules (prefer `self.make_<op>` wrappers; reduce duplication via base-class reuse). | ||
|
|
||
| The C++ runtime side (only needed for new model *types*, not new arches that reuse | ||
| an existing type): | ||
| - `src/models/model_type.h` — canonical list of LLM/VLM/ALM type strings. | ||
| - `src/config.cpp` / `src/config.h` — `genai_config.json` schema. | ||
| - `src/models/` — `State` subclasses, processors, position inputs, caches. | ||
|
|
||
| ## 2. Decide: is this really a *new* architecture? | ||
|
|
||
| Before writing any code, compare the new model's `config.json` to existing ones: | ||
|
|
||
| ```bash | ||
| python -c "from transformers import AutoConfig; c=AutoConfig.from_pretrained('<hf_id>'); print(c.architectures, c.model_type)" | ||
| ``` | ||
|
|
||
| - If `architectures[0]` already matches an existing dispatch branch in | ||
| `builder.py`, the model may build with **zero code changes** — just try it. | ||
| (Example: a point-release that keeps the same arch class string as its | ||
| predecessor. Always test this hypothesis first; it saves days.) | ||
| - If only a few hyperparameters differ (rope theta, layer counts, head dims, | ||
| MoE expert counts), the existing builder usually flows them through `config` | ||
| unchanged. Add a new dispatch branch only when graph *structure* differs. | ||
|
|
||
| ## 3. Adding the builder | ||
|
|
||
| 1. **Dispatch**: add an `elif config.architectures[0] == "<ArchString>":` branch | ||
| in `create_model()` that instantiates your builder class. | ||
| 2. **Builder class**: subclass the closest existing family builder. Override | ||
| `__init__` (to set arch-specific attrs) and the specific `make_*` methods that | ||
| differ (e.g. `make_attention`, `make_mlp`, `make_moe`, rotary cache builders). | ||
| 3. **Reuse**: route every emitted node through the `self.make_<op>` wrappers so | ||
| shapes/values are registered. Do not hand-roll `make_node` + `make_value`. | ||
| 4. **Weights**: the base loader iterates HF weights by name. If the new model | ||
| packs/repacks weights (e.g. interleaving gate/up for fused SwiGLU, or splitting | ||
| QKV), do the repack in the builder and keep names aligned with the op inputs. | ||
|
|
||
| ### MoE / QMoE specifics (high-bug-risk area) | ||
|
|
||
| - `make_moe_op` emits `MoE` (fp16) or `QMoE` (int4/int8). `make_qmoe_weights` | ||
| quantizes and packs each expert weight `[N, K]`. | ||
| - **CUDA QMoE weight encoding (critical):** the kernel is a CUTLASS fpA_intB | ||
| mixed GEMM that consumes **offline-prepacked** weights. The proven recipe | ||
| (see `_cutlass_prepacked_blockwise_quantize` in `base.py`): | ||
| 1. transpose weight to `[K, N]`; | ||
| 2. `onnxruntime...quantize_matmul_4bits(qw, w_T, scales, zp, block, N, K, is_symmetric=True)`; | ||
| 3. **keep the SIGNED scales** — do NOT `abs()` them. The kernel dequantizes as | ||
| `(q - 8) * scale`, and `quantize_matmul_4bits` folds the block-anchor sign | ||
| into the scale. Taking `abs()` corrupts every block whose anchor is negative | ||
| and produces garbage (this is a real bug that masquerades as "int4 quality | ||
| loss"); | ||
| 4. `pack_weights_for_cuda_mixed_gemm(qw_reshaped, N, K, bits, force_arch=80)` — | ||
| **always force_arch=80**: all int4 QMoE prepacking assumes the SM80-style | ||
| interleaved layout, which is correct for every SM ≥ 80 (Ampere/Ada/Hopper, | ||
| incl. RTX 4090 = SM89 and H100/H200 = SM90); | ||
| 5. reshape to `[K, N/pack]`. Stack experts → weights `[E, K, N/pack]`, scales | ||
| `[E, N, K/block]`. | ||
| - The QMoE node then uses the **default** `weights_prepacked` (omit the attribute; | ||
| default = prepacked). Do **not** set `weights_prepacked=0` (the raw-weight + | ||
| runtime-PrePack-hook path is finiteness-checked only and is not bit-correct). | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removed the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The weights_prepacked attribute is a new attribute added in latest onnxruntime: microsoft/onnxruntime#28749. |
||
| - **CUDA QMoE only supports `block_size` 64 or 128.** Assert this in the builder. | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is limitation (block size 64/128) of cuda QMoE op. See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md for details.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed — updated the bullet to state that |
||
| - Emit the activation attributes that match the model's activation: for standard | ||
| SwiGLU `silu(gate)*up`, use `activation_alpha=1.0, activation_beta=0.0` and no | ||
| `swiglu_limit`. GPT-OSS-style clamped SwiGLU uses `alpha=1.702, beta=1.0, | ||
| swiglu_limit=7.0`. Wrong activation attrs silently degrade parity. | ||
| - `swiglu_fusion=1` expects gate/up **interleaved** `[g0,u0,g1,u1,...]`; repack HF | ||
| concatenated `[gate|up]` accordingly before quantizing. | ||
| - The router (gate) MatMul is **external** to QMoE and runs in fp16. To keep | ||
| routing exact under int4, add the router and any shared-expert-gate MatMul node | ||
| names to `quant_attrs["nodes_to_exclude"]`. | ||
|
|
||
| ## 4. Exporting | ||
|
|
||
| Models are usually exported via an Olive recipe (`olive run --config <text.json>`) | ||
| or directly: | ||
|
|
||
| ```bash | ||
| python -m onnxruntime_genai.models.builder -m <hf_id> -o <out_dir> -p int4 -e cuda \ | ||
| --extra_options int4_block_size=128 | ||
| ``` | ||
|
|
||
| After editing the builder in `src/python/py/models/`, the **installed** package is | ||
| what runs. Copy your edits to every install location, or reinstall: | ||
|
|
||
| ```bash | ||
| SRC=src/python/py/models/builders/base.py | ||
| for d in $(python -c "import onnxruntime_genai,os;print(os.path.join(os.path.dirname(onnxruntime_genai.__file__),'models','builders'))"); do cp "$SRC" "$d/"; done | ||
| ``` | ||
|
|
||
| When re-exporting with Olive, clear its cache so the modelbuilder pass actually | ||
| re-runs: `rm -rf .olive-cache/<workflow>`. | ||
|
|
||
| ## 5. Debugging parity / garbage output (systematic workflow) | ||
|
|
||
| When the model loads but generates garbage or has low logits correlation, isolate | ||
| the error layer-by-layer and component-by-component. **Do not guess** — bisect. | ||
|
|
||
| ### Metrics: how `corr` and `relL2` are computed | ||
|
|
||
| Throughout this workflow two scalar metrics quantify how close a candidate tensor | ||
| `a` is to a reference tensor `b` (same shape, flattened): | ||
|
|
||
| - **`corr`** — Pearson correlation of the flattened tensors. Measures *shape / | ||
| direction* agreement and is scale-invariant, so it stays high even if `a` is a | ||
| uniformly scaled version of `b`. `corr ≈ 1.0` is good; `corr ≈ 0` means | ||
| uncorrelated (random) output. | ||
| - **`relL2`** — relative L2 (Euclidean) error, `‖a − b‖₂ / ‖b‖₂`. Measures | ||
| *magnitude* error and is NOT scale-invariant. `relL2 ≈ 0` is perfect; `relL2 ≈ | ||
| √2 (~1.41)` for zero-mean tensors means `a` is effectively random w.r.t. `b`. | ||
|
|
||
| Use them together: high `corr` with high `relL2` points at a **scale/sign bug** | ||
| (right direction, wrong magnitude — e.g. dropped scale sign); low `corr` means the | ||
| values are scrambled (a layout/encoding bug). Example helper: | ||
|
|
||
| ```python | ||
| import numpy as np | ||
| import torch | ||
|
|
||
| def corr(a, b): | ||
| a = np.asarray(a, dtype=np.float64).ravel() | ||
| b = np.asarray(b, dtype=np.float64).ravel() | ||
| return float(np.corrcoef(a, b)[0, 1]) | ||
|
|
||
| def relL2(a, b): | ||
| a = torch.as_tensor(a, dtype=torch.float32) | ||
| b = torch.as_tensor(b, dtype=torch.float32) | ||
| return float(torch.norm(a - b) / (torch.norm(b) + 1e-9)) | ||
|
|
||
| # Example: compare an ONNX op output against a PyTorch reference | ||
| print(f"corr={corr(onnx_out, ref):.4f} relL2={relL2(onnx_out, ref):.4f}") | ||
| ``` | ||
|
|
||
| When comparing next-token logits, additionally check that the **argmax matches** | ||
| (`int(a.argmax()) == int(b.argmax())`) — argmax agreement is what actually | ||
| determines greedy-decoding correctness. | ||
|
|
||
| ### Step 0 — Establish ground truth | ||
| Run the HF model in PyTorch and capture next-token logits for a fixed prompt. | ||
| Good parity: ONNX-vs-HF next-token correlation ≥ 0.99 and matching argmax. | ||
|
|
||
| ### Step 1 — fp16 first (separate structure from quantization) | ||
| Export an **fp16 (non-quantized)** variant and compare to HF. | ||
| - fp16 corr ≈ 1.0 ⟹ the graph, kernels, rotary embeddings, caches, and runtime | ||
| are all correct; any remaining problem is **purely quantization**. | ||
| - fp16 corr still low ⟹ a **structural** bug (wrong rotary, wrong cache wiring, | ||
| wrong attention/MLP graph, transposed weights). Fix this before touching quant. | ||
|
|
||
| ### Step 2 — Layer-by-layer residual probe | ||
| Add the per-layer residual tensors (e.g. `/model/layers.{n}/input_layernorm/output`) | ||
| as extra graph outputs in **both** the int4 and fp16 models, feed identical | ||
| `inputs_embeds`, and compare each layer's `corr`/`relL2`. Find the first layer | ||
| where correlation collapses and which **layer type** (attention vs MoE vs linear- | ||
| attention) degrades fastest. | ||
|
|
||
| ### Step 3 — Sub-component isolation | ||
| Within the worst layer, probe sub-tensors (attn output, post-attn residual, MoE | ||
| output, layer residual) int4-vs-fp16 with the **same inputs**. This pinpoints the | ||
| offending op (e.g. `moe_out relL2 0.4` ⟹ the QMoE path). | ||
|
|
||
| ### Step 4 — Op-level standalone test | ||
| Reproduce the suspect op in isolation with small synthetic weights and compare: | ||
| - the ORT op output **vs a pure-PyTorch reference built from the ORIGINAL | ||
| (un-quantized) weights**. **Crucial:** do NOT validate a quantized op only | ||
| against a torch reference that uses the *same* (de)quantization — a sign/scale | ||
| bug can be self-consistent and pass while still corrupting real weights. Always | ||
| anchor against the original full-precision weights. | ||
| - Sweep encoding conventions (transpose y/n, offset 8/zp, signed vs abs scale, | ||
| unpack order) and pick the one that reconstructs the original weights with high | ||
| corr and low relL2. | ||
|
|
||
| ### Step 5 — Weight-fidelity sanity check | ||
| Quantize→dequantize a real weight matrix and measure `relL2` vs the original. | ||
| `relL2 ≈ √2` (~1.41) means the dequant is **random** (encoding bug), not merely | ||
| lossy. Healthy int4 RTN is ~0.05–0.15. This quickly distinguishes an encoding bug | ||
| from genuine quantization quality loss. | ||
|
|
||
| ### Common real bugs (seen in practice) | ||
| - **abs() on signed quant scales** → garbage that looks like "int4 quality loss". | ||
| - **Wrong CUTLASS `force_arch`** in prepacking → finite but wrong output. | ||
| - **Duplicate node/value names** (e.g. two `.../Mul` outputs in a shared-expert | ||
| subgraph) → `ShapeInferenceError` / wrong tensor picked downstream. Give each | ||
| emitted node a unique name. | ||
| - **Missing/incorrect activation attrs** (`activation_alpha/beta`, `swiglu_limit`) | ||
| → degraded but not random. | ||
| - **Quantizing routing-critical tiny matmuls** (router, gate) → wrong expert | ||
| selection; exclude them from int4. | ||
| - **Running Python from a directory that shadows the installed package** (e.g. a | ||
| local `onnxruntime/` folder) → stale code. Run from a neutral dir. | ||
|
|
||
| ## 6. Validation checklist | ||
| - [ ] fp16 export: next-token corr ≥ 0.99 vs HF, coherent greedy text. | ||
| - [ ] int4 export: per-layer corr stays high; final next-token corr ≥ ~0.95. | ||
| - [ ] Coherent end-to-end generation (text and, for VLMs, image prompts). | ||
| - [ ] Memory footprint fits the target GPU. | ||
| - [ ] TTFT / tokens-per-second benchmarked. | ||
| - [ ] Builder edits copied to all install locations; Olive cache cleared before | ||
| re-export. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pack_weights_for_cuda_mixed_gemm existed in onnxruntime-gpu package >= 1.27. It is preferred way to pack weights (no dependency on tensorrt-llm). See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in the latest commit. Removed the references to
_cutlass_prepacked_blockwise_quantize,quantize_matmul_4bits,force_arch=80, and the signed-scale claim. Keptpack_weights_for_cuda_mixed_gemm(fromonnxruntime-gpu >= 1.27) as the preferred, TRT-LLM-free packing approach, and updated the recipe to reflect the actual_symmetric_blockwise_quantize+pack_weights_for_cuda_mixed_gemmflow.