Add agent skill for creating new model #2206

Draft
tianleiwu wants to merge 2 commits into main from tlwu/new-model-skills

@tianleiwu

Contributor

Add a skill for agents to create a new model.

Copilot AI review requested due to automatic review settings June 9, 2026 23:13

Copilot AI left a comment

Contributor

Pull request overview

Adds a new agent skill document intended to guide contributors/agents through adding support for new HuggingFace model architectures in the Python model builder and debugging ONNX-vs-PyTorch numerical parity issues.

Changes:

  • Introduces .github/skills/new-model/SKILL.md with file orientation, builder dispatch guidance, and an export/debugging workflow.
  • Documents MoE/QMoE-related implementation notes and common parity failure modes.
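
For orientation, an ONNX-vs-PyTorch parity check of the kind the overview mentions can be reduced to comparing two output arrays. The sketch below is illustrative only: `parity_report` is a hypothetical helper name, the tolerances are arbitrary assumptions, and nothing here is taken from `SKILL.md` itself.

```python
import numpy as np


def parity_report(reference, candidate, atol=1e-3, rtol=1e-3):
    """Summarize how far `candidate` (e.g. ONNX output) drifts from `reference`.

    Hypothetical helper for illustration; tolerances are assumptions.
    """
    ref = np.asarray(reference, dtype=np.float32)
    cand = np.asarray(candidate, dtype=np.float32)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")

    abs_diff = np.abs(ref - cand)
    worst = int(abs_diff.argmax())  # flat index of the largest deviation
    return {
        "max_abs_diff": float(abs_diff.max()),
        "worst_index": tuple(int(i) for i in np.unravel_index(worst, ref.shape)),
        "has_nonfinite": bool(not np.isfinite(cand).all()),
        "allclose": bool(np.allclose(cand, ref, atol=atol, rtol=rtol)),
    }
```

Reporting the index of the worst element, rather than only a pass/fail flag, makes it easier to tell a localized failure (one expert, one head) from a uniform drift.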

Comment thread .github/skills/new-model/SKILL.md Outdated
Comment on lines +72 to +80
- `make_moe_op` emits `MoE` (fp16) or `QMoE` (int4/int8). `make_qmoe_weights`
quantizes and packs each expert weight `[N, K]`.
- **CUDA QMoE weight encoding (critical):** the kernel is a CUTLASS fpA_intB
mixed GEMM that consumes **offline-prepacked** weights. The proven recipe
(see `_cutlass_prepacked_blockwise_quantize` in `base.py`):
1. transpose weight to `[K, N]`;
2. `onnxruntime...quantize_matmul_4bits(qw, w_T, scales, zp, block, N, K, is_symmetric=True)`;
3. **keep the SIGNED scales** — do NOT `abs()` them. The kernel dequantizes as
`(q - 8) * scale`, and `quantize_matmul_4bits` folds the block-anchor sign

Contributor Author

pack_weights_for_cuda_mixed_gemm is available in the onnxruntime-gpu package >= 1.27. It is the preferred way to pack weights (no dependency on tensorrt-llm). See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md

Contributor

Fixed in the latest commit. Removed the references to _cutlass_prepacked_blockwise_quantize, quantize_matmul_4bits, force_arch=80, and the signed-scale claim. Kept pack_weights_for_cuda_mixed_gemm (from onnxruntime-gpu >= 1.27) as the preferred, TRT-LLM-free packing approach, and updated the recipe to reflect the actual _symmetric_blockwise_quantize + pack_weights_for_cuda_mixed_gemm flow.
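
For readers following the thread, symmetric blockwise 4-bit quantization can be sketched in NumPy roughly as follows. This is an illustration under stated assumptions, not the builder's `_symmetric_blockwise_quantize`: the function names are hypothetical, the zero point is assumed fixed at 8, and the packing step done by `pack_weights_for_cuda_mixed_gemm` is deliberately left out because its signature is not shown in this conversation.

```python
import numpy as np


def symmetric_blockwise_quantize_sketch(weight, block_size=128):
    """Quantize a [N, K] float matrix to 4-bit codes with one scale per K-block.

    Hypothetical helper: zero point fixed at 8, scales always positive.
    """
    n, k = weight.shape
    pad = (-k) % block_size  # pad K up to a multiple of block_size
    padded = np.pad(weight.astype(np.float32), ((0, 0), (0, pad)))
    blocks = padded.reshape(n, -1, block_size)

    # Map the largest magnitude in each block to 7, so codes land in [1, 15].
    max_abs = np.abs(blocks).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)

    codes = np.clip(np.rint(blocks / scales) + 8, 0, 15).astype(np.uint8)
    return codes.reshape(n, -1), scales.squeeze(-1)


def dequantize_sketch(codes, scales, block_size, k):
    """Invert the sketch above: (q - 8) * scale, then drop the K padding."""
    n = codes.shape[0]
    blocks = codes.reshape(n, -1, block_size).astype(np.float32)
    return ((blocks - 8.0) * scales[..., None]).reshape(n, -1)[:, :k]
```

A round trip through both functions should reproduce each weight to within half a scale step, which gives a quick sanity check before any kernel-specific packing is applied.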

Comment thread .github/skills/new-model/SKILL.md Outdated
- The QMoE node then uses the **default** `weights_prepacked` (omit the attribute;
default = prepacked). Do **not** set `weights_prepacked=0` (the raw-weight +
runtime-PrePack-hook path is finiteness-checked only and is not bit-correct).
- **CUDA QMoE only supports `block_size` 64 or 128.** Assert this in the builder.

Contributor Author

This is a limitation of the CUDA QMoE op (block size 64 or 128). See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md for details.

Contributor

Fixed — updated the bullet to state that qmoe_block_size supports values 16, 32, 64, 128, or 256 (default 128 for CUDA/TRT-RTX, 32 otherwise), and that _symmetric_blockwise_quantize pads automatically.
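
The block-size rule described in this reply could be checked in a builder along these lines. This is a minimal sketch, assuming the allowed values and defaults quoted in the reply; `resolve_qmoe_block_size`, `padded_k`, and the execution-provider strings are hypothetical names, and the stricter 64/128 limit of the CUDA QMoE op mentioned earlier in the thread is not enforced here.

```python
from typing import Optional

# Values quoted in the reply; treat them as assumptions, not a verified spec.
ALLOWED_QMOE_BLOCK_SIZES = (16, 32, 64, 128, 256)
CUDA_LIKE_EPS = ("cuda", "trt-rtx")  # hypothetical EP identifiers


def resolve_qmoe_block_size(ep: str, requested: Optional[int] = None) -> int:
    """Return the block size to use, applying the per-EP default when unset."""
    if requested is None:
        return 128 if ep in CUDA_LIKE_EPS else 32
    if requested not in ALLOWED_QMOE_BLOCK_SIZES:
        raise ValueError(
            f"qmoe_block_size={requested} is not one of {ALLOWED_QMOE_BLOCK_SIZES}"
        )
    return requested


def padded_k(k: int, block_size: int) -> int:
    """Round K up to the next multiple of block_size (ceiling division)."""
    return -(-k // block_size) * block_size
```

Failing early with a `ValueError` keeps an unsupported block size from surfacing later as a silent numerical mismatch.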

Comment thread .github/skills/new-model/SKILL.md Outdated
Comment on lines +90 to +92
- The QMoE node then uses the **default** `weights_prepacked` (omit the attribute;
default = prepacked). Do **not** set `weights_prepacked=0` (the raw-weight +
runtime-PrePack-hook path is finiteness-checked only and is not bit-correct).

Contributor

Removed the weights_prepacked bullet entirely — make_qmoe_op does not set that attribute and it has no presence in the codebase.

Contributor Author

The weights_prepacked attribute is new; it was added in the latest onnxruntime: microsoft/onnxruntime#28749.

@tianleiwu tianleiwu marked this pull request as draft June 10, 2026 01:12