16 changes: 14 additions & 2 deletions src/python/py/models/README.md
@@ -232,7 +232,7 @@ Note that this is the same as outputting embeddings since the last hidden states

#### Enable Shared Embeddings

This scenario is for when you want to share weights between the embedding layer and the language modeling head. Sharing reduces model size and can improve memory efficiency, which is especially useful for models with tied embeddings (where `tie_word_embeddings=true` in config.json). Shared embeddings are enabled automatically when `tie_word_embeddings=true` in the model's config.json, and this can be overridden with `shared_embeddings=false`. They cannot be combined with `exclude_embeds=true` or `exclude_lm_head=true`.
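
The rules in the paragraph above can be sketched in a few lines of Python. This is an illustrative sketch only, not the builder's actual code: the helper name `resolve_shared_embeddings` and the plain-dict inputs are assumptions, and only the option names come from the text. The text does not say how a conflict is reported, so raising on an explicit request and quietly falling back when the setting was only inferred is this sketch's own choice.

```python
def resolve_shared_embeddings(hf_config: dict, extra_options: dict) -> bool:
    """Hypothetical helper: decide whether shared embeddings end up enabled."""
    explicit = "shared_embeddings" in extra_options
    # The default follows tie_word_embeddings from config.json;
    # an explicit shared_embeddings option overrides it.
    if explicit:
        shared = bool(extra_options["shared_embeddings"])
    else:
        shared = bool(hf_config.get("tie_word_embeddings", False))

    # Sharing needs both the embedding layer and the LM head in the graph.
    excluded = bool(extra_options.get("exclude_embeds", False)) or bool(
        extra_options.get("exclude_lm_head", False)
    )
    if shared and excluded:
        if explicit:
            # Assumption: an explicit, contradictory request is an error.
            raise ValueError(
                "shared_embeddings cannot be combined with "
                "exclude_embeds=true or exclude_lm_head=true"
            )
        # Assumption: an inferred default is dropped silently.
        return False
    return shared
```

For example, under these assumptions a model with tied embeddings and no extra options resolves to `True`, and adding `shared_embeddings=false` resolves to `False`.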

##### Option 1: INT4 (for RTN and K-Quant)
```
@@ -272,7 +272,7 @@ python builder.py -m model_name -o path_to_output_folder -p fp16 -e cuda --extra

#### Disable QKV Projections Fusion

This scenario is for when you want to keep Q/K/V projections in the attention layer separate instead of fusing them into a single packed MatMul operation.

```
# From wheel:
@@ -282,6 +282,18 @@ python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o pa
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qkv_fusion=true
```

#### Disable QK Norm Fusion

This scenario is for when you want to emit explicit Q/K normalization nodes instead of passing Q/K norm weights directly into GroupQueryAttention.

```
# From wheel:
python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qk_norm_fusion=true

# From source:
python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qk_norm_fusion=true
```

#### Enable CUDA Graph

This scenario is for when you want to enable CUDA graph for your ONNX model.
28 changes: 20 additions & 8 deletions src/python/py/models/builder.py
@@ -75,7 +75,11 @@ def check_extra_options(kv_pairs, execution_provider):
"shared_embeddings",
"hf_remote",
"disable_qkv_fusion",
"disable_qk_norm_fusion",
"prune_lm_head",
"last_matmul_weight_int8",
"int8_mixed_layers",
"int8_linear_attn",
]
for key in bools:
if key in kv_pairs:
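
The body of this loop lies outside the hunk shown here. As a rough illustration of what normalizing boolean extra options can look like, the sketch below converts string values such as `disable_qk_norm_fusion=true` into Python booleans. The helper name `parse_bool_options`, the accepted spellings, and the error message are all assumptions for illustration, not the code in `builder.py`.

```python
def parse_bool_options(kv_pairs: dict, bool_keys: list) -> dict:
    """Hypothetical helper: convert string-valued boolean options to bool."""
    truthy = {"true", "1"}
    falsy = {"false", "0"}
    parsed = dict(kv_pairs)  # leave the caller's dict untouched
    for key in bool_keys:
        if key not in parsed:
            continue
        value = str(parsed[key]).strip().lower()
        if value in truthy:
            parsed[key] = True
        elif value in falsy:
            parsed[key] = False
        else:
            raise ValueError(f"{key} must be true or false, got {parsed[key]!r}")
    return parsed
```

Keys that are not listed as booleans pass through unchanged, so numeric or string options such as `int4_block_size` are left for other checks.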
@@ -415,8 +419,9 @@ def get_args():
Default is 4 for the CPU EP and 0 for non-CPU EPs.
int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization (MatMulNBits).
Default value is 32.
qmoe_block_size = 16/32/64/128/256: Specify the block size for QMoE expert weights quantization.
Default is 128 for CUDA and TRT-RTX, 32 for others. Supported EPs: CPU, CUDA, WebGPU, TRT-RTX.
qmoe_block_size = -1/0/32/64/128: Specify the block size for QMoE expert weights quantization.
Set <= 0 for per-channel quantization. Default is 128 for TRT-RTX, 32 for others.
Supported EPs: CPU, CUDA, WebGPU, TRT-RTX.
int4_is_symmetric = Quantize the weights symmetrically. Default is true.
If true, quantization is done to int4. If false, quantization is done to uint4.
int4_op_types_to_quantize = MatMul/Gather: Specify op types to target for int4 quantization.
Expand All @@ -425,15 +430,22 @@ def get_args():
int4_nodes_to_exclude = Specify nodes to exclude from int4 quantization.
Use this option when you want to exclude certain nodes from being quantized.
Separate the node names with a ',' when passing them here (e.g. int4_nodes_to_exclude=/lm_head/MatMul,/model/embed_tokens/Gather)
int4_algo_config = Method for int4 quantization. Default is 'default'.
Currently supported options are: 'default', 'rtn', 'rtn_last', 'k_quant', 'k_quant_mixed', 'k_quant_last', 'k_quant_linear'.
int4_algo_config = Base method for int4 quantization. Default is 'default'.
Currently supported base methods are: 'default', 'rtn', 'k_quant'.
default = algo_config passed to MatMulNBitsQuantizer is None. Quantizer uses default RTN algorithm. All MatMuls are quantized as int4 (node naming conventions differ from `rtn`).
rtn = RTN algorithm for int4 quantization.
rtn_last = RTN algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
k_quant = k_quant algorithm for int4 quantization.
k_quant_mixed = k_quant algorithm with mixed precision (int4 + int8).
k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
k_quant_linear = k_quant algorithm with linear attention layer projections and MLPs promoted to int8 (for hybrid attention models like Qwen3.5).
The following legacy compound values are still accepted as aliases (base method + int8 placement flags):
rtn_last = rtn + last_matmul_weight_int8=true.
k_quant_last = k_quant + last_matmul_weight_int8=true.
k_quant_mixed = k_quant + last_matmul_weight_int8=true + int8_mixed_layers=true.
k_quant_linear = k_quant + last_matmul_weight_int8=true + int8_linear_attn=true.
last_matmul_weight_int8 = Quantize the last MatMul (e.g. /lm_head/MatMul) as int8 instead of int4. Default is false.
Orthogonal to int4_algo_config; can be combined with any base method ('default', 'rtn', 'k_quant').
int8_mixed_layers = Promote the most quantization-sensitive MatMuls (llama.cpp mixed strategy: first/last eighth of layers plus every third layer's qkv_proj/v_proj/down_proj) to int8. Default is false.
Orthogonal to int4_algo_config; can be combined with any base method.
int8_linear_attn = Promote linear-attention projections and their MLPs to int8 (for hybrid attention models like Qwen3.5). Default is false.
Orthogonal to int4_algo_config; can be combined with any base method.
shared_embeddings = Enable weight sharing between embedding and LM head layers. Default is false.
Use this option to share weights and reduce model size by eliminating duplicate weights.
For quantized models (INT4/UINT4): Shares quantized weights using GatherBlockQuantized. Only works with rtn and k_quant algorithms, and cannot be used if LM head is excluded.
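
The help text above describes the legacy compound values of `int4_algo_config` as a base method plus int8 placement flags. A minimal sketch of that expansion follows. The table mirrors the alias list in the help text, but the function name `expand_int4_algo_config`, the dict-based interface, and the choice to OR explicitly passed flags with the flags implied by an alias are assumptions, not the PR's implementation.

```python
BASE_METHODS = {"default", "rtn", "k_quant"}
INT8_FLAGS = ("last_matmul_weight_int8", "int8_mixed_layers", "int8_linear_attn")

# Legacy compound value -> (base method, flags it implies), as listed in the help text.
LEGACY_ALIASES = {
    "rtn_last": ("rtn", {"last_matmul_weight_int8"}),
    "k_quant_last": ("k_quant", {"last_matmul_weight_int8"}),
    "k_quant_mixed": ("k_quant", {"last_matmul_weight_int8", "int8_mixed_layers"}),
    "k_quant_linear": ("k_quant", {"last_matmul_weight_int8", "int8_linear_attn"}),
}


def expand_int4_algo_config(extra_options: dict) -> dict:
    """Hypothetical helper: rewrite a legacy alias as base method + flags."""
    resolved = dict(extra_options)
    algo = resolved.get("int4_algo_config", "default")
    base, implied = LEGACY_ALIASES.get(algo, (algo, set()))
    if base not in BASE_METHODS:
        raise ValueError(f"Unsupported int4_algo_config: {algo!r}")
    resolved["int4_algo_config"] = base
    for flag in INT8_FLAGS:
        # Assumption: a flag is on if the user set it or the alias implies it.
        resolved[flag] = bool(resolved.get(flag, False)) or flag in implied
    return resolved
```

Under these assumptions, `int4_algo_config=k_quant_mixed` resolves to the base method `k_quant` with `last_matmul_weight_int8` and `int8_mixed_layers` both enabled, which matches the alias table in the help text.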