microsoft · tianleiwu · Jun 19, 2026 · Jun 20, 2026
@@ -232,7 +232,7 @@ Note that this is the same as outputting embeddings since the last hidden states
 
 #### Enable Shared Embeddings
 
-This scenario is for when you want to enable weight sharing between the embedding layer and the language modeling head. This reduces model size and can improve memory efficiency, especially useful for models with tied embeddings (where `tie_word_embeddings=true` in config.json). Shared embeddings are automatically enabled if `tie_word_embeddings=true` in the model's config.json (can be overridden with `shared_embeddings=false`), but cannot be used with `exclude_embeds=true` or `exclude_lm_head=true`. 
+This scenario is for when you want to enable weight sharing between the embedding layer and the language modeling head. This reduces model size and can improve memory efficiency, especially useful for models with tied embeddings (where `tie_word_embeddings=true` in config.json). Shared embeddings are automatically enabled if `tie_word_embeddings=true` in the model's config.json (can be overridden with `shared_embeddings=false`), but cannot be used with `exclude_embeds=true` or `exclude_lm_head=true`.
 
 ##### Option 1: INT4 (for RTN and K-Quant)
 ```
@@ -272,7 +272,7 @@ python builder.py -m model_name -o path_to_output_folder -p fp16 -e cuda --extra
 
 #### Disable QKV Projections Fusion
 
-This scenario is for when you want to keep Q/K/V projections in the attention layer separate instead of fusing them into a single packed MatMul operation. 
+This scenario is for when you want to keep Q/K/V projections in the attention layer separate instead of fusing them into a single packed MatMul operation.
 
 ```
 # From wheel:
@@ -282,6 +282,18 @@ python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o pa
 python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qkv_fusion=true
 ```
 
+#### Disable QK Norm Fusion
+
+This scenario is for when you want to emit explicit Q/K normalization nodes instead of passing Q/K norm weights directly into GroupQueryAttention.
+
+```
+# From wheel:
+python -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qk_norm_fusion=true
+
+# From source:
+python builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_store_temp_files --extra_options disable_qk_norm_fusion=true
+```
+
 #### Enable CUDA Graph
 
 This scenario is for when you want to enable CUDA graph for your ONNX model.

@@ -75,7 +75,11 @@ def check_extra_options(kv_pairs, execution_provider):
         "shared_embeddings",
         "hf_remote",
         "disable_qkv_fusion",
+        "disable_qk_norm_fusion",
         "prune_lm_head",
+        "last_matmul_weight_int8",
+        "int8_mixed_layers",
+        "int8_linear_attn",
     ]
     for key in bools:
         if key in kv_pairs:
@@ -415,8 +419,9 @@ def get_args():
                     Default is 4 for the CPU EP and 0 for non-CPU EPs.
                 int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization (MatMulNBits).
                     Default value is 32.
-                qmoe_block_size = 16/32/64/128/256: Specify the block size for QMoE expert weights quantization.
-                    Default is 128 for CUDA and TRT-RTX, 32 for others. Supported EPs: CPU, CUDA, WebGPU, TRT-RTX.
+                qmoe_block_size = -1/0/32/64/128: Specify the block size for QMoE expert weights quantization.
+                    Set <= 0 for per-channel quantization. Default is 128 for TRT-RTX, 32 for others.
+                    Supported EPs: CPU, CUDA, WebGPU, TRT-RTX.
                 int4_is_symmetric = Quantize the weights symmetrically. Default is true.
                     If true, quantization is done to int4. If false, quantization is done to uint4.
                 int4_op_types_to_quantize = MatMul/Gather: Specify op types to target for int4 quantization.
@@ -425,15 +430,22 @@ def get_args():
                 int4_nodes_to_exclude = Specify nodes to exclude from int4 quantization.
                     Use this option when you want to exclude certain nodes from being quantized.
                     Separate the node names with a ',' when passing them here (e.g. int4_nodes_to_exclude=/lm_head/MatMul,/model/embed_tokens/Gather)
-                int4_algo_config = Method for int4 quantization. Default is 'default'.
-                    Currently supported options are: 'default', 'rtn', 'rtn_last', 'k_quant', 'k_quant_mixed', 'k_quant_last', 'k_quant_linear'.
+                int4_algo_config = Base method for int4 quantization. Default is 'default'.
+                    Currently supported base methods are: 'default', 'rtn', 'k_quant'.
                     default = algo_config passed to MatMulNBitsQuantizer is None. Quantizer uses default RTN algorithm. All MatMuls are quantized as int4 (with different node naming conventions from `rtn`).
                     rtn = RTN algorithm for int4 quantization.
-                    rtn_last = RTN algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
                     k_quant = k_quant algorithm for int4 quantization.
-                    k_quant_mixed = k_quant algorithm with mixed precision (int4 + int8).
-                    k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
-                    k_quant_linear = k_quant algorithm with linear attention layer projections and MLPs promoted to int8 (for hybrid attention models like Qwen3.5).
+                    The following legacy compound values are still accepted as aliases (base method + int8 placement flags):
+                    rtn_last = rtn + last_matmul_weight_int8=true.
+                    k_quant_last = k_quant + last_matmul_weight_int8=true.
+                    k_quant_mixed = k_quant + last_matmul_weight_int8=true + int8_mixed_layers=true.
+                    k_quant_linear = k_quant + last_matmul_weight_int8=true + int8_linear_attn=true.
+                last_matmul_weight_int8 = Quantize the last MatMul (e.g. /lm_head/MatMul) as int8 instead of int4. Default is false.
+                    Orthogonal to int4_algo_config; can be combined with any base method ('default', 'rtn', 'k_quant').
+                int8_mixed_layers = Promote the most quantization-sensitive MatMuls (llama.cpp mixed strategy: first/last eighth of layers plus every third layer's qkv_proj/v_proj/down_proj) to int8. Default is false.
+                    Orthogonal to int4_algo_config; can be combined with any base method.
+                int8_linear_attn = Promote linear-attention projections and their MLPs to int8 (for hybrid attention models like Qwen3.5). Default is false.
+                    Orthogonal to int4_algo_config; can be combined with any base method.
                 shared_embeddings = Enable weight sharing between embedding and LM head layers. Default is false.
                     Use this option to share weights and reduce model size by eliminating duplicate weights.
                     For quantized models (INT4/UINT4): Shares quantized weights using GatherBlockQuantized. Only works with rtn and k_quant algorithms, and cannot be used if LM head is excluded.
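
The help text above describes the legacy compound `int4_algo_config` values as aliases for a base method plus int8 placement flags. A minimal sketch of that expansion follows. The helper name `expand_int4_algo_config`, the table `LEGACY_ALIASES`, and the rule that an explicitly passed flag takes precedence over an alias-implied one are assumptions made for illustration; they are not taken from `builder.py`.

```python
# Hypothetical illustration of the alias table in the help text.
# None of these names come from builder.py.

BASE_METHODS = ("default", "rtn", "k_quant")
INT8_FLAGS = ("last_matmul_weight_int8", "int8_mixed_layers", "int8_linear_attn")

# Legacy compound value -> (base method, flags it implies), as listed in the help text.
LEGACY_ALIASES = {
    "rtn_last": ("rtn", {"last_matmul_weight_int8": True}),
    "k_quant_last": ("k_quant", {"last_matmul_weight_int8": True}),
    "k_quant_mixed": ("k_quant", {"last_matmul_weight_int8": True, "int8_mixed_layers": True}),
    "k_quant_linear": ("k_quant", {"last_matmul_weight_int8": True, "int8_linear_attn": True}),
}


def expand_int4_algo_config(extra_options):
    """Return a copy of extra_options with any legacy alias split into
    a base method and explicit boolean int8 placement flags."""
    opts = dict(extra_options)
    algo = opts.get("int4_algo_config", "default")

    if algo in LEGACY_ALIASES:
        base, implied = LEGACY_ALIASES[algo]
        opts["int4_algo_config"] = base
        for flag, value in implied.items():
            # Assumption: a flag the user set explicitly wins over the alias.
            opts.setdefault(flag, value)
    elif algo in BASE_METHODS:
        opts["int4_algo_config"] = algo
    else:
        raise ValueError(f"Unsupported int4_algo_config: {algo!r}")

    # Flags that were neither passed nor implied fall back to their documented default (false).
    for flag in INT8_FLAGS:
        opts.setdefault(flag, False)
    return opts


if __name__ == "__main__":
    print(expand_int4_algo_config({"int4_algo_config": "k_quant_mixed"}))
```

Under these assumptions, `int4_algo_config=k_quant_mixed` and `int4_algo_config=k_quant` combined with `last_matmul_weight_int8=true` and `int8_mixed_layers=true` resolve to the same settings, which is the equivalence the help text states.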