This issue tracks the progress of Qwen3.6-35B-A3B throughput optimization.

Related PRs:

For Qwen 3.6:
* olive-recipes
  * [Add Qwen3.6-35B-A3B MoE VLM recipe (CUDA + CPU)](https://github.com/microsoft/olive-recipes/pull/492)
* onnxruntime-genai
  * [Fix CUDA QMoE INT4 export for Qwen3.5/3.6 MoE models](https://github.com/microsoft/onnxruntime-genai/pull/2209)
  * [Qwen3.6 MTP](https://github.com/microsoft/onnxruntime-genai/pull/2218)
* cuda op / kernels:
  * [CUDA] Optimize QMoE SoftmaxTopK router for small-batch decode ([#28980](https://github.com/microsoft/onnxruntime/pull/28980))
  * [CUDA] Add decode-optimized LinearAttention (GatedDeltaNet) kernels ([#28985](https://github.com/microsoft/onnxruntime/pull/28985))
  * [CUDA] Add decode (M=1) GEMV fast path to MatMul ([#28986](https://github.com/microsoft/onnxruntime/pull/28986)) (not needed if we have the shared expert optimization listed below)
  * [CUDA] QMoE support shared experts ([#29028](https://github.com/microsoft/onnxruntime/pull/29028))
  * [CUDA] QMoE GEMV fast path for batch-1 decode ([#29038](https://github.com/microsoft/onnxruntime/pull/29038)) (needs to be extended to block quantization)
  * Use user-supplied external initializer in place when already on the planned device ([#29013](https://github.com/microsoft/onnxruntime/pull/29013))

Related issues:
* [AddExternalInitializers copies device (GPU) OrtValues per session instead of using them in place](https://github.com/microsoft/onnxruntime/issues/29009)