TurboQuant for ORT WebGPU along with model_benchmark updates to measure GPU memory by sushraja-msft · Pull Request #2084 · microsoft/onnxruntime-genai

sushraja-msft · 2026-04-14T03:08:11Z

Add support for KV cache quantization of existing models.

To enable this for the WebGPU execution provider (EP), make the following change to genai_config.json:

    "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": [
            {
                "webgpu": {
                    "kvCacheQuantizationBits": "4"
                }
            }
        ]
    },
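
Because the option above is supplied as a string, it has to be converted and validated before use. The following is a minimal sketch of that step, assuming only 4 and 8 are accepted (the two bit widths named in the review excerpts on this page). `ParseKvCacheQuantizationBits` is a hypothetical helper name for illustration, not the function used in the PR.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: converts the string-valued provider option into a
// validated bit width. Returns std::nullopt for anything other than "4" or "8".
std::optional<int> ParseKvCacheQuantizationBits(std::string_view value) {
  int bits = 0;
  const char* first = value.data();
  const char* last = value.data() + value.size();
  const auto [ptr, ec] = std::from_chars(first, last, bits);
  // Reject non-numeric input and trailing characters such as "4bits".
  if (ec != std::errc{} || ptr != last) {
    return std::nullopt;
  }
  // Assumption: only 4-bit and 8-bit quantization are supported.
  if (bits != 4 && bits != 8) {
    return std::nullopt;
  }
  return bits;
}
```

Returning `std::optional` lets the caller decide whether an invalid value should disable quantization or raise an error.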

ORT-side changes: microsoft/onnxruntime#28059

This change also updates the model_benchmark tool to print GPU memory usage numbers and allow profiling.

Copilot

Pull request overview

This PR adds initial (“WIP”) plumbing and tooling to support TurboQuant KV-cache compression for ORT WebGPU by allowing GenAI to use a reduced KV head dimension instead of the model’s original head_size.

Changes:

- Add kv_cache_head_size to genai_config.json parsing and use it when allocating the default KV cache.
- Add a new tools/prepare_turbo_quant.py helper to update ONNX KV tensor shapes (to a dynamic dim) and update genai_config.json with TurboQuant WebGPU provider options.
- Add new C++ attention-quality test executables (NIAH + RULER-inspired) and make small build/benchmark updates (ORT_HOME test env handling, WebGPU provider validation).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tools/python/util/dependency_resolver.py | Avoid copying directories when copying dependency artifacts. |
| tools/prepare_turbo_quant.py | New script to rewrite ONNX KV cache shapes and update GenAI config for TurboQuant. |
| src/models/kv_cache.cpp | Use `kv_cache_head_size` (if provided) when sizing the default KV cache allocation. |
| src/config.h | Add optional `kv_cache_head_size` to decoder config. |
| src/config.cpp | Parse `kv_cache_head_size` from `genai_config.json`. |
| examples/c/src/niah_test.cpp | New NIAH attention-quality test program. |
| examples/c/src/ruler_test.cpp | New RULER-inspired attention-quality benchmark program. |
| examples/c/CMakeLists.txt | Add build options/targets for the new NIAH/RULER test executables. |
| build.py | Update Windows generator default/list, improve env/PATH handling for ORT_HOME during tests, adjust examples build cleanup. |
| benchmark/c/options.cpp | Allow `webgpu` as a benchmark execution provider option. |

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

benchmark/c/main.cpp:321

GPU memory is sampled after the iterations loop, but when reuse_generator is false each iteration's new_gen is destroyed at the end of the loop body. That means by the time GetGpuMemoryUsage() runs, the KV cache for the final iteration may already be freed, producing misleadingly low numbers (despite the comment saying KV cache is still allocated). Keeping the last new_gen alive until after sampling fixes this for the default reuse_generator=false path.

  for (size_t i = 0; i < opts.num_iterations; ++i) {
    std::unique_ptr<OgaGenerator> new_gen;
    if (opts.reuse_generator) {
      generator->RewindTo(0);
    } else {
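
The lifetime issue described in this comment can be shown without the real API. The sketch below is illustrative only: `FakeGenerator`, `g_live_bytes`, and `SampleGpuBytes()` are hypothetical stand-ins for `OgaGenerator`, the GPU allocation it holds, and `GetGpuMemoryUsage()`. It contrasts a generator scoped to the loop body with one hoisted out of the loop so that the last instance is still alive when memory is sampled.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for GPU memory held by live generators.
static std::size_t g_live_bytes = 0;

// Hypothetical generator whose lifetime owns a fixed-size KV cache.
struct FakeGenerator {
  static constexpr std::size_t kKvCacheBytes = 1024;
  FakeGenerator() { g_live_bytes += kKvCacheBytes; }
  ~FakeGenerator() { g_live_bytes -= kKvCacheBytes; }
};

// Hypothetical replacement for the benchmark's memory query.
std::size_t SampleGpuBytes() { return g_live_bytes; }

// Generator declared inside the loop body: it is destroyed at the end of
// each iteration, so nothing is allocated by the time memory is sampled.
std::size_t SampleWithLoopScopedGenerator(std::size_t iterations) {
  for (std::size_t i = 0; i < iterations; ++i) {
    auto new_gen = std::make_unique<FakeGenerator>();
    (void)new_gen;  // The benchmark would run generation here.
  }
  return SampleGpuBytes();
}

// Generator hoisted out of the loop: the last instance survives until
// after sampling and is released explicitly afterwards.
std::size_t SampleWithHoistedGenerator(std::size_t iterations) {
  std::unique_ptr<FakeGenerator> last_gen;
  for (std::size_t i = 0; i < iterations; ++i) {
    last_gen.reset();  // Free the previous iteration's cache first.
    last_gen = std::make_unique<FakeGenerator>();
  }
  const std::size_t sampled = SampleGpuBytes();
  last_gen.reset();
  return sampled;
}
```

With three iterations, the loop-scoped variant samples zero bytes while the hoisted variant samples one KV cache's worth, which is the difference the comment describes.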

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

+  for (UINT i = 0;; ++i) {
+    IDXGIAdapter1* adapter1 = nullptr;
+    if (FAILED(enum_hr) || adapter1 == nullptr) {
+      if (adapter1 != nullptr) {
+        adapter1->Release();
+      }
+      continue;
+    }


  // Release the generator before printing results
  generator.reset();
+  new_gen.reset();


+#ifdef _WIN32
+  // Capture GPU memory before releasing the generator (while KV cache is still allocated)
+  const auto gpu_mem = benchmark::utils::GetGpuMemoryUsage();
+#endif


+// Compute the compressed KV cache head dimension for quantized KV caches.
+// The quantizer packs each head into (1 + head_size / indices_per_word) u32 words: one fp32 scale followed by
+// head_size values quantized to 4 or 8 bits (4-bit: 8 values/u32, 8-bit: 4 values/u32).
+// The tensor dimension depends on the element type.
+// head_size >= 8). If kv_cache_quantization_bits is enabled with an invalid head_size, this
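
The packing arithmetic in the comment above can be worked through in a short sketch. The word count follows the formula quoted in the comment, and the validity check follows the commit message "silently disable TurboQuant if head_size < 8 or not power of 2". How the tensor dimension depends on the element type is not spelled out on this page, so `CompressedHeadDim` assumes it is the packed byte count divided by the element size. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Number of u32 words per head: one fp32 scale plus the packed indices.
// Returns std::nullopt when the inputs cannot be packed.
std::optional<std::size_t> PackedWordsPerHead(std::size_t head_size, int bits) {
  if (bits != 4 && bits != 8) {
    return std::nullopt;
  }
  // head_size must be at least 8 and a power of two.
  const bool power_of_two = head_size != 0 && (head_size & (head_size - 1)) == 0;
  if (head_size < 8 || !power_of_two) {
    return std::nullopt;
  }
  // 4-bit: 8 values per u32; 8-bit: 4 values per u32.
  const std::size_t indices_per_word = 32 / static_cast<std::size_t>(bits);
  return 1 + head_size / indices_per_word;
}

// Assumption (not confirmed by the source): the tensor dimension is the
// packed byte count divided by the element size (2 for fp16, 4 for fp32).
std::optional<std::size_t> CompressedHeadDim(std::size_t head_size, int bits,
                                             std::size_t element_bytes) {
  if (element_bytes != 2 && element_bytes != 4) {
    return std::nullopt;
  }
  const auto words = PackedWordsPerHead(head_size, bits);
  if (!words) {
    return std::nullopt;
  }
  return *words * sizeof(std::uint32_t) / element_bytes;
}
```

For head_size = 128 at 4 bits this gives 17 words per head, compared with 64 words for an unquantized fp16 head (128 × 2 bytes / 4), a reduction of about 3.8×.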


Copilot AI review requested due to automatic review settings April 14, 2026 03:08

sushraja-msft marked this pull request as draft April 14, 2026 03:08

Copilot started reviewing on behalf of sushraja-msft April 14, 2026 03:10

Copilot AI reviewed Apr 14, 2026

Comment thread build.py Outdated

Comment thread src/models/kv_cache.cpp Outdated

Comment thread tools/prepare_turbo_quant.py Outdated

Comment thread examples/c/src/niah_test.cpp Outdated

Comment thread examples/c/src/ruler_test.cpp Outdated

sushraja-msft changed the title ~~WIP: Turbo Quant support for ORT WebGPU~~ WIP: TurboQuant support for ORT WebGPU Apr 14, 2026

sushraja-msft changed the title ~~WIP: TurboQuant support for ORT WebGPU~~ WIP: TurboQuant for ORT WebGPU Apr 14, 2026

sushraja-msft force-pushed the user/sushraja/turbo_quant branch from a258c3f to 21ad295 June 4, 2026 06:05

sushraja-msft requested a review from Copilot June 12, 2026 02:02

Copilot started reviewing on behalf of sushraja-msft June 12, 2026 02:03

Copilot AI reviewed Jun 12, 2026

Comment thread src/models/kv_cache.cpp Outdated

Comment thread src/models/kv_cache.cpp Outdated

Comment thread benchmark/c/windows/resource_utils.cpp Outdated

Copilot started work on behalf of sushraja-msft June 12, 2026 02:22

Copilot finished work on behalf of sushraja-msft June 12, 2026 02:24

Copilot started work on behalf of sushraja-msft June 12, 2026 02:29

Copilot started work on behalf of sushraja-msft June 12, 2026 02:30

Copilot finished work on behalf of sushraja-msft June 12, 2026 02:31

Copilot finished work on behalf of sushraja-msft June 12, 2026 02:32

Copilot started work on behalf of sushraja-msft June 12, 2026 18:27

Copilot finished work on behalf of sushraja-msft June 12, 2026 18:29

sushraja-msft requested a review from Copilot June 15, 2026 23:24

sushraja-msft marked this pull request as ready for review June 15, 2026 23:24

sushraja-msft requested a review from a team as a code owner June 15, 2026 23:24

sushraja-msft changed the title ~~WIP: TurboQuant for ORT WebGPU~~ TurboQuant for ORT WebGPU along with model_benchmark updates to measure GPU memory Jun 15, 2026

Copilot started reviewing on behalf of sushraja-msft June 15, 2026 23:25

sushraja-msft assigned kunal-vaishnavi, baijumeswani and guschmue Jun 15, 2026

Copilot AI reviewed Jun 15, 2026

Comment thread src/config.cpp Outdated

Comment thread benchmark/c/main.cpp Outdated

Comment thread benchmark/c/main.cpp Outdated

Comment thread benchmark/c/windows/resource_utils.cpp Outdated

sushraja-msft added 3 commits June 24, 2026 21:25

make the config property be an integer

4eecb2e

Model benchmark improvements.

3fe1563

ensure gpu memory is reported on windows only

8d647fe

Copilot AI and others added 7 commits June 24, 2026 21:25

kv_cache: silently disable TurboQuant if head_size < 8 or not power of 2

d3ab4bf

fix: validate turboQuant string parsing and remove redundant pragma comment

6846987

Fix parsing of turboquant

02d47eb

Fix review comments

06c789f

Fix formatting error

2307b30

Commit additional comment

524fb92

use generic term for quantization

b540b33

sushraja-msft requested a review from Copilot June 25, 2026 04:52

sushraja-msft force-pushed the user/sushraja/turbo_quant branch from 178654b to b540b33 June 25, 2026 04:52

Copilot started reviewing on behalf of sushraja-msft June 25, 2026 04:52

Copilot AI reviewed Jun 25, 2026

Comment thread src/models/kv_cache.cpp Outdated

Comment thread benchmark/c/main.cpp

Comment thread benchmark/c/windows/resource_utils.cpp Outdated

sushraja-msft and others added 5 commits June 25, 2026 10:22

Update builder.py to leave the last headsize as symbolic so that models can work with quantized kv cache.

ce9488c

Potential fix for pull request finding

124299f

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

6a5977e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Update with copilot feedback

8991bb3

Merge branch 'user/sushraja/turbo_quant' of https://github.com/microsoft/onnxruntime-genai into user/sushraja/tq_3

665bcaf

sushraja-msft requested a review from Copilot June 25, 2026 22:33

Copilot started reviewing on behalf of sushraja-msft June 25, 2026 22:33

Copilot AI reviewed Jun 25, 2026
