TurboQuant for ORT WebGPU along with model_benchmark updates to measure GPU memory #2084

Open: sushraja-msft wants to merge 15 commits into main from user/sushraja/turbo_quant

sushraja-msft commented Apr 14, 2026

Add support for KV cache quantization of existing models.

To enable this for the WebGPU execution provider, make the following change to genai_config.json:

    "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": [
            {
                "webgpu": {
                    "kvCacheQuantizationBits": "4"
                }
            }
        ]
    },

ORT-side changes: microsoft/onnxruntime#28059

This change also updates the model_benchmark tool to print GPU memory usage numbers and allow profiling.

Copilot AI review requested due to automatic review settings April 14, 2026 03:08
@sushraja-msft sushraja-msft marked this pull request as draft April 14, 2026 03:08

Copilot AI left a comment

Pull request overview

This PR adds initial (“WIP”) plumbing and tooling to support TurboQuant KV-cache compression for ORT WebGPU by allowing GenAI to use a reduced KV head dimension instead of the model’s original head_size.

Changes:

  • Add kv_cache_head_size to genai_config.json parsing and use it when allocating the default KV cache.
  • Add a new tools/prepare_turbo_quant.py helper to update ONNX KV tensor shapes (to a dynamic dim) and update genai_config.json with TurboQuant WebGPU provider options.
  • Add new C++ attention-quality test executables (NIAH + RULER-inspired) and make small build/benchmark updates (ORT_HOME test env handling, WebGPU provider validation).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tools/python/util/dependency_resolver.py | Avoid copying directories when copying dependency artifacts. |
| tools/prepare_turbo_quant.py | New script to rewrite ONNX KV cache shapes and update GenAI config for TurboQuant. |
| src/models/kv_cache.cpp | Use kv_cache_head_size (if provided) when sizing the default KV cache allocation. |
| src/config.h | Add optional kv_cache_head_size to decoder config. |
| src/config.cpp | Parse kv_cache_head_size from genai_config.json. |
| examples/c/src/niah_test.cpp | New NIAH attention-quality test program. |
| examples/c/src/ruler_test.cpp | New RULER-inspired attention-quality benchmark program. |
| examples/c/CMakeLists.txt | Add build options/targets for the new NIAH/RULER test executables. |
| build.py | Update Windows generator default/list, improve env/PATH handling for ORT_HOME during tests, adjust examples build cleanup. |
| benchmark/c/options.cpp | Allow webgpu as a benchmark execution provider option. |

Comment thread build.py Outdated
Comment thread src/models/kv_cache.cpp Outdated
Comment thread tools/prepare_turbo_quant.py Outdated
Comment thread examples/c/src/niah_test.cpp Outdated
Comment thread examples/c/src/ruler_test.cpp Outdated
@sushraja-msft sushraja-msft changed the title WIP: Turbo Quant support for ORT WebGPU WIP: TurboQuant support for ORT WebGPU Apr 14, 2026
@sushraja-msft sushraja-msft changed the title WIP: TurboQuant support for ORT WebGPU WIP: TurboQuant for ORT WebGPU Apr 14, 2026
@sushraja-msft sushraja-msft force-pushed the user/sushraja/turbo_quant branch from a258c3f to 21ad295 Compare June 4, 2026 06:05
@sushraja-msft sushraja-msft requested a review from Copilot June 12, 2026 02:02

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread src/models/kv_cache.cpp Outdated
Comment thread src/models/kv_cache.cpp Outdated
Comment thread benchmark/c/windows/resource_utils.cpp Outdated
@sushraja-msft sushraja-msft requested a review from Copilot June 15, 2026 23:24
@sushraja-msft sushraja-msft marked this pull request as ready for review June 15, 2026 23:24
@sushraja-msft sushraja-msft requested a review from a team as a code owner June 15, 2026 23:24
@sushraja-msft sushraja-msft changed the title WIP: TurboQuant for ORT WebGPU TurboQuant for ORT WebGPU along with model_benchmark updates to measure GPU memory Jun 15, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comment thread src/config.cpp Outdated
Comment thread benchmark/c/main.cpp Outdated
Comment thread benchmark/c/main.cpp Outdated
Comment thread benchmark/c/windows/resource_utils.cpp Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

benchmark/c/main.cpp:321

  • GPU memory is sampled after the iterations loop, but when reuse_generator is false each iteration's new_gen is destroyed at the end of the loop body. That means by the time GetGpuMemoryUsage() runs, the KV cache for the final iteration may already be freed, producing misleadingly low numbers (despite the comment saying KV cache is still allocated). Keeping the last new_gen alive until after sampling fixes this for the default reuse_generator=false path.
  for (size_t i = 0; i < opts.num_iterations; ++i) {
    std::unique_ptr<OgaGenerator> new_gen;
    if (opts.reuse_generator) {
      generator->RewindTo(0);
    } else {
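
The lifetime issue described in this suppressed comment can be illustrated with a small self-contained sketch. Everything here is hypothetical: `FakeGenerator`, `SampleFakeGpuMemory`, and the global byte counter are stand-ins invented for illustration, not the real `OgaGenerator` or `GetGpuMemoryUsage()`. The sketch only shows why a generator scoped to the loop body is already gone when memory is sampled after the loop, and how holding the last one outside the loop changes the reading.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for GPU memory held by live KV caches.
static std::size_t g_live_kv_bytes = 0;

// Hypothetical stand-in for a generator that owns a KV cache.
class FakeGenerator {
 public:
  explicit FakeGenerator(std::size_t kv_bytes) : kv_bytes_(kv_bytes) {
    g_live_kv_bytes += kv_bytes_;
  }
  ~FakeGenerator() {
    assert(g_live_kv_bytes >= kv_bytes_);
    g_live_kv_bytes -= kv_bytes_;
  }
  FakeGenerator(const FakeGenerator&) = delete;
  FakeGenerator& operator=(const FakeGenerator&) = delete;

 private:
  std::size_t kv_bytes_;
};

// Hypothetical stand-in for the memory query.
std::size_t SampleFakeGpuMemory() { return g_live_kv_bytes; }

// Pattern flagged in the review: the generator lives only inside the loop
// body, so its KV cache is freed before the sample is taken.
std::size_t SampleWithLoopScopedGenerator(std::size_t iterations, std::size_t kv_bytes) {
  for (std::size_t i = 0; i < iterations; ++i) {
    auto new_gen = std::make_unique<FakeGenerator>(kv_bytes);
    // ... generation would run here ...
  }  // new_gen is destroyed at the end of every iteration
  return SampleFakeGpuMemory();
}

// Suggested pattern: keep the last generator alive until after sampling.
std::size_t SampleWithLastGeneratorAlive(std::size_t iterations, std::size_t kv_bytes) {
  std::unique_ptr<FakeGenerator> last_gen;
  for (std::size_t i = 0; i < iterations; ++i) {
    last_gen.reset();  // free the previous cache first so only one is live
    last_gen = std::make_unique<FakeGenerator>(kv_bytes);
    // ... generation would run here ...
  }
  const std::size_t sampled = SampleFakeGpuMemory();  // cache still allocated
  last_gen.reset();  // release only after sampling
  return sampled;
}
```

With three iterations and a 1024-byte stand-in cache, the loop-scoped variant samples 0 bytes while the second variant samples 1024 bytes, which mirrors the misleadingly low number the comment warns about.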

Comment thread src/models/kv_cache.cpp Outdated
Comment thread benchmark/c/main.cpp
Comment thread benchmark/c/windows/resource_utils.cpp Outdated
sushraja-msft and others added 5 commits June 25, 2026 10:22
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Comment on lines +31 to +38
for (UINT i = 0;; ++i) {
  IDXGIAdapter1* adapter1 = nullptr;
  if (FAILED(enum_hr) || adapter1 == nullptr) {
    if (adapter1 != nullptr) {
      adapter1->Release();
    }
    continue;
  }
Comment thread benchmark/c/main.cpp
Comment on lines 366 to +368
// Release the generator before printing results
generator.reset();
new_gen.reset();
Comment thread benchmark/c/main.cpp
Comment on lines +361 to +364
#ifdef _WIN32
// Capture GPU memory before releasing the generator (while KV cache is still allocated)
const auto gpu_mem = benchmark::utils::GetGpuMemoryUsage();
#endif
Comment thread src/models/kv_cache.cpp
Comment on lines +153 to +157
// Compute the compressed KV cache head dimension for quantized KV caches.
// The quantizer packs each head into (1 + head_size / indices_per_word) u32 words: one fp32 scale followed by
// head_size values quantized to 4 or 8 bits (4-bit: 8 values/u32, 8-bit: 4 values/u32).
// The tensor dimension depends on the element type.
// head_size >= 8). If kv_cache_quantization_bits is enabled with an invalid head_size, this
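
The packing arithmetic in the quoted comment can be worked through with a short sketch. The word count, one fp32 scale word plus `head_size / indices_per_word` packed words, comes from the comment itself. The function names, the divisibility check, the zero return for unsupported inputs, and the conversion from u32 words to tensor elements (2 elements per word for a 2-byte type, 1 for a 4-byte type) are assumptions made for illustration and may not match what `kv_cache.cpp` actually does.

```cpp
#include <cassert>
#include <cstdint>

// Number of u32 words per head: one fp32 scale word followed by the packed
// quantized values. Returns 0 for unsupported inputs (hypothetical convention).
uint32_t PackedWordsPerHead(uint32_t head_size, uint32_t bits) {
  if (bits != 4 && bits != 8) return 0;
  const uint32_t indices_per_word = 32 / bits;  // 4-bit: 8 per u32, 8-bit: 4 per u32
  // Assumption: only head sizes that fill whole words are accepted.
  if (head_size == 0 || head_size % indices_per_word != 0) return 0;
  const uint32_t words = 1 + head_size / indices_per_word;
  assert(words >= 2);  // at least the scale word plus one packed word
  return words;
}

// Assumption: the cache tensor's last dimension is counted in elements of the
// tensor's type, so each u32 word spans 4 / element_bytes elements.
uint32_t PackedHeadDim(uint32_t head_size, uint32_t bits, uint32_t element_bytes) {
  if (element_bytes != 2 && element_bytes != 4) return 0;
  const uint32_t words = PackedWordsPerHead(head_size, bits);
  if (words == 0) return 0;
  return words * (4 / element_bytes);
}
```

Under these assumptions, a head_size of 128 at 4 bits packs into 17 u32 words (68 bytes), compared with 256 bytes for 128 unquantized fp16 values, and the resulting last dimension is 34 for a 2-byte element type or 17 for a 4-byte one.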