TurboQuant for ORT WebGPU along with model_benchmark updates to measure GPU memory #2084
Open
sushraja-msft wants to merge 15 commits into
Pull request overview
This PR adds initial (“WIP”) plumbing and tooling to support TurboQuant KV-cache compression for ORT WebGPU by allowing GenAI to use a reduced KV head dimension instead of the model’s original head_size.
Changes:
- Add `kv_cache_head_size` to `genai_config.json` parsing and use it when allocating the default KV cache.
- Add a new `tools/prepare_turbo_quant.py` helper to update ONNX KV tensor shapes (to a dynamic dim) and update `genai_config.json` with TurboQuant WebGPU provider options.
- Add new C++ attention-quality test executables (NIAH + RULER-inspired) and make small build/benchmark updates (ORT_HOME test env handling, WebGPU provider validation).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/python/util/dependency_resolver.py | Avoid copying directories when copying dependency artifacts. |
| tools/prepare_turbo_quant.py | New script to rewrite ONNX KV cache shapes and update GenAI config for TurboQuant. |
| src/models/kv_cache.cpp | Use kv_cache_head_size (if provided) when sizing the default KV cache allocation. |
| src/config.h | Add optional kv_cache_head_size to decoder config. |
| src/config.cpp | Parse kv_cache_head_size from genai_config.json. |
| examples/c/src/niah_test.cpp | New NIAH attention-quality test program. |
| examples/c/src/ruler_test.cpp | New RULER-inspired attention-quality benchmark program. |
| examples/c/CMakeLists.txt | Add build options/targets for the new NIAH/RULER test executables. |
| build.py | Update Windows generator default/list, improve env/PATH handling for ORT_HOME during tests, adjust examples build cleanup. |
| benchmark/c/options.cpp | Allow webgpu as a benchmark execution provider option. |
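As a rough illustration of the `kv_cache_head_size` override described for `src/config.h` and `src/models/kv_cache.cpp`, the sketch below picks the last dimension of a KV tensor from an optional config field. The `DecoderConfig` struct, the `KvCacheShape` helper, and the `[batch, kv_heads, max_length, head_dim]` layout are assumptions made for this example, not the PR's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the decoder section of genai_config.json.
struct DecoderConfig {
  int num_key_value_heads{};
  int head_size{};
  std::optional<int> kv_cache_head_size;  // unset means "fall back to head_size"
};

// Shape of one K or V tensor, assuming a [batch, kv_heads, max_length, head_dim] layout.
inline std::array<int64_t, 4> KvCacheShape(const DecoderConfig& cfg,
                                           int64_t batch,
                                           int64_t max_length) {
  // Use the reduced head dimension only when the config provides one.
  const int head_dim = cfg.kv_cache_head_size.value_or(cfg.head_size);
  return {batch, cfg.num_key_value_heads, max_length, head_dim};
}
```

With `head_size = 128` and no override, the last dimension stays 128; setting `kv_cache_head_size = 34` would shrink only that dimension and leave the others unchanged.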
a258c3f to 21ad295
178654b to b540b33
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
benchmark/c/main.cpp:321
- GPU memory is sampled after the iterations loop, but when `reuse_generator` is false each iteration's `new_gen` is destroyed at the end of the loop body. That means by the time `GetGpuMemoryUsage()` runs, the KV cache for the final iteration may already be freed, producing misleadingly low numbers (despite the comment saying KV cache is still allocated). Keeping the last `new_gen` alive until after sampling fixes this for the default `reuse_generator=false` path.
for (size_t i = 0; i < opts.num_iterations; ++i) {
std::unique_ptr<OgaGenerator> new_gen;
if (opts.reuse_generator) {
generator->RewindTo(0);
} else {
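The lifetime issue raised in this comment can be sketched without the real benchmark code. In the example below, `FakeGenerator` and `g_live_bytes` are hypothetical stand-ins for a generator that owns a KV cache and for a GPU memory counter; they are not the OGA API. The only point shown is that hoisting the owning pointer out of the loop keeps the final iteration's allocation alive at sampling time.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical counter standing in for "GPU memory currently allocated".
static std::size_t g_live_bytes = 0;

// Hypothetical generator that owns a KV cache of a fixed size.
struct FakeGenerator {
  explicit FakeGenerator(std::size_t kv_bytes) : bytes(kv_bytes) { g_live_bytes += bytes; }
  ~FakeGenerator() { g_live_bytes -= bytes; }
  std::size_t bytes;
};

// Runs `iterations` fresh generators and samples memory after the loop.
inline std::size_t SampleAfterLoop(std::size_t iterations, std::size_t kv_bytes) {
  std::unique_ptr<FakeGenerator> last_gen;  // declared outside the loop body
  for (std::size_t i = 0; i < iterations; ++i) {
    last_gen.reset();                                      // free the previous iteration's cache
    last_gen = std::make_unique<FakeGenerator>(kv_bytes);  // allocate this iteration's cache
    // ... token generation would happen here ...
  }
  // Sampled while last_gen still owns the final iteration's cache.
  return g_live_bytes;
}
```

If the pointer were declared inside the loop body instead, it would be destroyed at the end of each iteration and the sample taken after the loop would read zero.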
…ls can work with quantized kv cache.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…oft/onnxruntime-genai into user/sushraja/tq_3
Comment on lines +31 to +38

    for (UINT i = 0;; ++i) {
      IDXGIAdapter1* adapter1 = nullptr;
      if (FAILED(enum_hr) || adapter1 == nullptr) {
        if (adapter1 != nullptr) {
          adapter1->Release();
        }
        continue;
      }
Comment on lines 366 to +368

    // Release the generator before printing results
    generator.reset();
    new_gen.reset();
Comment on lines +361 to +364

    #ifdef _WIN32
    // Capture GPU memory before releasing the generator (while KV cache is still allocated)
    const auto gpu_mem = benchmark::utils::GetGpuMemoryUsage();
    #endif
Comment on lines +153 to +157

    // Compute the compressed KV cache head dimension for quantized KV caches.
    // The quantizer packs each head into (1 + head_size / indices_per_word) u32 words: one fp32 scale followed by
    // head_size values quantized to 4 or 8 bits (4-bit: 8 values/u32, 8-bit: 4 values/u32).
    // The tensor dimension depends on the element type.
    // head_size >= 8). If kv_cache_quantization_bits is enabled with an invalid head_size, this
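The packing arithmetic in this comment can be worked through with a small sketch. The word count follows the formula quoted above. The conversion from u32 words to tensor elements (two fp16 elements or one fp32 element per word) and the divisibility check are assumptions made for this example, since the diff excerpt is truncated; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Number of u32 words per head: one fp32 scale plus the packed indices.
inline int64_t PackedWordsPerHead(int64_t head_size, int bits) {
  if (bits != 4 && bits != 8) {
    throw std::invalid_argument("bits must be 4 or 8");
  }
  const int64_t indices_per_word = 32 / bits;  // 4-bit: 8 values/u32, 8-bit: 4 values/u32
  // Assumption: head_size must fill whole words; the PR's exact rule is not shown.
  if (head_size <= 0 || head_size % indices_per_word != 0) {
    throw std::invalid_argument("head_size must be a positive multiple of indices_per_word");
  }
  return 1 + head_size / indices_per_word;
}

// Head dimension expressed in tensor elements of the given byte width.
// Assumption: a u32 word spans 4 / element_bytes elements (fp16 -> 2, fp32 -> 1).
inline int64_t CompressedHeadDim(int64_t head_size, int bits, int64_t element_bytes) {
  if (element_bytes != 2 && element_bytes != 4) {
    throw std::invalid_argument("element_bytes must be 2 (fp16) or 4 (fp32)");
  }
  return PackedWordsPerHead(head_size, bits) * (4 / element_bytes);
}
```

Under these assumptions, `head_size = 128` at 4 bits packs into 1 + 128/8 = 17 words, which is 34 fp16 elements; at 8 bits it packs into 1 + 128/4 = 33 words, which is 66 fp16 elements or 33 fp32 elements.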
Add support for KV cache quantization of existing models.
To enable this for the WebGPU execution provider, make the following change to genai_config.json.
ORT-side changes: microsoft/onnxruntime#28059
This change also updates the model_benchmark tool to print GPU memory usage and to allow profiling.