[pull] master from tensorflow:master#1981
Merged
Imported from GitHub PR openxla/xla#43894

📝 Summary of Changes

Inserts `griddepcontrol.launch_dependents` instructions between Triton GEMM mainloops and epilogues. They enable further overlaps of kernels on top of the existing base implementation (openxla/xla#38544). The insertion logic is based on the observation that most Triton GEMMs, even those without epilogue in HLO, do have epilogues at lower levels - moving / transposing output data through shared memory. This creates an opportunity to increase productive overlaps with subsequent kernels. Using launch_dependents instructions requires disabling non-coherent / invariant loads for dependent data in successors being launched as these can not be guarded by their PDL waits.

🎯 Justification

Speeds up (typically inference) benchmarks by overlapping independent phases of kernels.

🚀 Kind of Contribution

⚡️ Performance Improvement

📊 Benchmark (for Performance Improvements)

Measured on H100 using CUDA events:

| Benchmark                                           | Speedup | Error |
|-----------------------------------------------------|---------|-------|
| gemma2_2b_keras_jax                                 | 1.01x   | 0.00x |
| gemma3_1b_flax_call                                 | 1.01x   | 0.00x |
| gemma3_1b_flax_sample_loop                          | 1.00x   | 0.00x |
| gemma3_4b_flax_call                                 | 1.01x   | 0.00x |
| gemma3_4b_flax_sample_loop                          | 1.01x   | 0.00x |
| gemma3_12b_flax_call                                | 1.01x   | 0.00x |
| gemma3_12b_flax_sample_loop                         | 1.00x   | 0.00x |
| gpu_hlo                                             | 1.00x   | 0.01x |
| hlo_gemma4_2b_bf16                                  | 1.00x   | 0.00x |
| hlo_llama31_8b_bf16_1x8                             | 1.00x   | 0.01x |
| hlo_llama31_8b_fp8_1x8                              | 1.01x   | 0.05x |
| hlo_mixtral_8x7b_bf16_1x8                           | 1.00x   | 0.08x |
| nv_maxtext_1n1g_jit_train_step_before_optimization  | 1.00x   | 0.00x |
| u4_all_gather_1x8                                   | 0.99x   | 0.03x |

🧪 Unit Tests: yes

🧪 Execution Tests: no

Copybara import of the project:

-- 711244dcbece02d56400c097adb5419f6856da52 by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Add PDL launch instruction insertion.

Merging this change closes #43894

PiperOrigin-RevId: 939804756
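
The overlap described in the summary above can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions, not the XLA implementation: the function names and the four phase durations are hypothetical, and the model assumes the successor kernel's pre-wait work (launch latency plus any prologue that does not read the producer's output) can run fully in parallel with the producer's epilogue, while the successor's dependent body still starts only after the producer has finished.

```python
def serial_time(mainloop, epilogue, successor_prologue, successor_body):
    """Without an early launch: the successor starts only after the producer ends."""
    return mainloop + epilogue + successor_prologue + successor_body


def overlapped_time(mainloop, epilogue, successor_prologue, successor_body):
    """With launch_dependents issued right after the mainloop (hypothetical model).

    The successor's pre-wait work overlaps the producer's epilogue; its
    dependent body begins once both have finished.
    """
    return mainloop + max(epilogue, successor_prologue) + successor_body


if __name__ == "__main__":
    # Hypothetical phase durations in arbitrary time units.
    phases = dict(mainloop=10, epilogue=2, successor_prologue=3, successor_body=5)
    print("serial:    ", serial_time(**phases))
    print("overlapped:", overlapped_time(**phases))
```

In this model the saving is `min(epilogue, successor_prologue)`, so the benefit is bounded by the shorter of the two overlapped phases, which is consistent with the small end-to-end speedups in the table.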
PiperOrigin-RevId: 939806406
…one file Pure refactoring step with no behavior or content modifications. This will make it easier to review further changes to the HTML content. PiperOrigin-RevId: 939807152
PiperOrigin-RevId: 939821154
This will let Fusion Explorer expose more information about the choices made by PriorityFusion. PiperOrigin-RevId: 939822742
Document the HLO parser support for desugaring suffix-based async operations (e.g., dot-start, dot-update, dot-done) and the variadic nature of async-update. PiperOrigin-RevId: 939897472
Imported from GitHub PR openxla/xla#44389

📝 Summary of Changes

Enables float and buffer xor checker thunks on ROCm

🎯 Justification

Enables more debugging tools on ROCm platform

🚀 Kind of Contribution

✨ New Feature

📊 Benchmark (for Performance Improvements)

N/A

🧪 Unit Tests: None

🧪 Execution Tests: Moved CUDA-specific tests into:

//xla/stream_executor/gpu:buffer_debug_float_check_kernel_test
//xla/stream_executor/gpu:buffer_debug_xor_checksum_kernel_test

Copybara import of the project:

-- 275390bf545b29b177820195ec4a209d77310093 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Enable float and buffer checker

Merging this change closes #44389

Manual patches:

- Rename `*_lib.cu.h` headers to `*_lib.cu.h.inc`. Those headers need to be included after a platform-specific definition of the `kWarpSize` constant, and don't work without it. However, putting them in `srcs` makes some internal test builds attempt to compile the header by itself, which fails because it's not self-sufficient.
- Add missing `compatible_with = "//buildenv/target:non_prod"`.

PiperOrigin-RevId: 939902012
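
For readers unfamiliar with the idea behind an XOR checksum over a buffer, here is a minimal host-side sketch. It is an assumption-laden illustration, not the kernel from the commit: the function name `xor_checksum`, the 32-bit little-endian word size, and the zero-padding of a trailing partial word are all hypothetical choices made for this example.

```python
import struct
from functools import reduce


def xor_checksum(buffer: bytes) -> int:
    """XOR all 32-bit little-endian words of `buffer` into one value.

    Assumption: a trailing partial word is zero-padded; the real kernel
    may treat it differently.
    """
    padded = buffer + b"\x00" * (-len(buffer) % 4)
    words = struct.unpack(f"<{len(padded) // 4}I", padded)
    return reduce(lambda acc, word: acc ^ word, words, 0)


if __name__ == "__main__":
    data = struct.pack("<2I", 0xF0F0F0F0, 0x0F0F0F0F)
    print(hex(xor_checksum(data)))
```

Comparing such a checksum before and after a run is a cheap way to notice that a buffer changed unexpectedly, although it cannot detect changes that cancel out under XOR.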
PiperOrigin-RevId: 939920098
…lysis. The include and build dependency for tsl/platform/errors are not used in hlo_dataflow_analysis. PiperOrigin-RevId: 939932250
PiperOrigin-RevId: 939946290
…egate PiperOrigin-RevId: 939950090
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)