[pull] master from tensorflow:master#1981

Merged: pull[bot] merged 11 commits into GesuBackups:master from tensorflow:master on Jun 29, 2026
@pull pull Bot commented Jun 29, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

sergachev and others added 11 commits June 29, 2026 06:55
Imported from GitHub PR openxla/xla#43894

📝 Summary of Changes
Inserts `griddepcontrol.launch_dependents` instructions between Triton GEMM mainloops and epilogues. They enable further overlap of kernels on top of the existing base implementation (openxla/xla#38544). The insertion logic is based on the observation that most Triton GEMMs, even those without an epilogue in HLO, do have epilogues at lower levels that move or transpose output data through shared memory. This creates an opportunity to increase productive overlap with subsequent kernels.

Using `launch_dependents` instructions requires disabling non-coherent / invariant loads of dependent data in the successor kernels being launched, because these loads cannot be guarded by the successors' PDL waits.
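
The overlap argument can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions, not XLA's implementation: each phase has a fixed duration, the successor's prologue (the work before its PDL wait) may start as soon as the predecessor issues `launch_dependents`, the wait is released only once the predecessor has fully completed, and resource contention between the two kernels is ignored. The function name, parameters, and numbers are hypothetical.

```python
def successor_finish_time(mainloop, epilogue, succ_prologue, succ_body,
                          early_launch):
    """Toy model: time at which the successor kernel finishes.

    All arguments are phase durations in arbitrary time units.
    `early_launch` models launch_dependents placed right after the mainloop.
    """
    pred_done = mainloop + epilogue
    # Without the explicit instruction, dependents start only after the
    # predecessor has finished; with it, they may start after the mainloop.
    succ_start = mainloop if early_launch else pred_done
    # The successor's PDL wait blocks until the predecessor has completed,
    # so only the prologue can overlap with the predecessor's epilogue.
    wait_released = max(succ_start + succ_prologue, pred_done)
    return wait_released + succ_body


baseline = successor_finish_time(10, 3, 2, 8, early_launch=False)
overlapped = successor_finish_time(10, 3, 2, 8, early_launch=True)
print(baseline, overlapped)
```

In this model the gain is bounded by the shorter of the predecessor's epilogue and the successor's prologue, which is consistent with the small speedups reported in the benchmark table.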

🎯 Justification
Speeds up (typically inference) benchmarks by overlapping independent phases of kernels.

🚀 Kind of Contribution
⚡️ Performance Improvement

📊 Benchmark (for Performance Improvements)
Measured on H100 using CUDA events:
| Benchmark                                           | Speedup | Error |
|-----------------------------------------------------|---------|-------|
|  gemma2_2b_keras_jax                                |   1.01x | 0.00x |
|  gemma3_1b_flax_call                                |   1.01x | 0.00x |
|  gemma3_1b_flax_sample_loop                         |   1.00x | 0.00x |
|  gemma3_4b_flax_call                                |   1.01x | 0.00x |
|  gemma3_4b_flax_sample_loop                         |   1.01x | 0.00x |
|  gemma3_12b_flax_call                               |   1.01x | 0.00x |
|  gemma3_12b_flax_sample_loop                        |   1.00x | 0.00x |
|  gpu_hlo                                            |   1.00x | 0.01x |
|  hlo_gemma4_2b_bf16                                 |   1.00x | 0.00x |
|  hlo_llama31_8b_bf16_1x8                            |   1.00x | 0.01x |
|  hlo_llama31_8b_fp8_1x8                             |   1.01x | 0.05x |
|  hlo_mixtral_8x7b_bf16_1x8                          |   1.00x | 0.08x |
|  nv_maxtext_1n1g_jit_train_step_before_optimization |   1.00x | 0.00x |
|  u4_all_gather_1x8                                  |   0.99x | 0.03x |

🧪 Unit Tests:
yes

🧪 Execution Tests:
no

Copybara import of the project:

--
711244dcbece02d56400c097adb5419f6856da52 by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Add PDL launch instruction insertion.

Merging this change closes #43894

PiperOrigin-RevId: 939804756
…one file

Pure refactoring step with no behavior or content modifications. This will make it easier to review further changes to the HTML content.

PiperOrigin-RevId: 939807152
PiperOrigin-RevId: 939821154
This will let Fusion Explorer expose more information about the choices made by PriorityFusion.

PiperOrigin-RevId: 939822742
Document the HLO parser support for desugaring suffix-based async
operations (e.g., dot-start, dot-update, dot-done) and the variadic
nature of async-update.
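
As a rough illustration of what suffix-based desugaring means, the sketch below splits a sugared opcode name into a generic async opcode and the wrapped operation. It is a hypothetical helper, not the HLO parser's code: the function name and mapping table are assumptions, and it does not handle opcodes that natively end in one of these suffixes, which a real parser would have to distinguish.

```python
# Hypothetical mapping from sugared suffix to the generic async opcode.
ASYNC_SUFFIXES = {
    "-start": "async-start",
    "-update": "async-update",
    "-done": "async-done",
}
GENERIC_ASYNC_OPCODES = set(ASYNC_SUFFIXES.values())


def desugar_async_opcode(opcode):
    """Return (generic_async_opcode, wrapped_opcode), or None if not sugared."""
    if opcode in GENERIC_ASYNC_OPCODES:
        return None  # already in the generic form
    for suffix, generic in ASYNC_SUFFIXES.items():
        if opcode.endswith(suffix) and len(opcode) > len(suffix):
            return generic, opcode[: -len(suffix)]
    return None


for name in ("dot-start", "dot-update", "dot-done", "dot"):
    print(name, "->", desugar_async_opcode(name))
```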

PiperOrigin-RevId: 939897472
Imported from GitHub PR openxla/xla#44389

📝 Summary of Changes
Enables float and buffer xor checker thunks on ROCm
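
For readers unfamiliar with these debugging checks, here is a minimal host-side sketch of the two ideas: an XOR checksum over a buffer and a scan for non-finite floats. It is not the kernel code from this change. The 32-bit little-endian word size, the zero padding of trailing bytes, and both function names are assumptions made for illustration.

```python
import math
import struct


def xor_checksum(buf: bytes) -> int:
    """XOR together all 32-bit little-endian words of `buf`.

    Assumption: trailing bytes are zero-padded to a full word.
    """
    padded = buf + b"\x00" * (-len(buf) % 4)
    checksum = 0
    for (word,) in struct.iter_unpack("<I", padded):
        checksum ^= word
    return checksum


def count_nonfinite(values):
    """Return (nan_count, inf_count) for an iterable of floats."""
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    return nan_count, inf_count


data = struct.pack("<II", 0xF0F0F0F0, 0x0F0F0F0F)
print(hex(xor_checksum(data)))
print(count_nonfinite([1.0, float("nan"), float("inf"), -float("inf")]))
```

Comparing the checksum of a buffer before and after a step is a cheap way to detect that its contents changed, while the non-finite count flags numerically suspicious outputs.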

🎯 Justification
Enables more debugging tools on ROCm platform

🚀 Kind of Contribution
✨ New Feature

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
None

🧪 Execution Tests:
Moved CUDA-specific tests into:
//xla/stream_executor/gpu:buffer_debug_float_check_kernel_test
//xla/stream_executor/gpu:buffer_debug_xor_checksum_kernel_test

Copybara import of the project:

--
275390bf545b29b177820195ec4a209d77310093 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] Enable float and buffer checker

Merging this change closes #44389

Manual patches:
- Rename `*_lib.cu.h` headers to `*_lib.cu.h.inc`. Those headers need to
  be included after defining a platform-specific definition of
  `kWarpSize` constant, and don't work without it. However, putting them
  in `srcs` makes some internal test builds attempt to compile the
  header by itself - which fails because it's not self-sufficient.
- Add missing `compatible_with = "//buildenv/target:non_prod"`.

PiperOrigin-RevId: 939902012
…lysis.

The include and build dependency for tsl/platform/errors are not used in hlo_dataflow_analysis.

PiperOrigin-RevId: 939932250
PiperOrigin-RevId: 939946290
@pull pull Bot locked and limited conversation to collaborators Jun 29, 2026
@pull pull Bot added the ⤵️ pull label Jun 29, 2026
@pull pull Bot merged commit 705879d into GesuBackups:master Jun 29, 2026

6 participants