Skip to content

[pull] master from tensorflow:master#1675

Merged
pull[bot] merged 25 commits into
GesuBackups:masterfrom
tensorflow:master
Mar 30, 2026
Merged

[pull] master from tensorflow:master#1675
pull[bot] merged 25 commits into
GesuBackups:masterfrom
tensorflow:master

Conversation

@pull

@pull pull Bot commented Mar 30, 2026

Copy link
Copy Markdown

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

vwbaker and others added 25 commits March 30, 2026 00:50
…of our goal to combine the two passes in order to start making fusion decisions based on the (semi)new tiling infrastructure.

PiperOrigin-RevId: 891555265
It is best practice to have the attribute on the declaration.

PiperOrigin-RevId: 891561612
…for XLA

Imported from GitHub PR openxla/xla#39497

📝 Summary of Changes
This PR adds skeleton code to register oneCCL oneAPI collectives for Intel GPUs, as a collective backend when XLA is built with SYCL.

🎯 Justification
This is needed to pass unit tests with the error "XLA compiled without GPU collectives support". Subsequent PRs will add full support for oneCCL collectives.

🚀 Kind of Contribution
✨ New Feature

Copybara import of the project:

--
076fec56774ba99dbc227220b420dd7a59e20f18 by nhatle <nhat.le@intel.com>:

Register oneCCL collectives

--
6b219341afdf095223ae04dcb97646edcea4857a by nhatle <nhat.le@intel.com>:

Match header imports

Merging this change closes #39497

PiperOrigin-RevId: 891575134
This mode allows concurrency in command buffer regions with latency-bound kernels that are too small to utilize the GPU well individually.

PiperOrigin-RevId: 891578204
PiperOrigin-RevId: 891583437
Imported from GitHub PR openxla/xla#39025

The previous implementation used a pointer mark bit to gain exclusive access during pop, which caused all concurrent push/pop operations to spin-wait — making it obstruction-free, not lock-free. Replace with a version counter packed into the upper 12 bits of the head pointer (52-bit VA, safe for both x86-64 and ARM64). Pop and push are now single-CAS operations where no thread can block another.

```
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
BM_GetOrCreate                                          8.37 ns         8.37 ns     83134582 items_per_second=119.519M/s
BM_GetOrCreateUnderContention/1/1000/process_time      19335 ns        25475 ns        27939 items_per_second=39.2536M/s
BM_GetOrCreateUnderContention/2/1000/process_time      33608 ns        67020 ns         9940 items_per_second=29.8416M/s
BM_GetOrCreateUnderContention/4/1000/process_time     251662 ns       781488 ns          829 items_per_second=5.11844M/s
BM_GetOrCreateUnderContention/8/1000/process_time    1025087 ns      6448060 ns          122 items_per_second=1.24068M/s
```
Copybara import of the project:

--
392de81997b1622557e36a4219fc0f7369e66a28 by Eugene Zhulenev <ezhulenev@openxla.org>:

[xla] Fil potentially expensive spin in ObjectPool

--
242b89e15ab4a1c848df6e460dbd1d2a1917e2c4 by Eugene Zhulenev <ezhulenev@openxla.org>:

Add tests for pointer tagging

Merging this change closes #39025

PiperOrigin-RevId: 891583663
PiperOrigin-RevId: 891584360
Imported from GitHub PR openxla/xla#39956

It is very confusing to read crash dumps when thunks execution order is reversed. Always print everything in chronological order.
Copybara import of the project:

--
84177ebd3e6753dedfc4ac12fdbea2819c0d70fd by Eugene Zhulenev <ezhulenev@openxla.org>:

[xla:gpu] Log thunk progress in chronological order

Merging this change closes #39956

PiperOrigin-RevId: 891586785
…or HIP command buffer

Imported from GitHub PR openxla/xla#39417

## 📋 Summary of Changes

- Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs.
- Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors.
- Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation.
- Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty.

## 🎯 Justification

When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs.

## 🚀 Kind of Contribution

🐛 Bug Fix, ✨ New Feature, ✏️ Tests

## ✏️ Unit Tests

Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`:
- `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch
- `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`)
- `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation)
- `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization

## ✏️ Execution Tests

Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X:

```
[==========] Running 4 tests from 1 test suite.
[ RUN      ] GpuCommandBufferTest.EmptyNodeOnly
[       OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeChain
[       OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeWithKernel
[       OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier
[       OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms)
[  PASSED  ] 4 tests.

Full suite: 8 passed, 0 failed, 35 skipped
```

Copybara import of the project:

--
97993f2bc987609f4a859ce1a516cabd54beb50d by scxfjiang <sc.xfjiang@gmail.com>:

fix collective crash

--
64aada3096869ec527b148b967029202c0998c9b by Pham Binh <phambinh@amd.com>:

Add empty graph safeguards and tests for HIP command buffer

- Add hipGraphGetRootNodes to the ROCm driver wrapper
- Add empty traced graph detection in RocmCommandBuffer::Trace() to
  prevent crashes when a captured stream produces no operations,
  matching the existing CUDA guard
- Implement PrepareFinalization to insert an empty node into empty body
  graphs, preventing potential issues when graph conditionals are
  supported on HIP in the future
- Add four new platform-agnostic test cases for CreateEmptyNode:
  EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and
  EmptyNodeAsDependencyBarrier

Merging this change closes #39417

PiperOrigin-RevId: 891593519
Imported from GitHub PR openxla/xla#39508

Some SYCL tests use a hardcoded spirv-binary when loading/running kernels. This PR replaces it with HLO based spirv-binary so that they are consistent with other SYCL tests (`sycl_executor_test` and `sycl_stream_test`).
Copybara import of the project:

--
9e66609eb534ae9e2457eb2a9e76438c90a52656 by Bhavani Subramanian <bhavani1.subramanian@intel.com>:

Replace hardcoded spirv-binary with HLO in 'stream_executor/sycl' tests

--
8d9586a6243cdef75c1a864cc305df10473bf1a4 by Bhavani Subramanian <bhavani1.subramanian@intel.com>:

Clean up deps in sycl/BUILD.

--
aca283a658037bc7fa02733f9bf0d91e4867e250 by Bhavani Subramanian <bhavani1.subramanian@intel.com>:

Clean up more deps in sycl/BUILD

Merging this change closes #39508

PiperOrigin-RevId: 891593787
…ed code paths

Imported from GitHub PR openxla/xla#39931

## Summary
- Add test for all-gather with no degenerate dims (no-op case)
- Add test for only minor degenerate dims (after gather dim)
- Add test for multiple major degenerate dims (before gather dim)
- Add test for gather dim at index 0 with minor degenerate dims

These cover previously untested code paths in `all_gather_remove_degenerate_dims.cc`.

## Test plan
- [x] All 10 tests pass (`bazel test //xla/hlo/transforms/collectives:all_gather_remove_degenerate_dims_test`)
- [x] Tests are hardware-independent (use `HloHardwareIndependentTestBase`)
- [x] Follow existing test patterns and Google C++ style
Copybara import of the project:

--
3fd46a203045f3bff5d46db8a19d582f57fb7ee9 by Manish Reddy <kreddy.manish@gmail.com>:

Add test coverage for AllGatherRemoveDegenerateDims untested code paths.

Merging this change closes #39931

PiperOrigin-RevId: 891593797
PiperOrigin-RevId: 891601145
And add an explicit check for input shape validity in
CudnnSupport::CudnnReorderConvolutionFilterAndBias.

In `cudnnReorderFilterAndBias` [1], `filterDesc` argument describes the
shapes of both filter and bias, with the bias vector size being assumed
to be equal to first dimension of the filter descriptor (number of
output channels).

Before this CL, the test runs with varying sizes of bias (32/64/96), but
fixed filter with 32 output channels. This makes
`cudnnReorderFilterAndBias` only initialize the first 32 elements of its
output, which by some miracle was good enough for the test to pass in a
lot of cases.

[1] https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-cnn-library.html#cudnnreorderfilterandbias

PiperOrigin-RevId: 891613801
Because the copies added on unflattenner is not compatible to GSPMD. [1]

The keepHloShardingConstraints option is true only for this fallback. In the regular JAX to Shardy path this option is false. Hence it keeps unflattenner for the regular path.

PiperOrigin-RevId: 891638360
…t comparison

Imported from GitHub PR openxla/xla#39950

📝 Summary of Changes

- Fix autotuner workspace buffer comparison: restore extracting only the first tuple element from profiler output, excluding the workspace scratch buffer whose contents are non-deterministic across algorithm runs. Without this, all hipblasLt autotuning candidates are rejected as "WRONG RESULTS". (Regressed by  698df50c9e)
 - Add kI32 and kF64 computation types to AsHipblasComputeType() in hip_blas_utils.cc, matching the CUDA equivalent AsCublasComputeType(). Previously these hit LOG(FATAL), crashing on Int8 or F64 GEMMs routed through hipblasLt by the new autotuner.
 - Add Int8 TYPED_MATMUL entries (HIP_R_8I/HIP_R_32I) to the hipblasLt matmul execution dispatch in hip_blas_lt.cc, matching cuda_blas_lt.cc. Without these, Int8 GEMMs failed at execution time with "Unexpected dtype".

- //xla/backends/gpu/transforms:gemm_rewriter_test_amdgpu_any

Copybara import of the project:

--
8d6ab8ebfffc033d185d5d26134de8427f567845 by cj401-amd <chunyjin@amd.com>:

update for gemm_rewriter_test_amdgpu

--
b53c25d59acb937d295796f56c0bd2fe945187ab by cj401-amd <chunyjin@amd.com>:

update for gemm_rewriter_test

Merging this change closes #39950

PiperOrigin-RevId: 891640715
…Collection

Fusion analysis cache and indexing cost analysis use non-owning references to the device info. The cost model stats pass uses a copy, so its members should use this copy instead of a reference passed to the pass ctor.

PiperOrigin-RevId: 891640955
This new class tests with both values of `xla_allow_excess_precision` flag (true and false).

PiperOrigin-RevId: 891642150
…orms.

It is only used on GPU. Also only run the test on GPU. Given that the pass is
not added to the CPU pipeline, this will not provide any testing of the pass
itself.

PiperOrigin-RevId: 891651911
This is a naive implementation that simply calculates stats for dots separately and adds them into the totals.

PiperOrigin-RevId: 891655605
The "TFRT" is vestigial and confusing. The new string matches how we report GPUs (e.g., cuda:0).

PiperOrigin-RevId: 891681679
@pull pull Bot locked and limited conversation to collaborators Mar 30, 2026
@pull pull Bot added the ⤵️ pull label Mar 30, 2026
@pull pull Bot merged commit e62b00d into GesuBackups:master Mar 30, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.