[pull] master from tensorflow:master by pull[bot] · Pull Request #1675 · GesuBackups/tensorflow

pull · 2026-03-30T13:29:16Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

…of our goal to combine the two passes in order to start making fusion decisions based on the (semi)new tiling infrastructure. PiperOrigin-RevId: 891555265

…Lit Tests. PiperOrigin-RevId: 891555658

It is best practice to have the attribute on the declaration. PiperOrigin-RevId: 891561612

Reverts 2dd9641 PiperOrigin-RevId: 891564537

…for XLA Imported from GitHub PR openxla/xla#39497 📝 Summary of Changes This PR adds skeleton code to register oneCCL oneAPI collectives for Intel GPUs, as a collective backend when XLA is built with SYCL. 🎯 Justification This is needed to pass unit tests with the error "XLA compiled without GPU collectives support". Subsequent PRs will add full support for oneCCL collectives. 🚀 Kind of Contribution ✨ New Feature Copybara import of the project: -- 076fec56774ba99dbc227220b420dd7a59e20f18 by nhatle <nhat.le@intel.com>: Register oneCCL collectives -- 6b219341afdf095223ae04dcb97646edcea4857a by nhatle <nhat.le@intel.com>: Match header imports Merging this change closes #39497 PiperOrigin-RevId: 891575134

This mode allows concurrency in command buffer regions with latency-bound kernels that are too small to utilize the GPU well individually. PiperOrigin-RevId: 891578204

PiperOrigin-RevId: 891583437

Imported from GitHub PR openxla/xla#39025 The previous implementation used a pointer mark bit to gain exclusive access during pop, which caused all concurrent push/pop operations to spin-wait — making it obstruction-free, not lock-free. Replace with a version counter packed into the upper 12 bits of the head pointer (52-bit VA, safe for both x86-64 and ARM64). Pop and push are now single-CAS operations where no thread can block another. ``` ------------------------------------------------------------------------------------------------------------ Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------------ BM_GetOrCreate 8.37 ns 8.37 ns 83134582 items_per_second=119.519M/s BM_GetOrCreateUnderContention/1/1000/process_time 19335 ns 25475 ns 27939 items_per_second=39.2536M/s BM_GetOrCreateUnderContention/2/1000/process_time 33608 ns 67020 ns 9940 items_per_second=29.8416M/s BM_GetOrCreateUnderContention/4/1000/process_time 251662 ns 781488 ns 829 items_per_second=5.11844M/s BM_GetOrCreateUnderContention/8/1000/process_time 1025087 ns 6448060 ns 122 items_per_second=1.24068M/s ``` Copybara import of the project: -- 392de81997b1622557e36a4219fc0f7369e66a28 by Eugene Zhulenev <ezhulenev@openxla.org>: [xla] Fil potentially expensive spin in ObjectPool -- 242b89e15ab4a1c848df6e460dbd1d2a1917e2c4 by Eugene Zhulenev <ezhulenev@openxla.org>: Add tests for pointer tagging Merging this change closes #39025 PiperOrigin-RevId: 891583663

PiperOrigin-RevId: 891584360

Imported from GitHub PR openxla/xla#39956 It is very confusing to read crash dumps when thunks execution order is reversed. Always print everything in chronological order. Copybara import of the project: -- 84177ebd3e6753dedfc4ac12fdbea2819c0d70fd by Eugene Zhulenev <ezhulenev@openxla.org>: [xla:gpu] Log thunk progress in chronological order Merging this change closes #39956 PiperOrigin-RevId: 891586785

…or HIP command buffer Imported from GitHub PR openxla/xla#39417 ## 📋 Summary of Changes - Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs. - Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors. - Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation. - Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty. ## 🎯 Justification When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs. ## 🚀 Kind of Contribution 🐛 Bug Fix, ✨ New Feature, ✏️ Tests ## ✏️ Unit Tests Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`: - `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch - `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`) - `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation) - `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization ## ✏️ Execution Tests Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X: ``` [==========] Running 4 tests from 1 test suite. [ RUN ] GpuCommandBufferTest.EmptyNodeOnly [ OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeChain [ OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeWithKernel [ OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier [ OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms) [ PASSED ] 4 tests. Full suite: 8 passed, 0 failed, 35 skipped ``` Copybara import of the project: -- 97993f2bc987609f4a859ce1a516cabd54beb50d by scxfjiang <sc.xfjiang@gmail.com>: fix collective crash -- 64aada3096869ec527b148b967029202c0998c9b by Pham Binh <phambinh@amd.com>: Add empty graph safeguards and tests for HIP command buffer - Add hipGraphGetRootNodes to the ROCm driver wrapper - Add empty traced graph detection in RocmCommandBuffer::Trace() to prevent crashes when a captured stream produces no operations, matching the existing CUDA guard - Implement PrepareFinalization to insert an empty node into empty body graphs, preventing potential issues when graph conditionals are supported on HIP in the future - Add four new platform-agnostic test cases for CreateEmptyNode: EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and EmptyNodeAsDependencyBarrier Merging this change closes #39417 PiperOrigin-RevId: 891593519

Imported from GitHub PR openxla/xla#39508 Some SYCL tests use a hardcoded spirv-binary when loading/running kernels. This PR replaces it with HLO based spirv-binary so that they are consistent with other SYCL tests (`sycl_executor_test` and `sycl_stream_test`). Copybara import of the project: -- 9e66609eb534ae9e2457eb2a9e76438c90a52656 by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Replace hardcoded spirv-binary with HLO in 'stream_executor/sycl' tests -- 8d9586a6243cdef75c1a864cc305df10473bf1a4 by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Clean up deps in sycl/BUILD. -- aca283a658037bc7fa02733f9bf0d91e4867e250 by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Clean up more deps in sycl/BUILD Merging this change closes #39508 PiperOrigin-RevId: 891593787

…ed code paths Imported from GitHub PR openxla/xla#39931 ## Summary - Add test for all-gather with no degenerate dims (no-op case) - Add test for only minor degenerate dims (after gather dim) - Add test for multiple major degenerate dims (before gather dim) - Add test for gather dim at index 0 with minor degenerate dims These cover previously untested code paths in `all_gather_remove_degenerate_dims.cc`. ## Test plan - [x] All 10 tests pass (`bazel test //xla/hlo/transforms/collectives:all_gather_remove_degenerate_dims_test`) - [x] Tests are hardware-independent (use `HloHardwareIndependentTestBase`) - [x] Follow existing test patterns and Google C++ style Copybara import of the project: -- 3fd46a203045f3bff5d46db8a19d582f57fb7ee9 by Manish Reddy <kreddy.manish@gmail.com>: Add test coverage for AllGatherRemoveDegenerateDims untested code paths. Merging this change closes #39931 PiperOrigin-RevId: 891593797

PiperOrigin-RevId: 891601145

PiperOrigin-RevId: 891609064

And add an explicit check for input shape validity in CudnnSupport::CudnnReorderConvolutionFilterAndBias. In `cudnnReorderFilterAndBias` [1], `filterDesc` argument describes the shapes of both filter and bias, with the bias vector size being assumed to be equal to first dimension of the filter descriptor (number of output channels). Before this CL, the test runs with varying sizes of bias (32/64/96), but fixed filter with 32 output channels. This makes `cudnnReorderFilterAndBias` only initialize the first 32 elements of its output, which by some miracle was good enough for the test to pass in a lot of cases. [1] https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-cnn-library.html#cudnnreorderfilterandbias PiperOrigin-RevId: 891613801

Because the copies added on unflattenner is not compatible to GSPMD. [1] The keepHloShardingConstraints option is true only for this fallback. In the regular JAX to Shardy path this option is false. Hence it keeps unflattenner for the regular path. PiperOrigin-RevId: 891638360

…t comparison Imported from GitHub PR openxla/xla#39950 📝 Summary of Changes - Fix autotuner workspace buffer comparison: restore extracting only the first tuple element from profiler output, excluding the workspace scratch buffer whose contents are non-deterministic across algorithm runs. Without this, all hipblasLt autotuning candidates are rejected as "WRONG RESULTS". (Regressed by 698df50c9e) - Add kI32 and kF64 computation types to AsHipblasComputeType() in hip_blas_utils.cc, matching the CUDA equivalent AsCublasComputeType(). Previously these hit LOG(FATAL), crashing on Int8 or F64 GEMMs routed through hipblasLt by the new autotuner. - Add Int8 TYPED_MATMUL entries (HIP_R_8I/HIP_R_32I) to the hipblasLt matmul execution dispatch in hip_blas_lt.cc, matching cuda_blas_lt.cc. Without these, Int8 GEMMs failed at execution time with "Unexpected dtype". - //xla/backends/gpu/transforms:gemm_rewriter_test_amdgpu_any Copybara import of the project: -- 8d6ab8ebfffc033d185d5d26134de8427f567845 by cj401-amd <chunyjin@amd.com>: update for gemm_rewriter_test_amdgpu -- b53c25d59acb937d295796f56c0bd2fe945187ab by cj401-amd <chunyjin@amd.com>: update for gemm_rewriter_test Merging this change closes #39950 PiperOrigin-RevId: 891640715

…Collection Fusion analysis cache and indexing cost analysis use non-owning references to the device info. The cost model stats pass uses a copy, so its members should use this copy instead of a reference passed to the pass ctor. PiperOrigin-RevId: 891640955

PiperOrigin-RevId: 891640988

This new class tests with both values of `xla_allow_excess_precision` flag (true and false). PiperOrigin-RevId: 891642150

…orms. It is only used on GPU. Also only run the test on GPU. Given that the pass is not added to the CPU pipeline, this will not provide any testing of the pass itself. PiperOrigin-RevId: 891651911

This is a naive implementation that simply calculates stats for dots separately and adds them into the totals. PiperOrigin-RevId: 891655605

PiperOrigin-RevId: 891661338

The "TFRT" is vestigial and confusing. The new string matches how we report GPUs (e.g., cuda:0). PiperOrigin-RevId: 891681679

vwbaker and others added 25 commits March 30, 2026 00:50

Hoist HoistFusedBitcasts above GemmFusionSwapOperands - this is part …

fa536d6

…of our goal to combine the two passes in order to start making fusion decisions based on the (semi)new tiling infrastructure. PiperOrigin-RevId: 891555265

[XLA:GPU] Migrate TmaAndLayoutParameterizedTritonEmitterTestSuite to …

4011747

…Lit Tests. PiperOrigin-RevId: 891555658

Move [[nodiscard]] attribute from definition to declaration.

5a9c1fa

It is best practice to have the attribute on the declaration. PiperOrigin-RevId: 891561612

[XLA:GPU]: Reapply flag flip for one shot all-reduce after fixes

7baf72c

Reverts 2dd9641 PiperOrigin-RevId: 891564537

[XLA:GPU] Add command buffer scheduling mode CONCURRENT_REGIONS.

56ae1db

This mode allows concurrency in command buffer regions with latency-bound kernels that are too small to utilize the GPU well individually. PiperOrigin-RevId: 891578204

Automated Code Change

df778d5

PiperOrigin-RevId: 891583437

Automated Code Change

4378602

PiperOrigin-RevId: 891584360

Automated Code Change

6741fb7

PiperOrigin-RevId: 891601145

[XLA:GPU] Add support for constants via new tiling.

afe47aa

PiperOrigin-RevId: 891609064

[XLA:GPU] Add transposes via the new tiling.

0ff3f3f

PiperOrigin-RevId: 891640988

[xla] Add FloatNormalizationExcessPrecisionTest

eb81b89

This new class tests with both values of `xla_allow_excess_precision` flag (true and false). PiperOrigin-RevId: 891642150

[XLA:GPU] Move ScatterDeterminismExpander pass to backends/gpu/transf…

c662557

…orms. It is only used on GPU. Also only run the test on GPU. Given that the pass is not added to the CPU pipeline, this will not provide any testing of the pass itself. PiperOrigin-RevId: 891651911

[XLA:GPU] Reroute dots to a detailed cost model in the indexing model

980b99c

This is a naive implementation that simply calculates stats for dots separately and adds them into the totals. PiperOrigin-RevId: 891655605

[XLA:GPU][NFC] Pad the tile sizes before the tiling propagation.

c1b8d76

PiperOrigin-RevId: 891661338

Make CPU devices report as cpu:0 instead of TFRT_CPU_0.

e62b00d

The "TFRT" is vestigial and confusing. The new string matches how we report GPUs (e.g., cuda:0). PiperOrigin-RevId: 891681679

pull Bot locked and limited conversation to collaborators Mar 30, 2026

pull Bot added the ⤵️ pull label Mar 30, 2026

pull Bot merged commit e62b00d into GesuBackups:master Mar 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[pull] master from tensorflow:master#1675

[pull] master from tensorflow:master#1675
pull[bot] merged 25 commits into
GesuBackups:masterfrom
tensorflow:master

pull Bot commented Mar 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

16 participants

Uh oh!

Conversation

pull Bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

16 participants

pull Bot commented Mar 30, 2026 •

edited

Loading