[pull] master from tensorflow:master#1675
Merged
Merged
Conversation
…of our goal to combine the two passes in order to start making fusion decisions based on the (semi)new tiling infrastructure. PiperOrigin-RevId: 891555265
…Lit Tests. PiperOrigin-RevId: 891555658
It is best practice to have the attribute on the declaration. PiperOrigin-RevId: 891561612
Reverts 2dd9641 PiperOrigin-RevId: 891564537
…for XLA Imported from GitHub PR openxla/xla#39497 📝 Summary of Changes This PR adds skeleton code to register oneCCL oneAPI collectives for Intel GPUs, as a collective backend when XLA is built with SYCL. 🎯 Justification This is needed to pass unit tests with the error "XLA compiled without GPU collectives support". Subsequent PRs will add full support for oneCCL collectives. 🚀 Kind of Contribution ✨ New Feature Copybara import of the project: -- 076fec56774ba99dbc227220b420dd7a59e20f18 by nhatle <nhat.le@intel.com>: Register oneCCL collectives -- 6b219341afdf095223ae04dcb97646edcea4857a by nhatle <nhat.le@intel.com>: Match header imports Merging this change closes #39497 PiperOrigin-RevId: 891575134
This mode allows concurrency in command buffer regions with latency-bound kernels that are too small to utilize the GPU well individually. PiperOrigin-RevId: 891578204
PiperOrigin-RevId: 891583437
Imported from GitHub PR openxla/xla#39025 The previous implementation used a pointer mark bit to gain exclusive access during pop, which caused all concurrent push/pop operations to spin-wait — making it obstruction-free, not lock-free. Replace with a version counter packed into the upper 12 bits of the head pointer (52-bit VA, safe for both x86-64 and ARM64). Pop and push are now single-CAS operations where no thread can block another. ``` ------------------------------------------------------------------------------------------------------------ Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------------ BM_GetOrCreate 8.37 ns 8.37 ns 83134582 items_per_second=119.519M/s BM_GetOrCreateUnderContention/1/1000/process_time 19335 ns 25475 ns 27939 items_per_second=39.2536M/s BM_GetOrCreateUnderContention/2/1000/process_time 33608 ns 67020 ns 9940 items_per_second=29.8416M/s BM_GetOrCreateUnderContention/4/1000/process_time 251662 ns 781488 ns 829 items_per_second=5.11844M/s BM_GetOrCreateUnderContention/8/1000/process_time 1025087 ns 6448060 ns 122 items_per_second=1.24068M/s ``` Copybara import of the project: -- 392de81997b1622557e36a4219fc0f7369e66a28 by Eugene Zhulenev <ezhulenev@openxla.org>: [xla] Fil potentially expensive spin in ObjectPool -- 242b89e15ab4a1c848df6e460dbd1d2a1917e2c4 by Eugene Zhulenev <ezhulenev@openxla.org>: Add tests for pointer tagging Merging this change closes #39025 PiperOrigin-RevId: 891583663
PiperOrigin-RevId: 891584360
Imported from GitHub PR openxla/xla#39956 It is very confusing to read crash dumps when thunks execution order is reversed. Always print everything in chronological order. Copybara import of the project: -- 84177ebd3e6753dedfc4ac12fdbea2819c0d70fd by Eugene Zhulenev <ezhulenev@openxla.org>: [xla:gpu] Log thunk progress in chronological order Merging this change closes #39956 PiperOrigin-RevId: 891586785
…or HIP command buffer Imported from GitHub PR openxla/xla#39417 ## 📋 Summary of Changes - Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs. - Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors. - Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation. - Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty. ## 🎯 Justification When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs. ## 🚀 Kind of Contribution 🐛 Bug Fix, ✨ New Feature, ✏️ Tests ## ✏️ Unit Tests Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`: - `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch - `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`) - `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation) - `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization ## ✏️ Execution Tests Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X: ``` [==========] Running 4 tests from 1 test suite. [ RUN ] GpuCommandBufferTest.EmptyNodeOnly [ OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeChain [ OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeWithKernel [ OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier [ OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms) [ PASSED ] 4 tests. Full suite: 8 passed, 0 failed, 35 skipped ``` Copybara import of the project: -- 97993f2bc987609f4a859ce1a516cabd54beb50d by scxfjiang <sc.xfjiang@gmail.com>: fix collective crash -- 64aada3096869ec527b148b967029202c0998c9b by Pham Binh <phambinh@amd.com>: Add empty graph safeguards and tests for HIP command buffer - Add hipGraphGetRootNodes to the ROCm driver wrapper - Add empty traced graph detection in RocmCommandBuffer::Trace() to prevent crashes when a captured stream produces no operations, matching the existing CUDA guard - Implement PrepareFinalization to insert an empty node into empty body graphs, preventing potential issues when graph conditionals are supported on HIP in the future - Add four new platform-agnostic test cases for CreateEmptyNode: EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and EmptyNodeAsDependencyBarrier Merging this change closes #39417 PiperOrigin-RevId: 891593519
Imported from GitHub PR openxla/xla#39508 Some SYCL tests use a hardcoded spirv-binary when loading/running kernels. This PR replaces it with HLO based spirv-binary so that they are consistent with other SYCL tests (`sycl_executor_test` and `sycl_stream_test`). Copybara import of the project: -- 9e66609eb534ae9e2457eb2a9e76438c90a52656 by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Replace hardcoded spirv-binary with HLO in 'stream_executor/sycl' tests -- 8d9586a6243cdef75c1a864cc305df10473bf1a4 by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Clean up deps in sycl/BUILD. -- aca283a658037bc7fa02733f9bf0d91e4867e250 by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Clean up more deps in sycl/BUILD Merging this change closes #39508 PiperOrigin-RevId: 891593787
…ed code paths Imported from GitHub PR openxla/xla#39931 ## Summary - Add test for all-gather with no degenerate dims (no-op case) - Add test for only minor degenerate dims (after gather dim) - Add test for multiple major degenerate dims (before gather dim) - Add test for gather dim at index 0 with minor degenerate dims These cover previously untested code paths in `all_gather_remove_degenerate_dims.cc`. ## Test plan - [x] All 10 tests pass (`bazel test //xla/hlo/transforms/collectives:all_gather_remove_degenerate_dims_test`) - [x] Tests are hardware-independent (use `HloHardwareIndependentTestBase`) - [x] Follow existing test patterns and Google C++ style Copybara import of the project: -- 3fd46a203045f3bff5d46db8a19d582f57fb7ee9 by Manish Reddy <kreddy.manish@gmail.com>: Add test coverage for AllGatherRemoveDegenerateDims untested code paths. Merging this change closes #39931 PiperOrigin-RevId: 891593797
PiperOrigin-RevId: 891601145
PiperOrigin-RevId: 891609064
And add an explicit check for input shape validity in CudnnSupport::CudnnReorderConvolutionFilterAndBias. In `cudnnReorderFilterAndBias` [1], `filterDesc` argument describes the shapes of both filter and bias, with the bias vector size being assumed to be equal to first dimension of the filter descriptor (number of output channels). Before this CL, the test runs with varying sizes of bias (32/64/96), but fixed filter with 32 output channels. This makes `cudnnReorderFilterAndBias` only initialize the first 32 elements of its output, which by some miracle was good enough for the test to pass in a lot of cases. [1] https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-cnn-library.html#cudnnreorderfilterandbias PiperOrigin-RevId: 891613801
Because the copies added on unflattenner is not compatible to GSPMD. [1] The keepHloShardingConstraints option is true only for this fallback. In the regular JAX to Shardy path this option is false. Hence it keeps unflattenner for the regular path. PiperOrigin-RevId: 891638360
…t comparison Imported from GitHub PR openxla/xla#39950 📝 Summary of Changes - Fix autotuner workspace buffer comparison: restore extracting only the first tuple element from profiler output, excluding the workspace scratch buffer whose contents are non-deterministic across algorithm runs. Without this, all hipblasLt autotuning candidates are rejected as "WRONG RESULTS". (Regressed by 698df50c9e) - Add kI32 and kF64 computation types to AsHipblasComputeType() in hip_blas_utils.cc, matching the CUDA equivalent AsCublasComputeType(). Previously these hit LOG(FATAL), crashing on Int8 or F64 GEMMs routed through hipblasLt by the new autotuner. - Add Int8 TYPED_MATMUL entries (HIP_R_8I/HIP_R_32I) to the hipblasLt matmul execution dispatch in hip_blas_lt.cc, matching cuda_blas_lt.cc. Without these, Int8 GEMMs failed at execution time with "Unexpected dtype". - //xla/backends/gpu/transforms:gemm_rewriter_test_amdgpu_any Copybara import of the project: -- 8d6ab8ebfffc033d185d5d26134de8427f567845 by cj401-amd <chunyjin@amd.com>: update for gemm_rewriter_test_amdgpu -- b53c25d59acb937d295796f56c0bd2fe945187ab by cj401-amd <chunyjin@amd.com>: update for gemm_rewriter_test Merging this change closes #39950 PiperOrigin-RevId: 891640715
…Collection Fusion analysis cache and indexing cost analysis use non-owning references to the device info. The cost model stats pass uses a copy, so its members should use this copy instead of a reference passed to the pass ctor. PiperOrigin-RevId: 891640955
PiperOrigin-RevId: 891640988
This new class tests with both values of `xla_allow_excess_precision` flag (true and false). PiperOrigin-RevId: 891642150
…orms. It is only used on GPU. Also only run the test on GPU. Given that the pass is not added to the CPU pipeline, this will not provide any testing of the pass itself. PiperOrigin-RevId: 891651911
This is a naive implementation that simply calculates stats for dots separately and adds them into the totals. PiperOrigin-RevId: 891655605
PiperOrigin-RevId: 891661338
The "TFRT" is vestigial and confusing. The new string matches how we report GPUs (e.g., cuda:0). PiperOrigin-RevId: 891681679
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )