[pull] master from tensorflow:master#1683
Merged
PiperOrigin-RevId: 892765437
This makes sure that instances of GpuTopology are always created with a target config attached in StreamExecutorGpuPjRtClient. It also makes sure that these fields are transferred through the C API layer via PjRt attributes. This change is a requirement for the unified compilation path of AOT and JIT compilation. PiperOrigin-RevId: 892773575
PiperOrigin-RevId: 892776705
…odule execution on devices is completed. PiperOrigin-RevId: 892786341
…L memfill/memset tests Imported from GitHub PR openxla/xla#40180 This PR contains the following changes: 1. Redundant context activation code is removed since SYCL contexts do not require explicit activation. 2. SYCL memfill and memset functions (both sync and async versions) expect count (i.e. number of elements) instead of bytes when filling device memory. The corresponding tests have been updated. Copybara import of the project: -- 640db293e76b5b0af3d13e655282394605d42d08 by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Use count instead of bytes in SyclMemsetDevice(Async) and SyclMemfillDevice(Async) tests -- 4148115bf39d310722da5de4c2185af174c9d08d by Bhavani Subramanian <bhavani1.subramanian@intel.com>: Remove context activation from 'stream_executor/sycl' code since it is not used Merging this change closes #40180 PiperOrigin-RevId: 892786832
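The count-versus-bytes distinction in item 2 can be illustrated with a minimal sketch. It does not use the SYCL StreamExecutor API; `bytes_to_count` and `memfill32` are hypothetical stand-ins that only show why a caller holding a byte size must convert it to an element count before calling a count-based fill.

```python
import struct


def bytes_to_count(size_bytes, element_size):
    """Convert a byte size into the element count a count-based fill expects."""
    if element_size <= 0:
        raise ValueError("element_size must be positive")
    if size_bytes % element_size != 0:
        raise ValueError("size_bytes is not a multiple of element_size")
    return size_bytes // element_size


def memfill32(buffer, pattern, count):
    """Hypothetical count-based fill: writes `count` 32-bit elements, not bytes."""
    if count * 4 > len(buffer):
        raise ValueError("count exceeds buffer capacity")
    buffer[: count * 4] = struct.pack("<I", pattern) * count


# A 16-byte buffer holds four 32-bit elements, so the fill is called with 4.
buf = bytearray(16)
memfill32(buf, 0xDEADBEEF, bytes_to_count(len(buf), 4))
```

Passing the byte size (16) as the count here would attempt to write 64 bytes, which is the kind of mismatch the updated tests guard against.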
gpu_device_info_test is mainly testing whether the embedded target configs are up to date, therefore the test should live with the target config logic. PiperOrigin-RevId: 892787083
…yout calculation. The scan rewriter now correctly identifies zero initial values even when they are wrapped in a broadcast. Additionally, the logic for calculating `vector_length` and `column_length` based on the minor-to-major layout has been corrected. PiperOrigin-RevId: 892813374
…source. Imported from GitHub PR openxla/xla#33269 📝 Summary of Changes - Adding a knob to control the limit on the async-compute resource. This switch gives flexible control, enabling more asynchronous computations to execute concurrently. In host-offloading experiments, increasing this value effectively overlaps device-to-host (D2H) transfers with other computations, resulting in improved performance. 🎯 Justification This allows users to control how many async computations can be in flight in LHS. 🚀 Kind of Contribution ✨ New Feature Copybara import of the project: -- b84bbf195006686ee54b0e5ed791f8d8d024d1eb by Ming Huang <mingh@nvidia.com>: Add a flag to control the limitation of compute resource. -- 308f3be1073dc00e9df54a16718b82bbc9361e45 by Ming Huang <mingh@nvidia.com>: Adding a comment to xla_gpu_experimental_parallel_async_compute_limit Merging this change closes #33269 PiperOrigin-RevId: 892814843
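A minimal sketch of how the new flag might be set from Python follows. It assumes the flag is exposed through the `XLA_FLAGS` environment variable like other XLA debug options and that it takes an integer value; the helper `with_async_compute_limit` and the value `2` are illustrative only, and the variable has to be set before XLA is initialized.

```python
import os

# Flag name taken from the commit message; the "--name=value" form is an assumption.
FLAG = "--xla_gpu_experimental_parallel_async_compute_limit"


def with_async_compute_limit(limit, existing=""):
    """Return an XLA_FLAGS string with the async-compute limit set exactly once."""
    kept = [f for f in existing.split() if not f.startswith(FLAG + "=")]
    return " ".join(kept + [f"{FLAG}={limit}"])


# Preserve any flags already present and override only the limit.
os.environ["XLA_FLAGS"] = with_async_compute_limit(
    2, os.environ.get("XLA_FLAGS", "")
)
```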
PiperOrigin-RevId: 892818671
…args Imported from GitHub PR openxla/xla#39601 Instead of hardcoding the NCCL device comm size when passing it to device kernels, add support for dynamically sized storage for packed arguments to the SE Kernel API. The NCCL device comm size keeps changing from release to release, and instead of breaking XLA on every update it should be dynamic. Copybara import of the project: -- 227faa25ba5b670f22fd2f6ed67041eba66d1e69 by Eugene Zhulenev <ezhulenev@openxla.org>: [xla:gpu] Add support for dynamically sized packed kernel args Merging this change closes #39601 PiperOrigin-RevId: 892823919
PiperOrigin-RevId: 892829493
In this part: stop using the xla utility from xla shardy and use the one from mlir shardy instead. Pure refactoring. This is part of preparing to move the in/outliner into shardy, that is, from xla::sdy into mlir::sdy. Because xla::sdy and mlir::sdy are separately managed repos, this move cannot be done in one go. PiperOrigin-RevId: 892829743
Imported from GitHub PR openxla/xla#39854 📝 Summary of Changes Use hermetic llvm dependency to compile xla under rocm 🎯 Justification Get rid of the dependency on the installed llvm; instead use hermetic llvm and the clang provided with it to compile xla 🚀 Kind of Contribution Please remove what does not apply: ✨ New Feature 📊 Benchmark (for Performance Improvements) Not relevant 🧪 Unit Tests: CI checks 🧪 Execution Tests: CI checks Copybara import of the project: -- 70595dbf5a8ee623ac581bc10a4a57af36402336 by Alexandros Theodoridis <atheodor@amd.com>: Use hermetic clang for rocm -- fc313ba69715587709493014a564fcc0f8f4975f by Alexandros Theodoridis <atheodor@amd.com>: Fix jax build -- 5ab1657b439f92d226e2cbe491145f72fbf0b9a8 by Alexandros Theodoridis <atheodor@amd.com>: Trigger CI/CD pipeline -- 188acde374789e7bfefd31951acbc3574fa9ec24 by Alexandros Theodoridis <atheodor@amd.com>: Switch to use ml toolchain repo for local_clang Merging this change closes #39854 PiperOrigin-RevId: 892840096
Codegen checks are currently spread across 3 files. This change unifies them in one place as much as possible. PiperOrigin-RevId: 892845059
…fer scheduling mode `CONCURRENT_REGIONS`. PiperOrigin-RevId: 892857796
This is needed to correctly emit ops with the runtime variables, e.g. dynamic slice. If we have dyn-slice(parameter) and we want to emit the load from the parameter, then the RT variable should have been emitted before, because it is a part of the offset computation for the load. PiperOrigin-RevId: 892860392
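The ordering requirement described above can be sketched as a dependency-first traversal. This is not the XLA emitter; `emission_order` and the op names (`rt_offset`, `load_param`, `dyn_slice`) are hypothetical and only illustrate that a runtime variable feeding an offset computation has to be emitted before the load that uses it.

```python
def emission_order(ops):
    """Return op names so that every operand is emitted before its user.

    `ops` maps each op name to the list of op names it depends on.
    """
    order, done = [], set()

    def visit(name, stack=()):
        if name in done:
            return
        if name in stack:
            raise ValueError(f"dependency cycle through {name}")
        for operand in ops[name]:
            visit(operand, stack + (name,))
        done.add(name)
        order.append(name)

    for name in ops:
        visit(name)
    return order


# The load from the parameter needs the runtime offset, and the dynamic
# slice consumes the load, so the runtime variable has to come first.
ops = {
    "load_param": ["rt_offset"],
    "rt_offset": [],
    "dyn_slice": ["load_param"],
}
order = emission_order(ops)
```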
PiperOrigin-RevId: 892863362
…ided. PiperOrigin-RevId: 892867437
PiperOrigin-RevId: 892871265
Imported from GitHub PR openxla/xla#39725 There is no need to create separate hang watchdogs for different parts of XLA; a single per-process instance is enough for most use cases. Copybara import of the project: -- 8109d40d86c137bb54566317c8c187e025be2ddc by Eugene Zhulenev <ezhulenev@openxla.org>: [xla] Add global per-process hang watchdog Merging this change closes #39725 PiperOrigin-RevId: 892873909
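The idea of one shared watchdog per process can be sketched as a lazily created singleton. The XLA implementation is C++ and is not shown here; `HangWatchdog`, `global_watchdog`, `watch`, and `done` are hypothetical names used only to illustrate the pattern.

```python
import threading


class HangWatchdog:
    """Runs a callback if a named operation is not marked done in time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._timers = {}

    def watch(self, name, timeout_s, on_hang):
        timer = threading.Timer(timeout_s, on_hang, args=(name,))
        timer.daemon = True
        with self._lock:
            self._timers[name] = timer
        timer.start()

    def done(self, name):
        # Cancel the pending timer so the hang callback never fires.
        with self._lock:
            timer = self._timers.pop(name, None)
        if timer is not None:
            timer.cancel()


_instance = None
_instance_lock = threading.Lock()


def global_watchdog():
    """Return the single per-process watchdog, creating it on first use."""
    global _instance
    with _instance_lock:
        if _instance is None:
            _instance = HangWatchdog()
        return _instance
```

Callers in different subsystems would all go through `global_watchdog()` rather than constructing their own instance, which is the consolidation the commit describes.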
PiperOrigin-RevId: 892892516
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )