When moving tensors between devices, we should leverage collective operations (e.g., send/recv) to transfer data directly between devices whenever possible. Currently, transferring a tensor to another device may require materializing it on the CPU as TensorData, which introduces unnecessary overhead. By using collective operations for device-to-device transfers, we can avoid this intermediate CPU materialization and improve performance.
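To make the contrast concrete, here is a minimal, purely illustrative sketch (hypothetical names, not the framework's actual API) that models the two paths: the current route that materializes a host-side copy, and the proposed direct device-to-device send.

```python
# Hypothetical model of the two transfer paths. "Device" and its buffers are
# illustration-only stand-ins, not the real backend types.

class Device:
    def __init__(self, name):
        self.name = name
        self.buffers = {}  # tensor id -> payload resident on this device

    def send(self, other, tid):
        # Direct device-to-device copy: payload never leaves device memory.
        other.buffers[tid] = self.buffers[tid]


def transfer_via_cpu(src, dst, tid):
    # Current path: materialize the tensor on the CPU (as TensorData-like
    # host data) before writing it to the destination device.
    host_copy = list(src.buffers[tid])  # the intermediate copy we want to avoid
    dst.buffers[tid] = host_copy
    return host_copy


def transfer_direct(src, dst, tid):
    # Proposed path: a send/recv-style transfer with no host intermediate.
    src.send(dst, tid)


gpu0, gpu1 = Device("gpu0"), Device("gpu1")
gpu0.buffers["t"] = [1.0, 2.0, 3.0]
transfer_direct(gpu0, gpu1, "t")
```

The sketch only models data movement; in a real backend the direct path would map to peer-to-peer or NCCL-style send/recv primitives, eliminating the host allocation and the two extra copies that the CPU round-trip implies.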