https://github.com/pytorch/pytorch/pull/62140 "grouped comm on a set of unflattened tensors can be more performant than flattening+a single flat nccl call."