CUTLASS 4.5.0
#3231
We will have 4.5.1 to address quack issues.
CuTe DSL
New features
- `block_copy()` to simplify TMA and S2T copy. With `block_copy()`, users can ignore the details of multicast and 2CTA partitioning for TMA and need not invoke `tma_partition()`. Users can also remove the bulk of the S2T initialization to simplify S2T copy.
- `C.remap_modes[:, 0, 1]` subscript syntax (where `:` marks a broadcast dimension and integers select source mode indices). Covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations.
Bug fixing and improvements
- More examples of authoring peak-performance kernels
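As a rough illustration of the `remap_modes` subscript listed under New features, the sketch below models the described semantics in plain Python, without CuTe DSL or PyTorch. It assumes that each entry of the subscript describes one output mode, that an integer names the source mode it reads from, and that `:` marks an output mode the source does not have. The names `BROADCAST`, `source_coord`, and `remap` are hypothetical helpers for this sketch and are not part of the CuTe DSL API.

```python
from itertools import product

BROADCAST = None  # hypothetical stand-in for ':' in the subscript


def source_coord(spec, out_coord):
    """Return the source coordinate that one output coordinate reads from.

    Assumes each source mode index appears exactly once in spec.
    """
    src_modes = [m for m in spec if m is not BROADCAST]
    coord = [0] * len(src_modes)
    for out_mode, src_mode in enumerate(spec):
        if src_mode is not BROADCAST:
            coord[src_mode] = out_coord[out_mode]
    return tuple(coord)


def remap(src, spec, out_shape):
    """Materialize the remapped view as {output coordinate: value}."""
    out = {}
    for out_coord in product(*(range(n) for n in out_shape)):
        value = src
        for index in source_coord(spec, out_coord):
            value = value[index]  # walk into the nested list, one mode at a time
        out[out_coord] = value
    return out


matrix = [[1, 2, 3], [4, 5, 6]]  # source modes: 0 = rows (2), 1 = columns (3)
bias = [10, 20, 30]              # a single source mode of extent 3

transposed = remap(matrix, (1, 0), (3, 2))             # mode permutation
row_bcast = remap(bias, (BROADCAST, 0), (2, 3))        # same row repeated
col_bcast = remap([7, 8], (0, BROADCAST), (2, 3))      # same column repeated
scalar = remap(5, (BROADCAST, BROADCAST), (2, 3))      # scalar broadcast
batched = remap(matrix, (BROADCAST, 0, 1), (4, 2, 3))  # the [:, 0, 1] case
```

Under this reading, `[:, 0, 1]` adds a leading broadcast mode in front of an unchanged matrix, while `[1, 0]` is a plain transpose. The actual DSL may differ in details such as how output extents are inferred.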
API changes
CUTLASS C++
- `ConstSubbyteReference`
- `__nv_atomic_load_n` with `volatile` for CUDA 11.4 compatibility in subbyte reference
- `PipelineStorage` shadowing in SM100 complex epilogue

This discussion was created from the release CUTLASS 4.5.0.