[EP] Add SM80/A100 build support with SM90 feature guards by bkpathak · Pull Request #963 · uccl-project/uccl

bkpathak · 2026-05-23T20:06:27Z

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

Bug fix
New feature
Documentation update

How Has This Been Tested?

Include any tests here.

Unit tests
Integration tests
Manual testing

Checklist

[x] I have run format.sh to follow the style guidelines.
I have run build.sh to verify compilation.
I have removed redundant variables and comments.
I have updated the documentation.
I have added tests.

Test-bed

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.159.03             Driver Version: 580.159.03     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:06:00.0 Off |                    0 |
| N/A   29C    P0             47W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
NVIDIA A100-SXM4-40GB
(8, 0)
nvcc not found
Linux 129-146-53-27 5.15.0-143-generic #153-Ubuntu SMP Fri Jun 13 19:10:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
NAME="Ubuntu"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
Python 3.12.13
PyTorch: 2.5.1+cu121

Build Command

TORCH_CUDA_ARCH_LIST="8.0" \
DISABLE_SM90_FEATURES=1 \
DISABLE_AGGRESSIVE_PTX_INSTRS=1 \
PER_EXPERT_BATCHING=1 \
bash build.sh cu12 ep 3.12

Install Wheel

python3.12 -m pip install wheelhouse-cu12/uccl-0.1.1-cp312-abi3-manylinux_2_35_x86_64.whl --no-deps

Test

python3.12 -c "
import torch
import uccl.ep

print('=== Hardware ===')
print('Device:', torch.cuda.get_device_name(0))
print('Capability:', torch.cuda.get_device_capability(0))
print('CUDA available:', torch.cuda.is_available())

print()
print('=== UCCL EP ===')
print('SM90 compiled:', uccl.ep.is_sm90_compiled())
print('Buffer:', uccl.ep.Buffer)
print('Config:', uccl.ep.Config)
print('APIs:', [x for x in dir(uccl.ep) if not x.startswith('_')])

print()
print('success')
"
=== Hardware ===
Device: NVIDIA A100-SXM4-40GB
Capability: (8, 0)
CUDA available: True

=== UCCL EP ===
SM90 compiled: False
Buffer: <class 'uccl.ep.Buffer'>
Config: <class 'uccl.ep.Config'>
APIs: ['Bench', 'BenchFifo', 'Buffer', 'Config', 'EnvInfo', 'EventHandle', 'EventOverlap', 'Fifo', 'FifoProxy', 'Proxy', 'Stats', 'alloc_cmd_ring', 'can_register_rdma_gpu_buffer', 'check_stream', 'connect_atomic_buffer', 'device_reset', 'free_cmd_ring', 'get_device', 'get_low_latency_rdma_size_hint', 'get_num_proxy_threads', 'get_oob_ip', 'get_rdma_buffer', 'has_proxy', 'is_sm90_compiled', 'launch_gpu_issue_kernel', 'rdma_buffer_should_use_host_alloc', 'register_proxies', 'register_proxy', 'set_device', 'stop_all_registered_proxies', 'stream_query', 'sync_stream', 'unregister_proxy']

success
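
The output above pairs a wheel built without SM90 features with a compute-capability (8, 0) device. As a minimal sketch of how a caller might sanity-check that pairing before using the EP kernels, consider the helper below. `wheel_matches_device` is a hypothetical name and is not part of uccl.ep; it assumes that SM90-only code paths need compute capability 9.0 or higher, and it takes both values as plain arguments so it can run without a GPU.

```python
from typing import Tuple


def wheel_matches_device(capability: Tuple[int, int], sm90_compiled: bool) -> str:
    """Classify a (device capability, SM90-compiled flag) pairing.

    Hypothetical helper for illustration; not part of uccl.ep.
    Assumption: SM90-only code paths need compute capability >= 9.0.
    """
    major, _minor = capability
    device_is_sm90_plus = major >= 9
    if sm90_compiled and not device_is_sm90_plus:
        # An SM90 build on an older GPU is the mismatch the build guards target.
        return "mismatch"
    if not sm90_compiled and device_is_sm90_plus:
        # Assumed to work, but with the SM90-only paths compiled out.
        return "fallback"
    return "ok"


if __name__ == "__main__":
    # Values copied from the test output above: A100, SM90 features disabled.
    print(wheel_matches_device((8, 0), sm90_compiled=False))
```

On a real machine the two arguments could come from torch.cuda.get_device_capability(0) and uccl.ep.is_sm90_compiled(), both of which appear in the test script above.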

YangZhou1997 · 2026-05-23T21:06:22Z

@manojgop does this fix the issues you mentioned in https://uccl-dev.slack.com/archives/C0B2FQHQNTT/p1779093594675089?thread_ts=1778975124.159549&cid=C0B2FQHQNTT? I do not have any A100 machines on hand, so I cannot verify it myself.

However, I have tested it on NVIDIA H100 + EFA and AMD MI325x + Broadcom, and both work.

@MaoZiming @zhenhuang12, could either of you also review it?

manojgop · 2026-05-28T07:55:32Z

@bkpathak I've fixed the build for cu13 and added changes to verify A100 in low-latency mode, both inter-node and intra-node, on a two-node setup with two A100 GPUs per node. You can take this commit and update PR #963: manojgop@f9dab2e. Note: I tested this with an Intel RDMA NIC.

bkpathak · 2026-05-28T14:42:15Z

Sure, will send the updated PR.

- Add software grid barrier (cuda_grid_barrier) for non-SM90 path using cooperative launch with atomicAdd/atomicExch reset pattern
- Use cudaLaunchKernelEx with cooperative attribute for non-SM90 GPUs
- Guard TMA intrinsics with DISABLE_SM90_FEATURES to avoid ISA errors
- Add warp-copy fallback for TMA store paths on pre-SM90
- Handle CUDA 13+ cuda_fp8.h include compatibility in ep_configs.cuh

Signed-off-by: Manoj Gopalakrishnan <manoj.gopalakrishnan@intel.com>
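
The commit message describes a counter-based barrier in which arrivals increment a shared counter and the counter is then reset so the barrier can be reused. As a rough CPU-side analogy only, with Python threads standing in for thread blocks and a lock standing in for atomicAdd/atomicExch, the arrive-and-reset pattern might look like the sketch below. `CounterBarrier`, its generation field, and `run_phases` are hypothetical names added for illustration; this is not the PR's cuda_grid_barrier, and it says nothing about which CUDA memory fences that code needs.

```python
import threading
import time


class CounterBarrier:
    """Reusable counter barrier (illustrative analogy, not the PR's CUDA code)."""

    def __init__(self, num_parties: int) -> None:
        self.num_parties = num_parties
        self._lock = threading.Lock()
        self._count = 0
        self._generation = 0  # assumption: added here so waiters can detect release

    def wait(self) -> None:
        with self._lock:
            my_generation = self._generation
            self._count += 1  # plays the role of atomicAdd on the arrival counter
            if self._count == self.num_parties:
                self._count = 0  # plays the role of the atomicExch reset
                self._generation += 1  # releases everyone spinning below
                return
        while True:  # spin until the last arriver bumps the generation
            with self._lock:
                if self._generation != my_generation:
                    return
            time.sleep(0)  # yield to other threads


def run_phases(num_workers: int, num_phases: int) -> bool:
    """Return True if every worker saw all writes of a phase after the barrier."""
    barrier = CounterBarrier(num_workers)
    slots = [-1] * num_workers
    errors = []

    def worker(rank: int) -> None:
        for phase in range(num_phases):
            slots[rank] = phase
            barrier.wait()  # all writes for this phase are done past this point
            if any(value != phase for value in slots):
                errors.append((rank, phase))
            barrier.wait()  # stop fast workers from overwriting slots early

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not errors


if __name__ == "__main__":
    print(run_phases(num_workers=4, num_phases=20))
```

On a GPU the same spin-wait only terminates if every block is resident at the same time, which is presumably why the commit pairs the barrier with a cooperative launch.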

manojgop · 2026-06-01T03:09:07Z

@YangZhou1997 @MaoZiming Could you please review this PR? @bkpathak This PR needs to be rebased onto the main branch.

MaoZiming · 2026-06-02T04:29:42Z

+
+    // Acquire fence: ensure we see all writes from all blocks that arrived
+    // before us (transitively through the atomic total order).
+    __threadfence();


The __threadfence() here is not an acquire fence, is it? Is it needed?

MaoZiming · 2026-06-02T04:34:48Z

@bkpathak Thank you for the contribution! Would you also like to share some performance results on A100?

bkpathak · 2026-06-04T03:47:18Z

Sure, will add the test results.

[EP] Add SM80/A100 build support with SM90 feature guards

dae59bb

bkpathak requested review from MaoZiming, YangZhou1997 and zhenhuang12 as code owners May 23, 2026 20:06

YangZhou1997 added run-benchmark and removed run-benchmark labels May 23, 2026

Merge branch 'main' into issue_951

cbfb406

Merge branch 'main' into issue_951

c7afa7b

MaoZiming reviewed Jun 2, 2026

bkpathak wants to merge 4 commits into uccl-project:main from bkpathak:issue_951
