[EP] Add SM80/A100 build support with SM90 feature guards#963

Open
bkpathak wants to merge 4 commits into
uccl-project:mainfrom
bkpathak:issue_951

@bkpathak

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • [x] I have run format.sh to follow the style guidelines.
  • I have run build.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

Test-bed

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.159.03             Driver Version: 580.159.03     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:06:00.0 Off |                    0 |
| N/A   29C    P0             47W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
NVIDIA A100-SXM4-40GB
(8, 0)
nvcc not found
Linux 129-146-53-27 5.15.0-143-generic #153-Ubuntu SMP Fri Jun 13 19:10:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
NAME="Ubuntu"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
Python 3.12.13
PyTorch: 2.5.1+cu121

Build Command

TORCH_CUDA_ARCH_LIST="8.0" \
DISABLE_SM90_FEATURES=1 \
DISABLE_AGGRESSIVE_PTX_INSTRS=1 \
PER_EXPERT_BATCHING=1 \
bash build.sh cu12 ep 3.12
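
The flags above could be derived from the GPU's compute capability rather than typed by hand. The following is a minimal sketch, not part of the PR: `build_env_for_capability` is a hypothetical helper, and the rule "set both DISABLE_* flags below SM90, omit them otherwise" is an assumption that simply mirrors the A100 command above. `PER_EXPERT_BATCHING` is left out because the thread does not say whether it depends on the architecture.

```python
def build_env_for_capability(major, minor):
    """Return environment variables for build.sh (hypothetical helper).

    Assumption: pre-SM90 GPUs need both DISABLE_* flags, as in the
    A100 build command in this PR; SM90 and newer need neither.
    """
    env = {"TORCH_CUDA_ARCH_LIST": f"{major}.{minor}"}
    if (major, minor) < (9, 0):
        # Flag names are taken from the PR's build command.
        env["DISABLE_SM90_FEATURES"] = "1"
        env["DISABLE_AGGRESSIVE_PTX_INSTRS"] = "1"
    return env


def format_build_command(env, args="cu12 ep 3.12"):
    """Render the env dict as a one-line shell command for build.sh."""
    prefix = " ".join(f'{key}="{value}"' for key, value in env.items())
    return f"{prefix} bash build.sh {args}"


if __name__ == "__main__":
    # A100 reports compute capability (8, 0).
    print(format_build_command(build_env_for_capability(8, 0)))
```

On a machine with PyTorch installed, the tuple returned by `torch.cuda.get_device_capability(0)` could be unpacked straight into the helper.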

Install Wheel

python3.12 -m pip install wheelhouse-cu12/uccl-0.1.1-cp312-abi3-manylinux_2_35_x86_64.whl --no-deps

Test

python3.12 -c "
import torch
import uccl.ep

print('=== Hardware ===')
print('Device:', torch.cuda.get_device_name(0))
print('Capability:', torch.cuda.get_device_capability(0))
print('CUDA available:', torch.cuda.is_available())

print()
print('=== UCCL EP ===')
print('SM90 compiled:', uccl.ep.is_sm90_compiled())
print('Buffer:', uccl.ep.Buffer)
print('Config:', uccl.ep.Config)
print('APIs:', [x for x in dir(uccl.ep) if not x.startswith('_')])

print()
print('success')
"
=== Hardware ===
Device: NVIDIA A100-SXM4-40GB
Capability: (8, 0)
CUDA available: True

=== UCCL EP ===
SM90 compiled: False
Buffer: <class 'uccl.ep.Buffer'>
Config: <class 'uccl.ep.Config'>
APIs: ['Bench', 'BenchFifo', 'Buffer', 'Config', 'EnvInfo', 'EventHandle', 'EventOverlap', 'Fifo', 'FifoProxy', 'Proxy', 'Stats', 'alloc_cmd_ring', 'can_register_rdma_gpu_buffer', 'check_stream', 'connect_atomic_buffer', 'device_reset', 'free_cmd_ring', 'get_device', 'get_low_latency_rdma_size_hint', 'get_num_proxy_threads', 'get_oob_ip', 'get_rdma_buffer', 'has_proxy', 'is_sm90_compiled', 'launch_gpu_issue_kernel', 'rdma_buffer_should_use_host_alloc', 'register_proxies', 'register_proxy', 'set_device', 'stop_all_registered_proxies', 'stream_query', 'sync_stream', 'unregister_proxy']

success
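
The output above pairs a device capability of (8, 0) with `SM90 compiled: False`, which is the combination this PR targets. A small sketch of how that pairing might be checked automatically follows. `check_build_matches_device` is a hypothetical function, not part of `uccl.ep`, and the interpretation of each combination is an assumption based on the PR description.

```python
def check_build_matches_device(capability, sm90_compiled):
    """Classify a (compute capability, SM90 build flag) pair.

    Hypothetical helper. Assumption: a wheel built with SM90 features
    should not be used on a pre-SM90 GPU, while a wheel built without
    them is still usable on SM90+ hardware, only without those features.
    """
    major, _minor = capability
    if sm90_compiled and major < 9:
        return "mismatch: SM90 features compiled in, but the GPU is pre-SM90"
    if not sm90_compiled and major >= 9:
        return "ok: SM90-capable GPU, build has SM90 features disabled"
    return "ok"


if __name__ == "__main__":
    # Values taken from the test output in this PR (A100, SM90 disabled).
    print(check_build_matches_device((8, 0), False))
```

In a real environment the two arguments would come from `torch.cuda.get_device_capability(0)` and `uccl.ep.is_sm90_compiled()`, both of which appear in the test script above.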

@YangZhou1997

Member

@manojgop, does this fix the issues you mentioned in https://uccl-dev.slack.com/archives/C0B2FQHQNTT/p1779093594675089?thread_ts=1778975124.159549&cid=C0B2FQHQNTT? I do not have any A100 machines on hand, so I cannot verify.

I have, however, tested on NVIDIA H100 + EFA and AMD MI325X + Broadcom, and both work.

@MaoZiming @zhenhuang12, could either of you also review it?

@manojgop

manojgop commented May 28, 2026

Contributor

@bkpathak I've fixed the build for cu13 and added changes to verify A100 in low-latency mode, both inter-node and intra-node, on a two-node setup with two A100 GPUs per node. You can take this commit and update PR #963: manojgop@f9dab2e. Note: I tested this with an Intel RDMA NIC.

@bkpathak

Author

Sure, will send the updated PR.

- Add software grid barrier (cuda_grid_barrier) for non-SM90 path using
  cooperative launch with atomicAdd/atomicExch reset pattern
- Use cudaLaunchKernelEx with cooperative attribute for non-SM90 GPUs
- Guard TMA intrinsics with DISABLE_SM90_FEATURES to avoid ISA errors
- Add warp-copy fallback for TMA store paths on pre-SM90
- Handle CUDA 13+ cuda_fp8.h include compatibility in ep_configs.cuh

Signed-off-by: Manoj Gopalakrishnan <manoj.gopalakrishnan@intel.com>
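
The counter-and-reset barrier named in the first bullet can be illustrated on the host. The sketch below is an assumption-laden simulation, not the PR's `cuda_grid_barrier`: Python threads stand in for thread blocks, a lock-protected integer stands in for `atomicAdd`/`atomicExch` on global memory, and the `generation` word that waiters spin on is an illustrative choice that the commit message does not confirm. It does not model GPU memory fences.

```python
import threading
import time


class AtomicInt:
    """Lock-protected integer standing in for a word in GPU global memory."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):  # plays the role of atomicAdd
        with self._lock:
            old = self._value
            self._value += delta
            return old

    def exchange(self, new):  # plays the role of atomicExch
        with self._lock:
            old = self._value
            self._value = new
            return old

    def load(self):
        with self._lock:
            return self._value


class SoftwareGridBarrier:
    """Reusable counter barrier; the last arriver resets and releases."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.count = AtomicInt(0)
        self.generation = AtomicInt(0)  # assumption: release flag

    def wait(self):
        gen = self.generation.load()  # read before announcing arrival
        if self.count.fetch_add(1) == self.num_blocks - 1:
            self.count.exchange(0)  # reset first, so the barrier is reusable
            self.generation.fetch_add(1)  # then release the waiters
        else:
            while self.generation.load() == gen:
                time.sleep(0)  # yield; a GPU block would spin here


def simulate(num_blocks=4, num_phases=3):
    """Run num_blocks threads through num_phases barrier-separated phases."""
    barrier = SoftwareGridBarrier(num_blocks)
    arrived = [[False] * num_blocks for _ in range(num_phases)]
    violations = []

    def block(block_id):
        for phase in range(num_phases):
            arrived[phase][block_id] = True
            barrier.wait()
            # After the barrier, every block must have finished this phase.
            if not all(arrived[phase]):
                violations.append((block_id, phase))

    threads = [threading.Thread(target=block, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations, barrier.count.load(), barrier.generation.load()


if __name__ == "__main__":
    print(simulate())
```

A spinning barrier of this kind only terminates if every participant is actually running, which is why the GPU version is paired with a cooperative launch: it guarantees that all blocks of the grid are resident at the same time.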
@manojgop

manojgop commented Jun 1, 2026

Contributor

@YangZhou1997 @MaoZiming, could you please review this PR? @bkpathak, this PR needs to be rebased onto the main branch.

Comment thread ep/include/ep_utils.cuh

// Acquire fence: ensure we see all writes from all blocks that arrived
// before us (transitively through the atomic total order).
__threadfence();

@MaoZiming MaoZiming Jun 2, 2026

Member

The __threadfence() here is not an acquire fence, is it? Is it needed?

@MaoZiming

Member

@bkpathak Thank you for the contribution! Would you also like to share some performance results on A100?

@bkpathak

bkpathak commented Jun 4, 2026

Author

Sure, will add the test results.
