cuiyixin555/cuInferenceEngine

Inference Engine Kernel with CUDA Core Compute Libraries (CCCL)

Welcome to the inference engine kernels built with the CUDA Core Compute Libraries (CCCL), where my mission is to make writing inference operator kernels delightful.

This repository contains implementations of deep learning operators built on the CUDA CCCL library, such as RmsNorm, AddRmsNorm, and MoE.

All code in this repo runs on WSL Ubuntu 24.04 under Windows 11.

Project layout

cuInferenceEngine/
├── kernel/                  # CUDA kernel headers
├── test/cpp_test/           # GTest source files
├── output/                  # compiled test binaries (generated by make.sh)
└── make.sh                  # build script

How to set up the environment

sudo apt update
sudo apt install -y build-essential cmake ninja-build libgtest-dev python3 python3-pip

python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu130

PyTorch is only required for tests that include torch/torch.h (e.g. cuRmsNormCUDATest).

How to build

Use make.sh to compile any test under test/cpp_test/. The binary is written to output/<name>.

cd /mnt/c/Users/xincu/cuInferenceEngine

# no PyTorch required
./make.sh test/cpp_test/cuReorderTest.cpp
./make.sh test/cpp_test/cuDeepSeekUnPermutationCUDATest.cpp

# requires PyTorch (.venv is auto-activated when present)
source .venv/bin/activate
./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

make.sh will:

  • auto-detect GPU architecture via nvidia-smi (fallback: sm_89)
  • add -Ikernel for kernel headers
  • link GTest automatically
  • detect PyTorch dependency from source and add Torch include/lib paths when needed
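
The PyTorch detection step in the list above could work roughly like this. It is a minimal sketch, not the logic make.sh actually uses: the function name needs_torch and the grep pattern are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a source file includes a Torch header.
# make.sh may detect the dependency differently.
needs_torch() {
    # Matches lines such as: #include <torch/torch.h>  or  #include "torch/extension.h"
    grep -Eq '^[[:space:]]*#[[:space:]]*include[[:space:]]*[<"]torch/' "$1"
}

# Report which tests would need the Torch include/lib paths.
for src in test/cpp_test/*.cpp; do
    [ -f "$src" ] || continue   # skip when the glob matches nothing
    if needs_torch "$src"; then
        echo "$src: needs PyTorch"
    else
        echo "$src: no PyTorch required"
    fi
done
```

Run from the repository root, this prints one line per test source, which is a quick way to see which builds will need the virtual environment.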

Optional environment variables:

# override GPU arch (example: compute capability 8.9 -> sm_89)
CUDA_ARCH=sm_89 ./make.sh test/cpp_test/cuReorderTest.cpp

# specify Python for PyTorch paths
PYTHON=python3 ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

Query GPU compute capability manually:

nvidia-smi --query-gpu=compute_cap --format=csv
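
To turn the reported compute capability into a CUDA_ARCH value, something like the following may help. It is a sketch under assumptions: the helper names cap_to_arch and detect_arch are hypothetical, and the sm_89 fallback simply mirrors the fallback described for make.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a compute capability such as "8.9" into "sm_89".
cap_to_arch() {
    # Drop the dot and any stray spaces, then prefix with sm_
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '. ')"
}

# Query the first GPU; fall back to sm_89 when nvidia-smi is missing or fails.
detect_arch() {
    local cap
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1 || true)"
    if [ -n "$cap" ]; then
        cap_to_arch "$cap"
    else
        echo "sm_89"
    fi
}

detect_arch
```

The printed value can be passed straight through, for example CUDA_ARCH="$(detect_arch)" ./make.sh test/cpp_test/cuReorderTest.cpp.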

Build error: List_inl.h: need 'typename' before 'decltype'

This error appears when nvcc compiles a translation unit that includes PyTorch headers; nvcc cannot compile them in a single translation unit. For tests that include torch/torch.h, make.sh automatically uses a split build:

  1. nvcc compiles kernel/<name>Kernel.cu (CUDA kernel only)
  2. g++ compiles the test .cpp (PyTorch host code)
  3. g++ links both objects
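
The three steps above can be pictured as the commands printed by this dry run. It is only a sketch: the function name print_split_build, the intermediate object paths, the compiler flags, and the TORCH_CXXFLAGS / TORCH_LDFLAGS placeholders are all assumptions, not values taken from make.sh. A real link step would also need the CUDA runtime library and its search path.

```shell
#!/usr/bin/env bash
# Hypothetical dry run: print (do not execute) the split-build commands.
# $1 is the shared name prefix, e.g. cuRmsNormCUDA for
# kernel/cuRmsNormCUDAKernel.cu and test/cpp_test/cuRmsNormCUDATest.cpp.
print_split_build() {
    local name="$1"
    local arch="${2:-sm_89}"
    local build="/tmp/split-build"   # illustrative location for object files

    # 1. nvcc compiles only the CUDA kernel
    echo "nvcc -arch=$arch -Ikernel -c kernel/${name}Kernel.cu -o $build/${name}Kernel.o"
    # 2. g++ compiles the PyTorch host code (Torch include flags are a placeholder)
    echo "g++ -std=c++17 -Ikernel \$TORCH_CXXFLAGS -c test/cpp_test/${name}Test.cpp -o $build/${name}Test.o"
    # 3. g++ links both objects (Torch library flags are a placeholder)
    echo "g++ $build/${name}Kernel.o $build/${name}Test.o -o output/${name}Test \$TORCH_LDFLAGS -lgtest -lpthread"
}

print_split_build cuRmsNormCUDA
```

Keeping the Torch headers out of the nvcc step is the point of the split: only g++ ever sees torch/torch.h.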

For cuRmsNormCUDATest, ensure kernel/cuRmsNormCUDAKernel.cu exists and run:

./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

Build seems stuck on cuRmsNormCUDATest?

This is usually not a hang. After printing arch : sm_89, nvcc is compiling #include <torch/torch.h>, which pulls in a very large header tree. The first build often takes 10–20 minutes with no further output.

Try:

# show nvcc progress
VERBOSE=1 ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

# put nvcc temp files on Linux native FS (recommended on WSL)
TMPDIR=/tmp/cuInferenceEngine-build ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

If the repo lives under /mnt/c/ (Windows mount on WSL), compilation can be much slower than on the Linux filesystem. For daily development, clone to ~/cuInferenceEngine instead.
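
A quick way to check whether the current checkout sits on a Windows mount is sketched below. The function name on_windows_mount is hypothetical, and the pattern assumes the default WSL layout where drives appear as /mnt/<letter>/.

```shell
#!/usr/bin/env bash
# Hypothetical check: is a path on a WSL Windows drive mount (/mnt/c, /mnt/d, ...)?
on_windows_mount() {
    case "$1" in
        /mnt/[a-zA-Z]|/mnt/[a-zA-Z]/*) return 0 ;;
        *) return 1 ;;
    esac
}

if on_windows_mount "$PWD"; then
    echo "warning: $PWD is on a Windows mount; expect slower builds" >&2
fi
```

Only single-letter mount points match, so other directories under /mnt are not flagged.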

While waiting, you can confirm nvcc is working in another terminal:

ps aux | grep nvcc
top -p $(pgrep -n nvcc)

If CPU usage stays high, compilation is still in progress, so please wait.

How to run

cuReorderTest

./output/cuReorderTest

# run a single test
./output/cuReorderTest --gtest_filter=ReorderCUDATest.uma_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.fold_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.last_token_extract
./output/cuReorderTest --gtest_filter=ReorderCUDATest.multi_segment_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.large_roundtrip
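
Several tests can also be selected in one run, because --gtest_filter accepts patterns separated by ':'. The helper below is a small sketch for building such a filter; the name gtest_filter_for is hypothetical and is not part of this repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build "Suite.a:Suite.b:..." from a suite name and test names.
gtest_filter_for() {
    local suite="$1"
    shift
    local filter="" name
    for name in "$@"; do
        # Append with ':' only when the filter already has an entry
        filter="${filter:+$filter:}${suite}.${name}"
    done
    printf '%s\n' "$filter"
}

# Run two tests in one invocation (guarded so the snippet is safe without the binary).
if [ -x ./output/cuReorderTest ]; then
    ./output/cuReorderTest --gtest_filter="$(gtest_filter_for ReorderCUDATest uma_roundtrip fold_roundtrip)"
fi
```

GTest also accepts wildcards, for example --gtest_filter='ReorderCUDATest.*roundtrip', and --gtest_list_tests prints the available test names.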

cuDeepSeekUnPermutationCUDATest

./output/cuDeepSeekUnPermutationCUDATest

cuRmsNormCUDATest

export LD_LIBRARY_PATH="$(python3 - <<'PY'
import os, torch
print(os.path.join(os.path.dirname(torch.__file__), "lib"))
PY
):$LD_LIBRARY_PATH"

# run sanity test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.sanity

# run regression test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.regression

# run perf test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.perf

About

This repo is for LLM inference with CCCL on CUDA 13.0.
