Inference Engine Kernel with CUDA Core Compute Libraries (CCCL)

Welcome to the inference engine kernel with CUDA Core Compute Libraries, where my mission is to make writing inference operator kernels delightful.

This repository contains implementations of deep learning operators built on the CUDA Core Compute Libraries (CCCL), such as RmsNorm, AddRmsNorm, and Moe.

All code in this repo runs on WSL with Ubuntu 24.04 under Windows 11.

Project layout

cuInferenceEngine/
├── kernel/                  # CUDA kernel headers
├── test/cpp_test/           # GTest source files
├── output/                  # compiled test binaries (generated by make.sh)
└── make.sh                  # build script

How to set up the environment

sudo apt update
sudo apt install -y build-essential cmake ninja-build libgtest-dev python3 python3-pip

python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu131

PyTorch is only required for tests that include torch/torch.h (e.g. cuRmsNormCUDATest).
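
To check ahead of time which tests need PyTorch, something like the following can help. This is a minimal sketch: `needs_torch` is a hypothetical helper, and the grep pattern is an assumption about what counts as a torch include, not necessarily what make.sh does.

```shell
# Hypothetical helper: succeeds if a source file includes a torch header.
# The pattern is an assumption, not taken from make.sh.
needs_torch() {
    grep -Eq '^[[:space:]]*#[[:space:]]*include[[:space:]]*[<"]torch/' "$1"
}

# List which tests under test/cpp_test/ would need PyTorch.
for f in test/cpp_test/*.cpp; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    if needs_torch "$f"; then
        echo "$f: needs PyTorch"
    else
        echo "$f: no PyTorch required"
    fi
done
```

Run it from the repository root so the `test/cpp_test/` glob resolves.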

How to build

Use make.sh to compile any test under test/cpp_test/. The binary is written to output/<name>.

cd /mnt/c/Users/xincu/cuInferenceEngine

# no PyTorch required
./make.sh test/cpp_test/cuReorderTest.cpp
./make.sh test/cpp_test/cuDeepSeekUnPermutationCUDATest.cpp

# requires PyTorch (.venv is auto-activated when present)
source .venv/bin/activate
./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

make.sh will:

auto-detect GPU architecture via nvidia-smi (fallback: sm_89)
add -Ikernel for kernel headers
link GTest automatically
detect PyTorch dependency from source and add Torch include/lib paths when needed
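
The architecture detection in the first step can be sketched roughly as below. The function names `cap_to_arch` and `detect_arch` are hypothetical, and the real make.sh may implement this differently; only the `nvidia-smi` query and the `sm_89` fallback come from this README.

```shell
# Convert a compute capability such as "8.9" into an nvcc arch such as "sm_89".
cap_to_arch() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.[:space:]')"
}

# Hypothetical detection logic (an assumption, not copied from make.sh).
detect_arch() {
    if [ -n "${CUDA_ARCH:-}" ]; then
        echo "$CUDA_ARCH"          # explicit override wins
        return
    fi
    # Take the first GPU only; stay quiet if nvidia-smi is missing.
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1) || cap=""
    if [ -n "$cap" ]; then
        cap_to_arch "$cap"
    else
        echo "sm_89"               # fallback when no GPU is visible
    fi
}

detect_arch
```

On a machine with several different GPUs this sketch only looks at the first one, so setting `CUDA_ARCH` explicitly is the safer choice there.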

Optional environment variables:

# override GPU arch (example: compute capability 8.9 -> sm_89)
CUDA_ARCH=sm_89 ./make.sh test/cpp_test/cuReorderTest.cpp

# specify Python for PyTorch paths
PYTHON=python3 ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

Query GPU compute capability manually:

nvidia-smi --query-gpu=compute_cap --format=csv

Build error: `List_inl.h: need 'typename' before 'decltype'`

PyTorch headers cannot be compiled by nvcc in a single translation unit. For tests that include torch/torch.h, make.sh automatically uses a split build:

nvcc compiles kernel/<name>Kernel.cu (CUDA kernel only)
g++ compiles the test .cpp (PyTorch host code)
g++ links both objects
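
The three steps above correspond roughly to the commands printed by this dry-run sketch. Everything here is illustrative: `kernel_for_test` and `print_split_build` are hypothetical helpers, the flags, object paths, and the `$TORCH_INCLUDES` / `$TORCH_LIBS` placeholders are assumptions, and the actual command lines in make.sh may differ. The test-to-kernel naming follows the cuRmsNormCUDATest example in this README.

```shell
# Map a test source to its kernel source, following the naming in this README
# (cuRmsNormCUDATest.cpp -> kernel/cuRmsNormCUDAKernel.cu).
kernel_for_test() {
    base=$(basename "$1" .cpp)
    echo "kernel/${base%Test}Kernel.cu"
}

# Print (do not run) the three build steps. All flags are assumptions.
print_split_build() {
    test_src=$1
    arch=${2:-sm_89}
    name=$(basename "$test_src" .cpp)
    kernel_src=$(kernel_for_test "$test_src")
    # 1. nvcc compiles only the CUDA kernel
    echo "nvcc -arch=$arch -Ikernel -c $kernel_src -o /tmp/$name.kernel.o"
    # 2. g++ compiles the PyTorch host code
    echo "g++ -std=c++17 -Ikernel \$TORCH_INCLUDES -c $test_src -o /tmp/$name.test.o"
    # 3. g++ links both objects
    echo "g++ /tmp/$name.kernel.o /tmp/$name.test.o -o output/$name \$TORCH_LIBS -lgtest -lcudart -lpthread"
}

print_split_build test/cpp_test/cuRmsNormCUDATest.cpp
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without CUDA installed.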

For cuRmsNormCUDATest, ensure kernel/cuRmsNormCUDAKernel.cu exists and run:

./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

Build seems stuck on cuRmsNormCUDATest?

This is usually not a hang. After printing arch : sm_89, nvcc is compiling #include <torch/torch.h>, which pulls in a very large header tree. The first build often takes 10–20 minutes with no further output.

Try:

# show nvcc progress
VERBOSE=1 ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

# put nvcc temp files on Linux native FS (recommended on WSL)
TMPDIR=/tmp/cuInferenceEngine-build ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp

If the repo lives under /mnt/c/ (Windows mount on WSL), compilation can be much slower than on the Linux filesystem. For daily development, clone to ~/cuInferenceEngine instead.
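
A quick way to tell whether the current checkout sits on a Windows mount is sketched below. `on_windows_mount` is a hypothetical helper, and it assumes the default WSL layout where drives appear as /mnt/<letter>/.

```shell
# Succeeds if the path is under a WSL Windows drive mount (/mnt/<letter>/).
# Assumes the default WSL automount root of /mnt.
on_windows_mount() {
    case "$1" in
        /mnt/[a-zA-Z]|/mnt/[a-zA-Z]/*) return 0 ;;
        *) return 1 ;;
    esac
}

if on_windows_mount "$(pwd -P)"; then
    echo "warning: building under a Windows mount; expect slower compilation"
fi
```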

While waiting, you can confirm nvcc is working in another terminal:

ps aux | grep nvcc
top -p $(pgrep -n nvcc)

If CPU usage stays high, compilation is still in progress—please wait.

How to run

cuReorderTest

./output/cuReorderTest

# run a single test
./output/cuReorderTest --gtest_filter=ReorderCUDATest.uma_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.fold_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.last_token_extract
./output/cuReorderTest --gtest_filter=ReorderCUDATest.multi_segment_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.large_roundtrip
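
Several tests can also be selected in one run, since GoogleTest filters accept `:`-separated patterns and `*` wildcards. The sketch below builds such a filter; `gtest_filter` is a hypothetical helper, not part of this repo.

```shell
# Join test names from one suite into a single --gtest_filter value.
# GoogleTest separates patterns with ':' and supports '*' wildcards.
gtest_filter() {
    suite=$1
    shift
    out=""
    for t in "$@"; do
        out="${out:+$out:}$suite.$t"
    done
    echo "$out"
}

filter=$(gtest_filter ReorderCUDATest uma_roundtrip fold_roundtrip)
echo "$filter"

# Run only if the binary has already been built.
if [ -x ./output/cuReorderTest ]; then
    ./output/cuReorderTest --gtest_filter="$filter"
    # Wildcard form: every ReorderCUDATest test whose name ends in "roundtrip"
    ./output/cuReorderTest --gtest_filter='ReorderCUDATest.*roundtrip'
fi
```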

cuDeepSeekUnPermutationCUDATest

./output/cuDeepSeekUnPermutationCUDATest

cuRmsNormCUDATest

export LD_LIBRARY_PATH="$(python - <<'PY'
import os, torch
print(os.path.join(os.path.dirname(torch.__file__), "lib"))
PY
):$LD_LIBRARY_PATH"

# run sanity test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.sanity

# run regression test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.regression

# run perf test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.perf
