Welcome to the inference engine kernels built with the CUDA Core Compute Libraries (CCCL), where my mission is to make inference operator kernels delightful.
This repository contains implementations of deep learning operators based on the CCCL library, including RmsNorm, AddRmsNorm, and Moe.
All code in this repo runs on WSL (Ubuntu 24.04) under Windows 11.
cuInferenceEngine/
├── kernel/ # CUDA kernel headers
├── test/cpp_test/ # GTest source files
├── output/ # compiled test binaries (generated by make.sh)
└── make.sh # build script
sudo apt update
sudo apt install -y build-essential cmake ninja-build libgtest-dev python3 python3-pip
python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu131
PyTorch is only required for tests that include torch/torch.h (e.g. cuRmsNormCUDATest).
Use make.sh to compile any test under test/cpp_test/. The binary is written to output/<name>.
cd /mnt/c/Users/xincu/cuInferenceEngine
# no PyTorch required
./make.sh test/cpp_test/cuReorderTest.cpp
./make.sh test/cpp_test/cuDeepSeekUnPermutationCUDATest.cpp
# requires PyTorch (.venv is auto-activated when present)
source .venv/bin/activate
./make.sh test/cpp_test/cuRmsNormCUDATest.cpp
make.sh will:
- auto-detect GPU architecture via nvidia-smi (fallback: sm_89)
- add -Ikernel for kernel headers
- link GTest automatically
- detect PyTorch dependency from source and add Torch include/lib paths when needed
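The architecture auto-detection step can be sketched roughly as follows. This is not the code from make.sh: the function names cap_to_arch and detect_arch are hypothetical, and the real script may query or parse nvidia-smi differently. The sketch assumes the compute capability is reported as major.minor (for example 8.9) and that CUDA_ARCH, when set, takes priority.

```shell
# Hypothetical helper (not from make.sh): "8.9" -> "sm_89", "12.0" -> "sm_120".
cap_to_arch() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Hypothetical helper: honor CUDA_ARCH, else ask nvidia-smi, else fall back.
detect_arch() {
    if [ -n "${CUDA_ARCH:-}" ]; then
        printf '%s\n' "$CUDA_ARCH"
        return 0
    fi
    local cap=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        # noheader drops the "compute_cap" title row; head keeps the first GPU.
        cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
               | head -n 1 | tr -d '[:space:]')"
    fi
    case "$cap" in
        ''|*[!0-9.]*) printf 'sm_89\n' ;;   # empty or unexpected output: README fallback
        *)            cap_to_arch "$cap" ;;
    esac
}
```

Calling detect_arch on a machine without nvidia-smi prints the sm_89 fallback, which is why setting CUDA_ARCH explicitly is the safer choice when building for a different GPU.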
Optional environment variables:
# override GPU arch (example: compute capability 8.9 -> sm_89)
CUDA_ARCH=sm_89 ./make.sh test/cpp_test/cuReorderTest.cpp
# specify Python for PyTorch paths
PYTHON=python3 ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp
Query GPU compute capability manually:
nvidia-smi --query-gpu=compute_cap --format=csv
PyTorch headers cannot be compiled by nvcc in a single translation unit. For
tests that include torch/torch.h, make.sh automatically uses a split build:
1. nvcc compiles kernel/<name>Kernel.cu (CUDA kernel only)
2. g++ compiles the test .cpp (PyTorch host code)
3. g++ links both objects
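To make the three split-build steps concrete, here is a dry-run sketch that only prints the commands instead of running them. The helper names needs_torch and print_split_build, the object file names under output/, and the exact flag set are assumptions for illustration. The Torch include paths, Torch and CUDA runtime libraries, and GTest link flags that make.sh adds are deliberately left out.

```shell
# Hypothetical helper: succeeds when the source includes torch/torch.h.
needs_torch() {
    grep -Eq '^[[:space:]]*#[[:space:]]*include[[:space:]]*[<"]torch/torch\.h[>"]' "$1"
}

# Hypothetical helper: print (do not run) the three split-build steps.
# Usage: print_split_build <test source> <arch>
print_split_build() {
    local src="$1" arch="$2" name
    # test/cpp_test/cuRmsNormCUDATest.cpp -> cuRmsNormCUDA
    name="$(basename "$src" Test.cpp)"
    # Step 1: nvcc sees only the CUDA kernel, never the PyTorch headers.
    echo "nvcc -arch=$arch -Ikernel -c kernel/${name}Kernel.cu -o output/${name}Kernel.o"
    # Step 2: g++ compiles the PyTorch host code (Torch include flags omitted here).
    echo "g++ -Ikernel -c $src -o output/${name}Test.o"
    # Step 3: g++ links both objects (library flags omitted here).
    echo "g++ output/${name}Kernel.o output/${name}Test.o -o output/${name}Test"
}
```

For example, print_split_build test/cpp_test/cuRmsNormCUDATest.cpp sm_89 shows that the kernel source is expected at kernel/cuRmsNormCUDAKernel.cu, matching the naming rule in the list above.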
For cuRmsNormCUDATest, ensure kernel/cuRmsNormCUDAKernel.cu exists and run:
./make.sh test/cpp_test/cuRmsNormCUDATest.cpp
If the build appears to stall, it is usually not a hang. After printing arch : sm_89, nvcc is compiling
#include <torch/torch.h>, which pulls in a very large header tree. The first
build often takes 10–20 minutes with no further output.
Try:
# show nvcc progress
VERBOSE=1 ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp
# put nvcc temp files on Linux native FS (recommended on WSL)
TMPDIR=/tmp/cuInferenceEngine-build ./make.sh test/cpp_test/cuRmsNormCUDATest.cpp
If the repo lives under /mnt/c/ (Windows mount on WSL), compilation can be
much slower than on the Linux filesystem. For daily development, clone to
~/cuInferenceEngine instead.
While waiting, you can confirm nvcc is working in another terminal:
ps aux | grep nvcc
top -p $(pgrep -n nvcc)
If CPU usage stays high, compilation is still in progress; please wait.
./output/cuReorderTest
# run a single test
./output/cuReorderTest --gtest_filter=ReorderCUDATest.uma_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.fold_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.last_token_extract
./output/cuReorderTest --gtest_filter=ReorderCUDATest.multi_segment_roundtrip
./output/cuReorderTest --gtest_filter=ReorderCUDATest.large_roundtrip
./output/cuDeepSeekUnPermutationCUDATest
export LD_LIBRARY_PATH="$(python - <<'PY'
import os, torch
print(os.path.join(os.path.dirname(torch.__file__), "lib"))
PY
):$LD_LIBRARY_PATH"
# run sanity test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.sanity
# run regression test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.regression
# run perf test
./output/cuRmsNormCUDATest --gtest_filter=brRmsNormCUDAKernelTest.perf
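The LD_LIBRARY_PATH export used before running cuRmsNormCUDATest can be wrapped in a small function if you set it often. The sketch below is an optional convenience, not part of the repo, and prepend_ld_path is a hypothetical name. It avoids leaving a trailing ":" when LD_LIBRARY_PATH was previously unset, since the dynamic loader may treat an empty entry as the current directory.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH.
# ${VAR:+...} expands to ":<old value>" only when the old value is non-empty.
prepend_ld_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Example with the Torch lib directory (requires PyTorch in the active Python):
# prepend_ld_path "$(python3 -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib"))')"
```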