build(cuda): register compute capability cfgs and optional legacy BF16/FP8 emulation by haricot · Pull Request #2704 · huggingface/candle

haricot · 2025-01-07T15:40:25Z

tested and works with:

nvidia-smi   --query-gpu="compute_cap"  --format=csv
compute_cap
6.1

cargo run -F "cuda,cudnn" --example llama  --release -- --model-id meta-llama/Llama-3.2-1B-Instruct --temperature 0.1 --which v32-1b-instruct --seed 42 --dtype bf16 --prompt

related EricLBuehler#57

LaurentMazare · 2025-01-07T15:44:43Z

Shouldn't the new code be in some else blocks for the #if __CUDA_ARCH__ >= 800 blocks that contain the proper bf16 implementation?

haricot · 2025-01-07T21:10:52Z

You are certainly right because I do not encounter the condition CUDA_ARCH >= 800 in this control flow in my case but this must possibly cause a function cuda error already present, I will review this.

LaurentMazare · 2025-01-15T21:04:20Z

I think the current code is likely to result in lots of compile failures with cuda compute cap >= 8.0.
Would it work to just replace the checks for compute cap 800 with 530? If you can make the changes on an old cuda toolkit, I can also test them on hardware that is past 800.

haricot · 2025-01-16T21:34:46Z

I hope that the latest additions will allow to work on __CUDA_ARCH__ >=800.
Then I think it logical to propose more tests by considering if possible a full support of the BF16 type for __CUDA_ARCH__ >=530 .

haricot · 2025-10-12T11:43:27Z

With this, we can notice for a similar token/s in f16 or bf16 (fot short sentence), the results bf16 are identical to an original model bf16 despite fallbacks, numerical fidelity preserved, model behavior unchanged.

Tests:
I used a macro_rules assert_tensor to be able to test the test functions on different types and verify that everything works. I thought it was not necessary to add it, for example it looked like this:

ex: assert_tensor! avg_pool2d (f32,bf16,f16)

    assert_tensor!(dev, (t:Tensor), (res):(Tensor)|{
            t.avg_pool2d(3)?.squeeze(0)?
        },[
        res/eq_round|(F32:to_vec3_round, each:4)(
            [[[0.085]], [[0.0078]]]
        ),
        res/eq_max|(F32-F16=>f32, cpu:4e-5, cuda:11e-5, metal:11e-5),
        res/eq_max|(F32-BF16=>f32, cpu:3e-5, cuda:5e-5, metal:5e-5),
    ]);

Issue:
Without this PR, it no longer compiles on my old laptop frozen in Cuda 12.9.

compile log

Caused by:
  process didn't exit successfully: `/tmp/candle/target/release/build/candle-kernels-ce3a0757978ddb55/build-script-build` (exit status:```
Caused by:
  process didn't exit successfully: `/tmp/candle/target/release/build/candle-kernels-ce3a0757978ddb55/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=src/compatibility.cuh
  cargo:rerun-if-changed=src/cuda_utils.cuh
  cargo:rerun-if-changed=src/binary_op_macros.cuh
  cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
  cargo:rustc-env=CUDA_COMPUTE_CAP=61
  cargo:info=Builder { cuda_root: Some("/opt/cuda"), kernel_paths: ["src/affine.cu", "src/binary.cu", "src/cast.cu", "src/conv.cu", "src/fill.cu", "src/indexing.cu", "src/quantized.cu", "src/reduce.cu", "src/sort.cu", "src/ternary.cu", "src/unary.cu"], watch: [], include_paths: ["src/binary_op_macros.cuh", "src/compatibility.cuh", "src/cuda_utils.cuh"], compute_cap: Some(61), out_dir: "/tmp/candle/target/release/build/candle-kernels-4037943e9d714c54/out", extra_args: [] }
  cargo:rustc-env=CUDA_INCLUDE_DIR=/opt/cuda/include
  cargo:rerun-if-changed=src/binary_op_macros.cuh
  cargo:rerun-if-changed=src/compatibility.cuh
  cargo:rerun-if-changed=src/cuda_utils.cuh
  cargo:rerun-if-env-changed=NVCC_CCBIN
  cargo:rerun-if-changed=src/cast.cu
  cargo:rerun-if-changed=src/indexing.cu
  cargo:rerun-if-changed=src/affine.cu
  cargo:rerun-if-changed=src/conv.cu
  cargo:rerun-if-changed=src/binary.cu
  cargo:rerun-if-changed=src/sort.cu
  cargo:rerun-if-changed=src/quantized.cu
  cargo:rerun-if-changed=src/fill.cu
  cargo:rerun-if-changed=src/ternary.cu
  cargo:rerun-if-changed=src/reduce.cu
  cargo:rerun-if-changed=src/unary.cu

  --- stderr
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  src/reduce.cu(612): error: no instance of overloaded function "atomicAdd" matches the argument list
              argument types are: (__half *, const __half)
    extern "C" __attribute__((global)) void sum_f16( const size_t numel, const size_t num_dims, const size_t num_sum_dims, const size_t *info, const __half *inp, __half *out) { const size_t *dims = info; const size_t *strides = info + num_dims; const size_t *sum_dims_l = info + 2 * num_dims; const size_t *sum_dims_s = info + 2 * num_dims + num_sum_dims; if (is_contiguous(num_dims, dims, strides)) { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[i]); } } else { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { unsigned strided_i = get_strided_index(i, num_dims, dims, strides); size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[strided_i]); } } }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             ^
  /opt/cuda/include/cuda_bf16.hpp(3802): note #3326-D: function "atomicAdd(__nv_bfloat162 *, __nv_bfloat162)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __nv_bfloat162 atomicAdd(__nv_bfloat162 *const address, const __nv_bfloat162 val)
                                                             ^
  /opt/cuda/include/cuda_fp16.hpp(3426): note #3326-D: function "atomicAdd(__half2 *, __half2)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __half2 atomicAdd(__half2 *const address, const __half2 val) {
                                                      ^
  /opt/cuda/include/sm_60_atomic_functions.hpp(292): note #3326-D: function "atomicAdd(double *, double)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) double atomicAdd(double *address, double val)
                                                     ^
  /opt/cuda/include/sm_20_atomic_functions.hpp(82): note #3326-D: function "atomicAdd(float *, float)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) float atomicAdd(float *address, float val)
                                                    ^
  /opt/cuda/include/device_atomic_functions.hpp(224): note #3326-D: function "atomicAdd(unsigned long long *, unsigned long long)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned long long int atomicAdd(unsigned long long int *address, unsigned long long int val)
                                                                     ^
  /opt/cuda/include/device_atomic_functions.hpp(110): note #3326-D: function "atomicAdd(unsigned int *, unsigned int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned int atomicAdd(unsigned int *address, unsigned int val)
                                                           ^
  /opt/cuda/include/device_atomic_functions.hpp(105): note #3326-D: function "atomicAdd(int *, int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) int atomicAdd(int *address, int val)
                                                  ^

  src/reduce.cu(612): error: no instance of overloaded function "atomicAdd" matches the argument list
              argument types are: (__half *, const __half)
    extern "C" __attribute__((global)) void sum_f16( const size_t numel, const size_t num_dims, const size_t num_sum_dims, const size_t *info, const __half *inp, __half *out) { const size_t *dims = info; const size_t *strides = info + num_dims; const size_t *sum_dims_l = info + 2 * num_dims; const size_t *sum_dims_s = info + 2 * num_dims + num_sum_dims; if (is_contiguous(num_dims, dims, strides)) { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[i]); } } else { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { unsigned strided_i = get_strided_index(i, num_dims, dims, strides); size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[strided_i]); } } }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^
  /opt/cuda/include/cuda_bf16.hpp(3802): note #3326-D: function "atomicAdd(__nv_bfloat162 *, __nv_bfloat162)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __nv_bfloat162 atomicAdd(__nv_bfloat162 *const address, const __nv_bfloat162 val)
                                                             ^
  /opt/cuda/include/cuda_fp16.hpp(3426): note #3326-D: function "atomicAdd(__half2 *, __half2)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __half2 atomicAdd(__half2 *const address, const __half2 val) {
                                                      ^
  /opt/cuda/include/sm_60_atomic_functions.hpp(292): note #3326-D: function "atomicAdd(double *, double)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) double atomicAdd(double *address, double val)
                                                     ^
  /opt/cuda/include/sm_20_atomic_functions.hpp(82): note #3326-D: function "atomicAdd(float *, float)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) float atomicAdd(float *address, float val)
                                                    ^
  /opt/cuda/include/device_atomic_functions.hpp(224): note #3326-D: function "atomicAdd(unsigned long long *, unsigned long long)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned long long int atomicAdd(unsigned long long int *address, unsigned long long int val)
                                                                     ^
  /opt/cuda/include/device_atomic_functions.hpp(110): note #3326-D: function "atomicAdd(unsigned int *, unsigned int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned int atomicAdd(unsigned int *address, unsigned int val)
                                                           ^
  /opt/cuda/include/device_atomic_functions.hpp(105): note #3326-D: function "atomicAdd(int *, int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) int atomicAdd(int *address, int val)
                                                  ^

  2 errors detected in the compilation of "src/reduce.cu".
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

  thread 'main' panicked at /home/np/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/bindgen_cuda-0.1.5/src/lib.rs:391:13:
  nvcc error while compiling "src/reduce.cu":

  # CLI "nvcc" "--gpu-architecture=sm_61" "--ptx" "--default-stream" "per-thread" "--output-directory" "/tmp/candle/target/release/build/candle-kernels-4037943e9d714c54/out" "-Isrc" "-I/opt/cuda/include" "-allow-unsupported-compiler" "-ccbin" "/usr/bin/g++-14" "src/reduce.cu" 

  # stdout


  # stderr

  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
``` 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=src/compatibility.cuh
  cargo:rerun-if-changed=src/cuda_utils.cuh
  cargo:rerun-if-changed=src/binary_op_macros.cuh
  cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
  cargo:rustc-env=CUDA_COMPUTE_CAP=61
  cargo:info=Builder { cuda_root: Some("/opt/cuda"), kernel_paths: ["src/affine.cu", "src/binary.cu", "src/cast.cu", "src/conv.cu", "src/fill.cu", "src/indexing.cu", "src/quantized.cu", "src/reduce.cu", "src/sort.cu", "src/ternary.cu", "src/unary.cu"], watch: [], include_paths: ["src/binary_op_macros.cuh", "src/compatibility.cuh", "src/cuda_utils.cuh"], compute_cap: Some(61), out_dir: "/tmp/candle/target/release/build/candle-kernels-4037943e9d714c54/out", extra_args: [] }
  cargo:rustc-env=CUDA_INCLUDE_DIR=/opt/cuda/include
  cargo:rerun-if-changed=src/binary_op_macros.cuh
  cargo:rerun-if-changed=src/compatibility.cuh
  cargo:rerun-if-changed=src/cuda_utils.cuh
  cargo:rerun-if-env-changed=NVCC_CCBIN
  cargo:rerun-if-changed=src/cast.cu
  cargo:rerun-if-changed=src/indexing.cu
  cargo:rerun-if-changed=src/affine.cu
  cargo:rerun-if-changed=src/conv.cu
  cargo:rerun-if-changed=src/binary.cu
  cargo:rerun-if-changed=src/sort.cu
  cargo:rerun-if-changed=src/quantized.cu
  cargo:rerun-if-changed=src/fill.cu
  cargo:rerun-if-changed=src/ternary.cu
  cargo:rerun-if-changed=src/reduce.cu
  cargo:rerun-if-changed=src/unary.cu

  --- stderr
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
  src/reduce.cu(612): error: no instance of overloaded function "atomicAdd" matches the argument list
              argument types are: (__half *, const __half)
    extern "C" __attribute__((global)) void sum_f16( const size_t numel, const size_t num_dims, const size_t num_sum_dims, const size_t *info, const __half *inp, __half *out) { const size_t *dims = info; const size_t *strides = info + num_dims; const size_t *sum_dims_l = info + 2 * num_dims; const size_t *sum_dims_s = info + 2 * num_dims + num_sum_dims; if (is_contiguous(num_dims, dims, strides)) { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[i]); } } else { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { unsigned strided_i = get_strided_index(i, num_dims, dims, strides); size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[strided_i]); } } }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             ^
  /opt/cuda/include/cuda_bf16.hpp(3802): note #3326-D: function "atomicAdd(__nv_bfloat162 *, __nv_bfloat162)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __nv_bfloat162 atomicAdd(__nv_bfloat162 *const address, const __nv_bfloat162 val)
                                                             ^
  /opt/cuda/include/cuda_fp16.hpp(3426): note #3326-D: function "atomicAdd(__half2 *, __half2)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __half2 atomicAdd(__half2 *const address, const __half2 val) {
                                                      ^
  /opt/cuda/include/sm_60_atomic_functions.hpp(292): note #3326-D: function "atomicAdd(double *, double)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) double atomicAdd(double *address, double val)
                                                     ^
  /opt/cuda/include/sm_20_atomic_functions.hpp(82): note #3326-D: function "atomicAdd(float *, float)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) float atomicAdd(float *address, float val)
                                                    ^
  /opt/cuda/include/device_atomic_functions.hpp(224): note #3326-D: function "atomicAdd(unsigned long long *, unsigned long long)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned long long int atomicAdd(unsigned long long int *address, unsigned long long int val)
                                                                     ^
  /opt/cuda/include/device_atomic_functions.hpp(110): note #3326-D: function "atomicAdd(unsigned int *, unsigned int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned int atomicAdd(unsigned int *address, unsigned int val)
                                                           ^
  /opt/cuda/include/device_atomic_functions.hpp(105): note #3326-D: function "atomicAdd(int *, int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) int atomicAdd(int *address, int val)
                                                  ^

  src/reduce.cu(612): error: no instance of overloaded function "atomicAdd" matches the argument list
              argument types are: (__half *, const __half)
    extern "C" __attribute__((global)) void sum_f16( const size_t numel, const size_t num_dims, const size_t num_sum_dims, const size_t *info, const __half *inp, __half *out) { const size_t *dims = info; const size_t *strides = info + num_dims; const size_t *sum_dims_l = info + 2 * num_dims; const size_t *sum_dims_s = info + 2 * num_dims + num_sum_dims; if (is_contiguous(num_dims, dims, strides)) { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[i]); } } else { for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < numel; i += blockDim.x * gridDim.x) { unsigned strided_i = get_strided_index(i, num_dims, dims, strides); size_t dst_index = i; for (unsigned int nd = 0; nd < num_sum_dims; ++nd) { size_t stride = sum_dims_s[nd]; size_t pre = dst_index / stride; size_t post = dst_index % stride; dst_index = (pre / sum_dims_l[nd]) * stride + post; } atomicAdd(out + dst_index, inp[strided_i]); } } }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^
  /opt/cuda/include/cuda_bf16.hpp(3802): note #3326-D: function "atomicAdd(__nv_bfloat162 *, __nv_bfloat162)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __nv_bfloat162 atomicAdd(__nv_bfloat162 *const address, const __nv_bfloat162 val)
                                                             ^
  /opt/cuda/include/cuda_fp16.hpp(3426): note #3326-D: function "atomicAdd(__half2 *, __half2)" does not match because argument #1 does not match parameter
    static __attribute__((device)) __inline__ __half2 atomicAdd(__half2 *const address, const __half2 val) {
                                                      ^
  /opt/cuda/include/sm_60_atomic_functions.hpp(292): note #3326-D: function "atomicAdd(double *, double)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) double atomicAdd(double *address, double val)
                                                     ^
  /opt/cuda/include/sm_20_atomic_functions.hpp(82): note #3326-D: function "atomicAdd(float *, float)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) float atomicAdd(float *address, float val)
                                                    ^
  /opt/cuda/include/device_atomic_functions.hpp(224): note #3326-D: function "atomicAdd(unsigned long long *, unsigned long long)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned long long int atomicAdd(unsigned long long int *address, unsigned long long int val)
                                                                     ^
  /opt/cuda/include/device_atomic_functions.hpp(110): note #3326-D: function "atomicAdd(unsigned int *, unsigned int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) unsigned int atomicAdd(unsigned int *address, unsigned int val)
                                                           ^
  /opt/cuda/include/device_atomic_functions.hpp(105): note #3326-D: function "atomicAdd(int *, int)" does not match because argument #1 does not match parameter
    static __inline__ __attribute__((device)) int atomicAdd(int *address, int val)
                                                  ^

  2 errors detected in the compilation of "src/reduce.cu".
  nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

  thread 'main' panicked at /home/np/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/bindgen_cuda-0.1.5/src/lib.rs:391:13:
  nvcc error while compiling "src/reduce.cu":

  # CLI "nvcc" "--gpu-architecture=sm_61" "--ptx" "--default-stream" "per-thread" "--output-directory" "/tmp/candle/target/release/build/candle-kernels-4037943e9d714c54/out" "-Isrc" "-I/opt/cuda/include" "-allow-unsupported-compiler" "-ccbin" "/usr/bin/g++-14" "src/reduce.cu" 

  # stdout


  # stderr

  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

In resolving conflicts, i added fallback fp8 related #2989

Use warp-local lane indices when interpreting __activemask() during warp XOR reductions, and guard the second-stage shared-memory reduction for partial warp counts.

… tests

haricot · 2026-04-04T11:15:36Z

I force-pushed a cleaned up version of the branch and would appreciate another review pass.

The history is now split into focused commits for capability detection, legacy BF16/FP8 support, cuDNN fallbacks, MoE runtime changes, and regression coverage. I also removed the local candle-kernels MoE test harness and the redundant sigmoid_f16 / sigmoid_bf16 checks to keep the series easier to review.

The goal of this PR is to make Candle run more reliably on older NVIDIA GPUs by adding legacy BF16/FP8 fallback paths, improving capability detection, preserving numerical behavior where possible, and falling back from failing cuDNN convolution launches. This is mainly aimed at pre-Ampere cards.

Example invocation on older GPU using ALLOW_LEGACY all or bf16,fp8:
ALLOW_LEGACY="all" cargo run -F "cuda,cudnn" --example ...

sempervictus · 2026-05-11T21:07:57Z

If you're going all the way with this @guoqingbao now has an emulated FP4 impl as well which works great on SM70 anyway :-)

… builds, and add debug symbols configuration

…tion kernels

# Conflicts: # candle-kernels/build.rs # candle-kernels/src/compatibility.cuh

…v_bfloat16

…8 and BF16 stability checks

haricot force-pushed the bf16_candle branch from 108eca7 to 46042d4 Compare January 17, 2025 08:35

haricot marked this pull request as draft January 18, 2025 09:31

haricot force-pushed the bf16_candle branch 9 times, most recently from 33dcb9a to d5f31ea Compare January 23, 2025 16:10

haricot changed the title ~~add cuda fallback bf16 for compute_cap < 8.0~~ add cuda fallback bf16 for compute_cap >=530 <800 Jan 23, 2025

haricot force-pushed the bf16_candle branch from 0267ef1 to d5f31ea Compare October 11, 2025 23:17

haricot changed the title ~~add cuda fallback bf16 for compute_cap >=530 <800~~ CUDA: Add full BF16 and FP8 fallback support for compute >=530 Oct 12, 2025

haricot marked this pull request as ready for review October 12, 2025 11:48

haricot changed the title ~~CUDA: Add full BF16 and FP8 fallback support for compute >=530~~ CUDA: Add (full BF16) and FP8 fallback support for compute >=530 Oct 15, 2025

ivarflakstad reviewed Oct 24, 2025

View reviewed changes

Comment thread candle-kernels/src/affine.cu Outdated

haricot marked this pull request as draft October 30, 2025 12:27

haricot changed the title ~~CUDA: Add (full BF16) and FP8 fallback support for compute >=530~~ CUDA: Add (full BF16) and FP8 support for CC < 700 Jan 26, 2026

haricot force-pushed the bf16_candle branch 5 times, most recently from 78b6108 to d5f31ea Compare January 26, 2026 19:41

haricot changed the title ~~CUDA: Add (full BF16) and FP8 support for CC < 700~~ CUDA: Add (full BF16) and FP8 support for CC < 700 and 800 Feb 1, 2026

fix(cuda): correct warp reductions in rmsnorm/layernorm kernels

56819ae

Use warp-local lane indices when interpreting __activemask() during warp XOR reductions, and guard the second-stage shared-memory reduction for partial warp counts.

haricot force-pushed the bf16_candle branch from 506175a to de5de34 Compare April 2, 2026 20:14

haricot added 8 commits April 3, 2026 15:41

build(cuda): detect kernel capabilities and build legacy MoE kernels

bfce830

fix(cuda): gate legacy BF16/FP8 kernels and optimize f16 unary ops

039a46d

fix(cuda): fall back from cuDNN conv1d launch failures

e0675a4

feat(cuda): add MoE kernel dispatch and support contiguous offset views

9ca1522

test(cuda): add BF16, FP8, and MoE regression coverage

32dce2d

fix(cuda): fall back from cuDNN conv2d launch failures

d0f87b7

fix(cuda): zero-initialize MoE outputs and refine backend selection

e6f59e8

fix(cuda): tighten legacy atomicAdd fallbacks and BF16/FP8 regression…

633c9c6

… tests

haricot force-pushed the bf16_candle branch from de5de34 to 633c9c6 Compare April 3, 2026 18:26

haricot marked this pull request as ready for review April 3, 2026 18:42

haricot added 2 commits April 3, 2026 20:51

chore(cuda): drop candle-kernels cudarc test harness

fd54c03

test(nn): drop redundant f16 and bf16 sigmoid checks

e1f418a

This was referenced Apr 3, 2026

feat(cuda): add opt-in legacy BF16 and FP8 EricLBuehler/mistral.rs#1842

Open

Support for Qwen 3.5 Model Series (Dense and Quantized) #3396

Open

haricot closed this Apr 5, 2026

haricot reopened this Apr 5, 2026

haricot added 2 commits April 22, 2026 14:09

fix logic legacy bf16/fp8 macro checks in kernels

b269f85

use cudaforge 0.1.6 (patch.crates-io) for kernel build watch conf

8a2f70d

haricot added 5 commits May 19, 2026 20:02

fix: add fallback bf16 min/max for pre-Ampere GPUs, allow legacy bf16…

35de109

… builds, and add debug symbols configuration

build: update cudaforge dependency to specific git revision

ad332f0

feat: add candle-tiled-attn package with CUDA support for tiled atten…

ee2e6ac

…tion kernels

Merge remote-tracking branch 'origin/main' into bf16_candle

b80ed47

# Conflicts: # candle-kernels/build.rs # candle-kernels/src/compatibility.cuh

refactor: remove deprecated nan handling functions for __half and __n…

18e33a7

…v_bfloat16

haricot marked this pull request as draft June 11, 2026 07:10

haricot added 2 commits June 11, 2026 09:41

tiled_attn attn test

65b7c6c

test: add CUDA tests for MoE and tiled attention decoding, improve FP…

d746eac

…8 and BF16 stability checks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

build(cuda): register compute capability cfgs and optional legacy BF16/FP8 emulation#2704

build(cuda): register compute capability cfgs and optional legacy BF16/FP8 emulation#2704
haricot wants to merge 22 commits into
huggingface:mainfrom
haricot:bf16_candle

haricot commented Jan 7, 2025

Uh oh!

LaurentMazare commented Jan 7, 2025

Uh oh!

haricot commented Jan 7, 2025

Uh oh!

LaurentMazare commented Jan 15, 2025

Uh oh!

haricot commented Jan 16, 2025

Uh oh!

haricot commented Oct 12, 2025 •

edited

Loading

Uh oh!

Uh oh!

haricot commented Apr 4, 2026

Uh oh!

sempervictus commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

haricot commented Jan 7, 2025

Uh oh!

LaurentMazare commented Jan 7, 2025

Uh oh!

haricot commented Jan 7, 2025

Uh oh!

LaurentMazare commented Jan 15, 2025

Uh oh!

haricot commented Jan 16, 2025

Uh oh!

haricot commented Oct 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

haricot commented Apr 4, 2026

Uh oh!

sempervictus commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

haricot commented Oct 12, 2025 •

edited

Loading