build(cuda): register compute capability cfgs and optional legacy BF16/FP8 emulation #2704
haricot wants to merge 22 commits into
|
Shouldn't the new code be in some |
|
You are certainly right. I do not hit the `__CUDA_ARCH__ >= 800` condition in this control flow on my hardware, but it probably triggers a CUDA function error that is already present. I will review this. |
|
I think the current code is likely to result in lots of compile failures with cuda compute cap >= 8.0. |
|
I hope that the latest additions will allow to work on |
33dcb9a to d5f31ea
|
With this, at a similar tokens/s in f16 and bf16 (for short sentences), the bf16 results are identical to those of the original bf16 model despite the fallbacks: numerical fidelity is preserved and model behavior is unchanged. Tests:
Issue:
While resolving conflicts, I added the FP8 fallback related to #2989. |
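To illustrate why a BF16 fallback can reproduce native results, here is a minimal host-side sketch in Rust of the usual emulation idea: widen bf16 to f32, compute there, and round back with round-to-nearest-even. The function names are hypothetical and this is not the kernel code from the PR, which lives in CUDA.

```rust
/// Widen a bf16 bit pattern to f32. bf16 is the upper 16 bits of an f32,
/// so the conversion is an exact shift.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow an f32 to a bf16 bit pattern using round-to-nearest-even.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep the sign and force the quiet bit so truncation cannot
        // turn a NaN into an infinity.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit: ties round to the even value.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}

/// Emulated bf16 addition (hypothetical helper): compute in f32, round back.
fn bf16_add_emulated(a: u16, b: u16) -> u16 {
    f32_to_bf16_bits(bf16_bits_to_f32(a) + bf16_bits_to_f32(b))
}
```

For example, `bf16_add_emulated(0x3F80, 0x3F80)` adds 1.0 and 1.0 and returns `0x4000`, the bf16 pattern for 2.0. Because f32 carries far more significand bits than bf16, a single add or multiply done this way can round to the same bf16 value a native instruction would produce.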
78b6108 to d5f31ea
Use warp-local lane indices when interpreting __activemask() during
warp XOR reductions, and guard the second-stage shared-memory reduction
for partial warp counts.
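
The commit message above can be modeled on the host. The following Rust sketch is a simulation, not the CUDA kernel from the PR: it assumes a 32-lane warp, treats the mask the way a value returned by `__activemask()` is laid out (bit `i` refers to lane `i` of the warp), and uses hypothetical function names throughout.

```rust
const WARP_SIZE: u32 = 32;

/// Lane index within the warp. Mask bits refer to this index,
/// not to the block-wide thread index.
fn lane_id(thread_idx: u32) -> u32 {
    thread_idx % WARP_SIZE
}

/// True if the XOR partner of `thread_idx` at `offset` is active in `mask`.
/// Using the raw thread index here would test the wrong bit (or shift out
/// of range) for every warp after the first.
fn partner_is_active(mask: u32, thread_idx: u32, offset: u32) -> bool {
    let partner = lane_id(thread_idx) ^ offset;
    (mask >> partner) & 1 == 1
}

/// Host-side model of an XOR (butterfly) sum over one warp. A lane only
/// accumulates from its partner when that partner is active.
fn warp_xor_sum(values: &[f32; 32], mask: u32) -> [f32; 32] {
    let mut cur = *values;
    let mut offset = WARP_SIZE / 2;
    while offset > 0 {
        let prev = cur; // all lanes read the values from the previous step
        for lane in 0..WARP_SIZE {
            let active = (mask >> lane) & 1 == 1;
            if active && partner_is_active(mask, lane, offset) {
                cur[lane as usize] =
                    prev[lane as usize] + prev[(lane ^ offset) as usize];
            }
        }
        offset /= 2;
    }
    cur
}

/// Second-stage guard: only the first `num_warps` slots of the per-warp
/// buffer hold partial sums, so other lanes contribute the identity.
fn second_stage_input(shared: &[f32], lane: u32, num_warps: u32) -> f32 {
    if lane < num_warps { shared[lane as usize] } else { 0.0 }
}
```

With all 32 inputs set to 1.0, a full mask leaves 32.0 in lane 0, and a mask covering only lanes 0 to 19 leaves 20.0 there. The second-stage guard matters when the block size is not a multiple of 32 warps' worth of threads, since unused slots of the shared buffer would otherwise be read as if they held partial sums.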
|
I force-pushed a cleaned-up version of the branch and would appreciate another review pass. The history is now split into focused commits for capability detection, legacy BF16/FP8 support, cuDNN fallbacks, MoE runtime changes, and regression coverage. I also removed the local

The goal of this PR is to make Candle run more reliably on older NVIDIA GPUs by adding legacy BF16/FP8 fallback paths, improving capability detection, preserving numerical behavior where possible, and falling back from failing cuDNN convolution launches. This is mainly aimed at pre-Ampere cards.

Example invocation on an older GPU using ALLOW_LEGACY all or bf16,fp8: |
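
Separately from the invocation, the sketch below shows one way a build script could interpret an `ALLOW_LEGACY` value such as `all` or `bf16,fp8` and turn it into cfg registrations. It is an assumption-laden illustration: `LegacyFlags`, `parse_allow_legacy`, `cfg_directives`, and the cfg names `allow_legacy_bf16` and `allow_legacy_fp8` are hypothetical and not taken from the PR. Only the `cargo:rustc-cfg` and `cargo:rustc-check-cfg` directive formats are standard Cargo build-script output.

```rust
/// Which legacy emulation paths are requested (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq, Clone, Copy)]
struct LegacyFlags {
    bf16: bool,
    fp8: bool,
}

/// Parse a comma-separated value such as "all" or "bf16,fp8".
/// Unknown or empty tokens are ignored in this sketch.
fn parse_allow_legacy(value: &str) -> LegacyFlags {
    let mut flags = LegacyFlags::default();
    for token in value.split(',').map(|t| t.trim().to_ascii_lowercase()) {
        match token.as_str() {
            "all" => {
                flags.bf16 = true;
                flags.fp8 = true;
            }
            "bf16" => flags.bf16 = true,
            "fp8" => flags.fp8 = true,
            _ => {}
        }
    }
    flags
}

/// Build the lines a build script would print for Cargo. Every cfg name is
/// declared via check-cfg, and only the enabled ones are actually set.
fn cfg_directives(flags: LegacyFlags) -> Vec<String> {
    let mut out = Vec::new();
    let table = [
        ("allow_legacy_bf16", flags.bf16), // hypothetical cfg name
        ("allow_legacy_fp8", flags.fp8),   // hypothetical cfg name
    ];
    for (name, enabled) in table {
        out.push(format!("cargo:rustc-check-cfg=cfg({name})"));
        if enabled {
            out.push(format!("cargo:rustc-cfg={name}"));
        }
    }
    out
}
```

In a real `build.rs`, the value would typically come from `std::env::var("ALLOW_LEGACY")` and each returned line would be printed with `println!`. Declaring the cfg names even when they are disabled keeps the compiler's unexpected-cfg lint quiet on code gated by them.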
|
If you're going all the way with this, @guoqingbao now has an emulated FP4 impl as well, which works great on SM70 anyway :-) |
… builds, and add debug symbols configuration
# Conflicts:
#	candle-kernels/build.rs
#	candle-kernels/src/compatibility.cuh
…8 and BF16 stability checks
tested and works with:
related EricLBuehler#57