Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
When compiling a simple CuTe kernel with GPUArch("sm_100f"), the process aborts with a segfault. The same code compiles and runs correctly when using GPUArch("sm_100a").
!!!!!!! Segfault encountered !!!!!!!
File "<unknown>", line 0, in TVMFFIFunctionCall
File "/usr/local/src/conda/python-3.13.13/Objects/descrobject.c", line 522, in wrapperdescr_raw_call
File "/usr/local/src/conda/python-3.13.13/Objects/descrobject.c", line 569, in wrapperdescr_call
File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 361, in _PyObject_Call
File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 373, in PyObject_Call
File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 381, in PyCFunction_Call
File "/usr/local/src/conda/python-3.13.13/Python/generated_cases.c.h", line 1362, in _PyEval_EvalFrameDefault
File "/usr/local/src/conda/python-3.13.13/Include/internal/pycore_ceval.h", line 120, in _PyEval_EvalFrame
File "/usr/local/src/conda/python-3.13.13/Python/ceval.c", line 1820, in _PyEval_Vector
File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 413, in _PyFunction_Vectorcall
File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 135, in _PyObject_VectorcallDictTstate
File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 504, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.13.13/Objects/typeobject.c", line 9581, in slot_tp_call
File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 242, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.13.13/Python/generated_cases.c.h", line 813, in _PyEval_EvalFrameDefault
File "/usr/local/src/conda/python-3.13.13/Python/ceval.c", line 604, in PyEval_EvalCode
File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1381, in run_eval_code_obj
File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1489, in run_mod
File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1295, in pyrun_file
File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 517, in _PyRun_SimpleFileObject
File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 77, in _PyRun_AnyFileObject
File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 410, in pymain_run_file_obj
File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 429, in pymain_run_file
File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 696, in pymain_run_python
File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 775, in Py_RunMain
File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 829, in Py_BytesMain
File "<unknown>", line 0, in _start
File "<unknown>", line 0, in 0xffffffffffffffff
[1] 3197558 segmentation fault (core dumped)
Steps/Code to reproduce bug
import torch
import cutlass.cute as cute
from cutlass.cute import EnableTVMFFI, GPUArch

@cute.kernel
def print_tensor_kernel(a: cute.Tensor):
    cute.printf("a: {}", a)

@cute.jit
def print_tensor(a: cute.Tensor):
    print_tensor_kernel(a).launch(grid=(1, 1, 1), block=(1, 1, 1))

x = torch.randn(2, 3).cuda()
compiled_func = cute.compile[EnableTVMFFI(True), GPUArch("sm_100f")](print_tensor, cute.runtime.from_dlpack(x, enable_tvm_ffi=True))
print(compiled_func(x))
Expected behavior
Compilation and execution should succeed for sm_100f (Blackwell sub-arch) just as it does for sm_100a.
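To compare the two architectures without the segfault taking down the driving process, the repro can be run in a child interpreter and classified by its exit status. This is a minimal sketch using only the standard library; `run_isolated` is a hypothetical helper name, not part of CuTe DSL or TVM FFI, and the signal check assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_isolated(source: str, timeout: float = 120.0) -> str:
    """Run Python source in a child interpreter and classify how it exited."""
    proc = subprocess.run(
        # -X faulthandler asks the child to dump a Python traceback on fatal signals.
        [sys.executable, "-X", "faulthandler", "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"python error (exit code {proc.returncode})"
```

Passing the repro script as `source`, once with `GPUArch("sm_100f")` and once with `GPUArch("sm_100a")`, would then report a crash as a signal name while an ordinary Python exception shows up as a non-zero exit code.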
Environment details (please complete the following information):
- Environment location: Nebius cloud, B200 node.
- CUDA Version: 12.9
- CuTe DSL Version: 4.5.2
- TVM FFI Version: 0.1.11
- Python Version: 3.13.13
- NVIDIA Driver Version: 570.211.01
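For anyone trying to reproduce this, the Python-side versions listed above can be gathered with a short script. This is a sketch, not an official tool: the distribution names `nvidia-cutlass-dsl` and `apache-tvm-ffi` are assumptions about how the packages are published, and any package that is not installed is reported as `None`.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return installed versions for the given distributions, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names below are assumptions; adjust them to match the environment.
PACKAGES = ["nvidia-cutlass-dsl", "apache-tvm-ffi", "torch"]
print(collect_versions(PACKAGES))
```

The CUDA and driver versions are not covered by this script and still need to be read from the system separately.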
Additional context
- With TVM FFI disabled, there is no segfault; calling the compiled function raises a DSLRuntimeError instead:
Traceback (most recent call last):
  File "/home/taebum/test.py", line 15, in <module>
    print(compiled_func(x))
          ~~~~~~~~~~~~~^^^
  File "/home/taebum/program/miniconda3/envs/lib/python3.13/site-packages/nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/jit_executor.py", line 864, in _validate_engine
    raise DSLRuntimeError(
    ...<2 lines>...
    )
cutlass.base_dsl.common.DSLRuntimeError: DSLRuntimeError: The compiled function does not have a valid execution engine.
💡 Suggestions:
For cross-compilation, please use `JitCompiledFunction.export_to_c` to serialize the compiled function and load/execute it on target device.