[BUG][CuTeDSL] AOT compilation fails with GPUArch("sm_100f") + TVM-FFI on B200 node #3298

@ktaebum

Which component has the problem?

CuTe DSL

Bug Report

Describe the bug

When compiling a simple CuTe kernel with GPUArch("sm_100f"), the process aborts with a segfault. The same code compiles and runs correctly when using GPUArch("sm_100a").

!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in TVMFFIFunctionCall
  File "/usr/local/src/conda/python-3.13.13/Objects/descrobject.c", line 522, in wrapperdescr_raw_call
  File "/usr/local/src/conda/python-3.13.13/Objects/descrobject.c", line 569, in wrapperdescr_call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 361, in _PyObject_Call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 373, in PyObject_Call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 381, in PyCFunction_Call
  File "/usr/local/src/conda/python-3.13.13/Python/generated_cases.c.h", line 1362, in _PyEval_EvalFrameDefault
  File "/usr/local/src/conda/python-3.13.13/Include/internal/pycore_ceval.h", line 120, in _PyEval_EvalFrame
  File "/usr/local/src/conda/python-3.13.13/Python/ceval.c", line 1820, in _PyEval_Vector
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 413, in _PyFunction_Vectorcall
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 135, in _PyObject_VectorcallDictTstate
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 504, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.13.13/Objects/typeobject.c", line 9581, in slot_tp_call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 242, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.13.13/Python/generated_cases.c.h", line 813, in _PyEval_EvalFrameDefault
  File "/usr/local/src/conda/python-3.13.13/Python/ceval.c", line 604, in PyEval_EvalCode
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1381, in run_eval_code_obj
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1489, in run_mod
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1295, in pyrun_file
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 517, in _PyRun_SimpleFileObject
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 77, in _PyRun_AnyFileObject
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 410, in pymain_run_file_obj
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 429, in pymain_run_file
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 696, in pymain_run_python
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 775, in Py_RunMain
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 829, in Py_BytesMain
  File "<unknown>", line 0, in _start
  File "<unknown>", line 0, in 0xffffffffffffffff

[1]    3197558 segmentation fault (core dumped)

Steps/Code to reproduce bug

     import torch
     import cutlass.cute as cute
     from cutlass.cute import EnableTVMFFI, GPUArch

     @cute.kernel
     def print_tensor_kernel(a: cute.Tensor):
         cute.printf("a: {}", a)

     @cute.jit
     def print_tensor(a: cute.Tensor):
         print_tensor_kernel(a).launch(grid=(1, 1, 1), block=(1, 1, 1))

     x = torch.randn(2, 3).cuda()
     compiled_func = cute.compile[EnableTVMFFI(True), GPUArch("sm_100f")](print_tensor, cute.runtime.from_dlpack(x, enable_tvm_ffi=True))
     print(compiled_func(x))

Expected behavior

Compilation and execution should succeed for sm_100f (Blackwell sub-arch) just as it does for sm_100a.
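For anyone comparing the two arch strings side by side: the segfault kills the interpreter, so each attempt has to run in its own child process. Below is a minimal sketch of such a driver, not part of the original report. `REPRO_TEMPLATE`, `run_isolated`, `describe`, and `compare_archs` are hypothetical helper names, the `__ARCH__` placeholder is an illustration device, and the embedded script is simply the reproducer above with the arch string left open. The negative-return-code check assumes a POSIX host.

```python
import signal
import subprocess
import sys

# Hypothetical template: the reproducer from this issue, with the arch string
# replaced by the placeholder __ARCH__. str.replace is used instead of
# str.format because the kernel's printf format string contains "{}".
REPRO_TEMPLATE = '''
import torch
import cutlass.cute as cute
from cutlass.cute import EnableTVMFFI, GPUArch

@cute.kernel
def print_tensor_kernel(a: cute.Tensor):
    cute.printf("a: {}", a)

@cute.jit
def print_tensor(a: cute.Tensor):
    print_tensor_kernel(a).launch(grid=(1, 1, 1), block=(1, 1, 1))

x = torch.randn(2, 3).cuda()
compiled_func = cute.compile[EnableTVMFFI(True), GPUArch("__ARCH__")](
    print_tensor, cute.runtime.from_dlpack(x, enable_tvm_ffi=True)
)
compiled_func(x)
'''


def run_isolated(source: str, timeout: float = 600.0) -> int:
    """Run `source` in a fresh interpreter and return its exit status."""
    proc = subprocess.run([sys.executable, "-c", source], timeout=timeout)
    return proc.returncode


def describe(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"


def compare_archs(archs=("sm_100a", "sm_100f")) -> dict:
    """Run the reproducer once per arch string, each in its own process."""
    return {
        arch: describe(run_isolated(REPRO_TEMPLATE.replace("__ARCH__", arch)))
        for arch in archs
    }
```

On a machine with the same environment, `print(compare_archs())` reports one status per arch string without the crash for one arch hiding the result for the other.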

Environment details (please complete the following information):

  • Environment location: Nebius cloud, B200 node
  • CUDA Version: 12.9
  • CuTe DSL Version: 4.5.2
  • TVM FFI Version: 0.1.11
  • Python Version: 3.13.13
  • NVIDIA Driver Version: 570.211.01

Additional context

  • With TVM-FFI disabled, the script does not segfault; instead, the call to compiled_func(x) raises a DSLRuntimeError:
Traceback (most recent call last):
  File "/home/taebum/test.py", line 15, in <module>
    print(compiled_func(x))
          ~~~~~~~~~~~~~^^^
  File "/home/taebum/program/miniconda3/envs/lib/python3.13/site-packages/nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/jit_executor.py", line 864, in _validate_engine
    raise DSLRuntimeError(
    ...<2 lines>...
    )
cutlass.base_dsl.common.DSLRuntimeError: DSLRuntimeError: The compiled function does not have a valid execution engine.
💡 Suggestions:
 For cross-compilation, please use `JitCompiledFunction.export_to_c` to serialize the compiled function and load/execute it on target device.
