[BUG][CuTeDSL] AOT compilation fails with `GPUArch("sm_100f")` + TVM-FFI on B200 node

### Which component has the problem?

CuTe DSL

### Bug Report

**Describe the bug**
Compiling a simple CuTe kernel with `GPUArch("sm_100f")` and TVM-FFI enabled crashes the Python process with a segfault; the innermost frame in the trace is `TVMFFIFunctionCall`. The same code compiles and runs correctly with `GPUArch("sm_100a")`.

```
!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in TVMFFIFunctionCall
  File "/usr/local/src/conda/python-3.13.13/Objects/descrobject.c", line 522, in wrapperdescr_raw_call
  File "/usr/local/src/conda/python-3.13.13/Objects/descrobject.c", line 569, in wrapperdescr_call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 361, in _PyObject_Call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 373, in PyObject_Call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 381, in PyCFunction_Call
  File "/usr/local/src/conda/python-3.13.13/Python/generated_cases.c.h", line 1362, in _PyEval_EvalFrameDefault
  File "/usr/local/src/conda/python-3.13.13/Include/internal/pycore_ceval.h", line 120, in _PyEval_EvalFrame
  File "/usr/local/src/conda/python-3.13.13/Python/ceval.c", line 1820, in _PyEval_Vector
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 413, in _PyFunction_Vectorcall
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 135, in _PyObject_VectorcallDictTstate
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 504, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.13.13/Objects/typeobject.c", line 9581, in slot_tp_call
  File "/usr/local/src/conda/python-3.13.13/Objects/call.c", line 242, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.13.13/Python/generated_cases.c.h", line 813, in _PyEval_EvalFrameDefault
  File "/usr/local/src/conda/python-3.13.13/Python/ceval.c", line 604, in PyEval_EvalCode
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1381, in run_eval_code_obj
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1489, in run_mod
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 1295, in pyrun_file
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 517, in _PyRun_SimpleFileObject
  File "/usr/local/src/conda/python-3.13.13/Python/pythonrun.c", line 77, in _PyRun_AnyFileObject
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 410, in pymain_run_file_obj
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 429, in pymain_run_file
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 696, in pymain_run_python
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 775, in Py_RunMain
  File "/usr/local/src/conda/python-3.13.13/Modules/main.c", line 829, in Py_BytesMain
  File "<unknown>", line 0, in _start
  File "<unknown>", line 0, in 0xffffffffffffffff

[1]    3197558 segmentation fault (core dumped)
```

**Steps/Code to reproduce bug**

```
import torch
import cutlass.cute as cute
from cutlass.cute import EnableTVMFFI, GPUArch

# Device kernel: prints the tensor it receives.
@cute.kernel
def print_tensor_kernel(a: cute.Tensor):
    cute.printf("a: {}", a)

# Host-side JIT function: launches the kernel with a single thread.
@cute.jit
def print_tensor(a: cute.Tensor):
    print_tensor_kernel(a).launch(grid=(1, 1, 1), block=(1, 1, 1))

x = torch.randn(2, 3).cuda()

# Compile for sm_100f with TVM-FFI enabled; replacing "sm_100f" with "sm_100a" works.
compiled_func = cute.compile[EnableTVMFFI(True), GPUArch("sm_100f")](
    print_tensor, cute.runtime.from_dlpack(x, enable_tvm_ffi=True)
)
print(compiled_func(x))
```
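To compare the two targets side by side without the segfault killing the driving process, each variant could be run in its own subprocess. The following is only a sketch under stated assumptions: `repro.py` is a hypothetical copy of the script above, modified to read the arch string from `sys.argv[1]`, and `describe_exit`, `run_variant`, and `compare_archs` are illustrative helper names, not part of CuTe DSL or TVM-FFI.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exit code {returncode}"


def run_variant(cmd: list[str], timeout: float = 300.0) -> str:
    """Run one variant in a separate process so a crash stays isolated."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return describe_exit(proc.returncode)


def compare_archs(script: str, archs: tuple[str, ...]) -> dict[str, str]:
    """Run the (hypothetical) repro script once per arch and collect statuses."""
    return {arch: run_variant([sys.executable, script, arch]) for arch in archs}
```

Calling `compare_archs("repro.py", ("sm_100a", "sm_100f"))` on the affected node would then return one status string per target, which makes it easy to paste a compact comparison into the report.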

**Expected behavior**
Compilation and execution should succeed for `sm_100f` (Blackwell sub-arch) just as it does for `sm_100a`.

**Environment details (please complete the following information):**
 - Environment location: Nebius cloud, B200 node
 - CUDA Version: 12.9
 - CuTe DSL Version: 4.5.2
 - TVM FFI Version: 0.1.11
 - Python Version: 3.13.13
 - NVIDIA Driver Version: 570.211.01


**Additional context**

- With TVM-FFI disabled, the call does not segfault but fails with a `DSLRuntimeError` instead:
```
Traceback (most recent call last):
  File "/home/taebum/test.py", line 15, in <module>
    print(compiled_func(x))
          ~~~~~~~~~~~~~^^^
  File "/home/taebum/program/miniconda3/envs/lib/python3.13/site-packages/nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/jit_executor.py", line 864, in _validate_engine
    raise DSLRuntimeError(
    ...<2 lines>...
    )
cutlass.base_dsl.common.DSLRuntimeError: DSLRuntimeError: The compiled function does not have a valid execution engine.
💡 Suggestions:
 For cross-compilation, please use `JitCompiledFunction.export_to_c` to serialize the compiled function and load/execute it on target device.
```

