[BUG] NVRTC build fails for `sm90_tf32_hc_prenorm_gemm` — `cudaGridDependencySynchronize` undefined (PDL intrinsic not available under NVRTC)

### Describe the bug

When the NVRTC JIT backend is enabled (`DG_JIT_USE_NVRTC=1`), the HyperConnection prenorm GEMM kernel fails to compile because it calls the PDL (Programmatic Dependent Launch) device function `cudaGridDependencySynchronize()`, which is not declared in an NVRTC translation unit by default. The kernel calls it unconditionally and does not include the device-runtime header itself, so it compiles only under NVCC, which implicitly includes `cuda_runtime.h` and, through it, the device-runtime declarations.

NVRTC compilation aborts with:

```
deep_gemm/include/deep_gemm/impls/sm90_tf32_hc_prenorm_gemm.cuh(120): error:
  identifier "cudaGridDependencySynchronize" is undefined
1 error detected in the compilation of "kernel.cu".
```

The failed build then surfaces to the caller as an invalid kernel image (cubin):

```
CUDA driver error (csrc/.../jit/handle.hpp:141): 200 (CUDA_ERROR_INVALID_IMAGE,
  device kernel image is invalid)
```

### Offending code

`deep_gemm/include/deep_gemm/impls/sm90_tf32_hc_prenorm_gemm.cuh` (~line 120):

```cpp
    // Wait for primary kernel completion
    cudaGridDependencySynchronize();
```

There is no `#if !defined(__CUDACC_RTC__)` guard, and the kernel does not `#include <cuda_device_runtime_api.h>`. The PDL device functions (`cudaGridDependencySynchronize`, `cudaTriggerProgrammaticLaunchCompletion`) are declared in that header, which NVCC pulls in through its implicit `cuda_runtime.h` include; NVRTC performs no such implicit include. The same pattern likely affects `sm100_tf32_hc_prenorm_gemm.cuh`.
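
One possible workaround, as a minimal sketch rather than a tested patch: issue the PDL instructions through inline PTX so the kernel no longer depends on the runtime header being present. The PTX ISA documents `griddepcontrol.wait` and `griddepcontrol.launch_dependents` for sm_90 and newer. The helper names (`pdl_wait_primary`, `pdl_launch_dependents`) and macros (`DG_PDL_DEVICE`, `DG_PDL_AVAILABLE`) are hypothetical and do not exist in DeepGEMM; whether the maintainers would prefer this over adding an explicit include for the NVRTC path is an open question.

```cpp
// Hypothetical helpers, not part of DeepGEMM: PDL via inline PTX, so the
// code compiles under both NVCC and NVRTC without the runtime header.

#if defined(__CUDACC__) || defined(__CUDACC_RTC__)
#define DG_PDL_DEVICE __device__ __forceinline__
#else
#define DG_PDL_DEVICE inline  // lets this header also parse in a host-only build
#endif

// PDL instructions exist only in device code for sm_90 and newer.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
#define DG_PDL_AVAILABLE 1
#else
#define DG_PDL_AVAILABLE 0
#endif

// Intended as a drop-in for cudaGridDependencySynchronize():
// wait until the primary grid this launch depends on has completed.
DG_PDL_DEVICE void pdl_wait_primary() {
#if DG_PDL_AVAILABLE
    asm volatile("griddepcontrol.wait;" ::: "memory");
#endif
}

// Intended as a drop-in for cudaTriggerProgrammaticLaunchCompletion():
// signal that dependent grids may begin launching.
DG_PDL_DEVICE void pdl_launch_dependents() {
#if DG_PDL_AVAILABLE
    asm volatile("griddepcontrol.launch_dependents;" ::: "memory");
#endif
}
```

With helpers like these, the call at ~line 120 would become `pdl_wait_primary();`. On architectures below sm_90 and in host-only builds both helpers compile to no-ops; the plain `inline` fallback is only there so the header parses outside a CUDA compiler.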

### Environment

- DeepGEMM: 2.5.0 (`891d57b`)
- Compiler path: NVRTC (`DG_JIT_USE_NVRTC=1`); the log also shows `/usr/local/cuda/bin/nvcc: not found` in this image
- CUDA Toolkit: 12.9.1
- GPU: NVIDIA H100 (SM90)
- PyTorch: 2.11.0 (cu129), Python 3.12
- Invoked via vLLM 0.22.1 (DeepSeek-V4 FP8)

### Stacktrace

```
(APIServer pid=45) INFO 06-12 19:40:02 [config.py:803] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=45) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(APIServer pid=45) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(APIServer pid=45) INFO 06-12 19:40:20 [model.py:617] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=45) INFO 06-12 19:40:20 [model.py:1752] Using max model len 523264
(APIServer pid=45) INFO 06-12 19:40:21 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=45) INFO 06-12 19:40:21 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=45) INFO 06-12 19:40:21 [vllm.py:977] Asynchronous scheduling is enabled.
(APIServer pid=45) INFO 06-12 19:40:21 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=45) WARNING 06-12 19:40:27 [vllm.py:1396] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implemented the `reasoning_start_str` and `reasoning_end_str`.
(APIServer pid=45) WARNING 06-12 19:40:27 [vllm.py:1471] Turning off hybrid kv cache manager because connector LMCacheConnectorV1 does not subclass `SupportsHMA`. This will reduce performance on models with sliding window or Mamba attention. See kv_connector/v1/base.py for details.
(APIServer pid=45) INFO 06-12 19:40:27 [compilation.py:312] Enabled custom fusions: norm_quant, act_quant, allreduce_rms
(APIServer pid=45) WARNING 06-12 19:40:27 [system_utils.py:157] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=515) INFO 06-12 19:40:36 [core.py:112] Initializing a V1 LLM engine (v0.22.1) with config: model='deepseek-ai/deepseek-v4-flash', speculative_config=None, tokenizer='deepseek-ai/deepseek-v4-flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=523264, download_dir='/workspace/weights', load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-v4-flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 'vllm::gdn_attention_core_xpu', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=515) WARNING 06-12 19:40:36 [multiproc_executor.py:1029] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=515) INFO 06-12 19:40:36 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.3.29 (local), world_size=2, local_world_size=2
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=677) INFO 06-12 19:40:45 [parallel_state.py:1422] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48603 backend=nccl
(Worker pid=678) INFO 06-12 19:40:45 [parallel_state.py:1422] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:48603 backend=nccl
(Worker pid=677) INFO 06-12 19:40:45 [pynccl.py:113] vLLM is using nccl==2.28.9
(Worker pid=678) WARNING 06-12 19:40:45 [symm_mem.py:106] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(Worker pid=677) WARNING 06-12 19:40:45 [symm_mem.py:106] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(Worker pid=677) INFO 06-12 19:40:45 [cuda_communicator.py:232] Using ['CUSTOM', 'PYNCCL'] all-reduce backends (in dispatch order) for group 'tp:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=677) INFO 06-12 19:40:46 [cuda_communicator.py:232] Using ['PYNCCL'] all-reduce backends (in dispatch order) for group 'ep:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=677) INFO 06-12 19:40:46 [parallel_state.py:1735] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=677) INFO 06-12 19:40:46 [topk_topp_sampler.py:70] FlashInfer top-p/top-k sampling disabled via VLLM_USE_FLASHINFER_SAMPLER=0; using PyTorch-native sampler.
(Worker_TP0 pid=677) INFO 06-12 19:40:46 [gpu_model_runner.py:5037] Starting to load model deepseek-ai/deepseek-v4-flash...
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [quant_config.py:73] DeepSeek V4 expert_dtype resolved to 'fp4'
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [__init__.py:533] Selected FlashInferFp8DeepGEMMDynamicBlockScaledKernel for Fp8LinearMethod
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [deep_gemm.py:113] DeepGEMM E8M0 enabled on current platform.
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [attention.py:688] Using DeepSeek's fp8_ds_mla KV cache format.
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [mxfp4.py:647] Using 'MARLIN' Mxfp4 MoE backend.
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [attention.py:797] Using FP8 indexer cache for Lightning Indexer.
(APIServer pid=45) INFO 06-12 19:40:49 inference_api.py:328] Model deepseek-ai/deepseek-v4-flash download complete: 0.01 GiB on disk in 48.2s
(Worker_TP0 pid=677) INFO 06-12 19:41:31 [weight_utils.py:603] Time spent downloading weights for deepseek-ai/deepseek-v4-flash: 41.932006 seconds
(Worker_TP0 pid=677) INFO 06-12 19:41:31 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 148.66 GiB. Available RAM: 609.93 GiB.
(Worker_TP0 pid=677) INFO 06-12 19:41:31 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(Worker_TP0 pid=677) 
Loading safetensors checkpoint shards:   0% Completed | 0/46 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [00:24<00:00,  1.88it/s]
(Worker_TP0 pid=677) 
(Worker_TP0 pid=677) INFO 06-12 19:41:56 [default_loader.py:397] Loading weights took 24.45 seconds
(Worker_TP0 pid=677) INFO 06-12 19:41:56 [mxfp4.py:1695] Using MoEPrepareAndFinalizeNoDPEPModular
(Worker_TP0 pid=677) INFO 06-12 19:42:01 [gpu_model_runner.py:5132] Model loading took 74.08 GiB memory and 73.197066 seconds
(Worker_TP0 pid=677) INFO 06-12 19:42:01 [interface.py:496] Setting kv cache block size to 256 for DEEPSEEK_SPARSE_SWA backend.
(Worker_TP1 pid=678) INFO 06-12 19:42:03 [interface.py:496] Setting kv cache block size to 256 for DEEPSEEK_SPARSE_SWA backend.
(Worker_TP0 pid=677) INFO 06-12 19:42:09 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/cc2572f778/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=677) INFO 06-12 19:42:09 [backends.py:1148] Dynamo bytecode transform time: 5.21 s
(Worker_TP0 pid=677) INFO 06-12 19:42:09 [flashinfer_all_reduce.py:111] Auto-selected flashinfer allreduce backend: trtllm
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] Failed to initialize FlashInfer All Reduce workspace: Ninja build failed. Ninja output:
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] ninja: Entering directory `/root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm'
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] [1/4]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] FAILED: [code=127] /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] [2/4]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] FAILED: [code=127] /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] [3/4]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] FAILED: [code=127] /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] ninja: build stopped: subcommand failed.
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] .
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:156] Failed to initialize FlashInfer Allreduce norm fusion workspace with backend=trtllm
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [allreduce_rms_fusion.py:820] Failed to initialize Flashinfer allreduce workspace. Flashinfer allreduce-norm fusion will be disabled.
(Worker_TP0 pid=677) WARNING 06-12 19:42:11 [allreduce_rms_fusion.py:893] AllReduce fusion pass is disabled.
(Worker_TP0 pid=677) INFO 06-12 19:42:11 [backends.py:378] Cache the graph of compile range (1, 8192) for later use
(Worker_TP0 pid=677) INFO 06-12 19:42:14 [backends.py:393] Compiling a graph for compile range (1, 8192) takes 4.09 s
(Worker_TP0 pid=677) INFO 06-12 19:42:23 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/913996eeb271f2db39df794c131abfe576b317e1205d345660c0e962ea19dd07/rank_0_0/model
(Worker_TP0 pid=677) INFO 06-12 19:42:23 [monitor.py:53] torch.compile took 19.04 s in total
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     output = func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 396, in determine_available_memory
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     self.model_runner.profile_run()
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 6164, in profile_run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                                         ^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5824, in _dummy_run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     outputs = self.model(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]               ^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/models/deepseek_v4/nvidia/model.py", line 1458, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     hidden_states = self.model(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                     ^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 670, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     output = self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/models/deepseek_v4/nvidia/model.py", line 1202, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     def forward(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/caching.py", line 217, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.optimized_call(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "<string>", line 177, in execution_fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 380, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return range_entry.runnable(*args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._compiled_fn(*args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return compiled_fn(full_args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 580, in runtime_wrapper
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     all_outs = call_func_at_runtime_with_args(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     out = normalize_as_list(f(args))
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                             ^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2298, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.compiled_fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return compiled_fn(runtime_args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     outs = compiled_fn(args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.current_callable(inputs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     out = model(new_inputs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]           ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/root/.cache/vllm/torch_compile_cache/torch_aot_compile/913996eeb271f2db39df794c131abfe576b317e1205d345660c0e962ea19dd07/inductor_cache/qc/cqctjpbr4d32ob7d4v44ybigloj3mholu57h4tl6eymi7jus3ntu.py", line 379, in call
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     buf4 = torch.ops.vllm.mhc_pre_tilelang.default(buf3, arg4_1, arg5_1, arg6_1, 1e-06, 1e-06, 1e-06, 2.0, 20, 1, arg3_1, norm_eps=1e-06)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return disable_fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     res = func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]           ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/kernels/mhc/tilelang.py", line 195, in mhc_pre_tilelang
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     tf32_hc_prenorm_gemm(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return _tf32_hc_prenorm_gemm_impl(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] RuntimeError: CUDA driver error (csrc/apis/../jit_kernels/impls/../../jit/handle.hpp:141): 200 (CUDA_ERROR_INVALID_IMAGE, device kernel image is invalid)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     output = func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 396, in determine_available_memory
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     self.model_runner.profile_run()
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 6164, in profile_run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                                         ^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5824, in _dummy_run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     outputs = self.model(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]               ^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/models/deepseek_v4/nvidia/model.py", line 1458, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     hidden_states = self.model(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                     ^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 670, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     output = self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/models/deepseek_v4/nvidia/model.py", line 1202, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     def forward(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/caching.py", line 217, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.optimized_call(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "<string>", line 177, in execution_fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 380, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return range_entry.runnable(*args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._compiled_fn(*args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return compiled_fn(full_args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 580, in runtime_wrapper
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     all_outs = call_func_at_runtime_with_args(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     out = normalize_as_list(f(args))
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                             ^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2298, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.compiled_fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return compiled_fn(runtime_args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     outs = compiled_fn(args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.current_callable(inputs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     out = model(new_inputs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]           ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/root/.cache/vllm/torch_compile_cache/torch_aot_compile/913996eeb271f2db39df794c131abfe576b317e1205d345660c0e962ea19dd07/inductor_cache/qc/cqctjpbr4d32ob7d4v44ybigloj3mholu57h4tl6eymi7jus3ntu.py", line 379, in call
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     buf4 = torch.ops.vllm.mhc_pre_tilelang.default(buf3, arg4_1, arg5_1, arg6_1, 1e-06, 1e-06, 1e-06, 2.0, 20, 1, arg3_1, norm_eps=1e-06)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return disable_fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     res = func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]           ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/kernels/mhc/tilelang.py", line 195, in mhc_pre_tilelang
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     tf32_hc_prenorm_gemm(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return _tf32_hc_prenorm_gemm_impl(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] RuntimeError: CUDA driver error (csrc/apis/../jit_kernels/impls/../../jit/handle.hpp:141): 200 (CUDA_ERROR_INVALID_IMAGE, device kernel image is invalid)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] 
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165] EngineCore failed to start.
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165] Traceback (most recent call last):
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1139, in run_engine_core
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return func(*args, **kwargs)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 905, in __init__
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     super().__init__(
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 131, in __init__
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return func(*args, **kwargs)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 253, in _initialize_kv_caches
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return future if non_block else future.result()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                                     ^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return super().result()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return self.__get_result()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     raise self._exception
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     response = self.aggregate(self.get_response())
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     raise RuntimeError(
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165] RuntimeError: Worker failed with error 'CUDA driver error (csrc/apis/../jit_kernels/impls/../../jit/handle.hpp:141): 200 (CUDA_ERROR_INVALID_IMAGE, device kernel image is invalid)', please check the stack trace above for the root cause
(Worker_TP0 pid=677) WARNING 06-12 19:42:25 [multiproc_executor.py:884] WorkerProc was terminated
(Worker_TP1 pid=678) WARNING 06-12 19:42:26 [multiproc_executor.py:884] WorkerProc was terminated
NVRTC log: "kernel.cu": creating precompiled header file "kernel.pch"
/usr/local/lib/python3.12/site-packages/deep_gemm/include/deep_gemm/impls/sm90_tf32_hc_prenorm_gemm.cuh(120): error: identifier "cudaGridDependencySynchronize" is undefined
      cudaGridDependencySynchronize();
      ^

1 error detected in the compilation of "kernel.cu".

NVRTC log: "kernel.cu": creating precompiled header file "kernel.pch"
/usr/local/lib/python3.12/site-packages/deep_gemm/include/deep_gemm/impls/sm90_tf32_hc_prenorm_gemm.cuh(120): error: identifier "cudaGridDependencySynchronize" is undefined
      cudaGridDependencySynchronize();
      ^

1 error detected in the compilation of "kernel.cu".
```
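
### Possible workaround (sketch)

The NVRTC log above points at the bare `cudaGridDependencySynchronize()` call. One way to avoid depending on NVCC's implicit device-runtime declarations would be to emit the PDL instructions directly as inline PTX. The following is a minimal sketch, not a tested patch: the helper names (`pdl_compat_grid_dependency_sync`, `pdl_compat_trigger_launch_completion`, `PDL_COMPAT_DEVICE`) are hypothetical and not part of DeepGEMM. It assumes that `griddepcontrol.wait` and `griddepcontrol.launch_dependents` (PTX, sm_90 and newer) are what the two runtime intrinsics lower to, and that NVRTC accepts them as inline assembly. It has not been run against this build.

```cpp
// Hypothetical compatibility helpers; names are illustrative, not DeepGEMM API.
// When compiled by NVCC or NVRTC the functions are device functions; with a
// plain host compiler they degrade to empty inline functions so the header
// still parses.
#if defined(__CUDACC__) || defined(__CUDACC_RTC__)
#define PDL_COMPAT_DEVICE __device__ __forceinline__
#else
#define PDL_COMPAT_DEVICE inline
#endif

// Assumed equivalent of cudaGridDependencySynchronize(): wait until the
// primary (upstream) grid has completed or triggered, and its writes are visible.
PDL_COMPAT_DEVICE void pdl_compat_grid_dependency_sync() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    asm volatile("griddepcontrol.wait;" ::: "memory");
#endif
}

// Assumed equivalent of cudaTriggerProgrammaticLaunchCompletion(): signal that
// dependent grids may start launching.
PDL_COMPAT_DEVICE void pdl_compat_trigger_launch_completion() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    asm volatile("griddepcontrol.launch_dependents;");
#endif
}
```

With helpers like these, the call at `sm90_tf32_hc_prenorm_gemm.cuh(120)` would become `pdl_compat_grid_dependency_sync();`, and the same source should then build under both NVCC and NVRTC. Below sm_90 the helpers compile to no-ops, which is only safe if the kernel is never launched with PDL attributes on those architectures.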


