[BUG] NVRTC build fails for sm90_tf32_hc_prenorm_gemm: cudaGridDependencySynchronize undefined (PDL intrinsic not available under NVRTC) #359

@zhehli688

Describe the bug

When the NVRTC JIT backend is enabled (DG_JIT_USE_NVRTC=1), the HyperConnection prenorm GEMM kernel fails to compile because it calls the PDL (Programmatic Dependent Launch) device intrinsic cudaGridDependencySynchronize(), which is not declared in an NVRTC compilation. The kernel calls the intrinsic unconditionally and does not include the device-runtime header, so it compiles only under NVCC, which implicitly injects the CUDA device runtime declarations.

NVRTC compilation aborts with:

deep_gemm/include/deep_gemm/impls/sm90_tf32_hc_prenorm_gemm.cuh(120): error:
  identifier "cudaGridDependencySynchronize" is undefined
1 error detected in the compilation of "kernel.cu".

The failed compilation then surfaces to the caller as an invalid kernel image:

CUDA driver error (csrc/.../jit/handle.hpp:141): 200 (CUDA_ERROR_INVALID_IMAGE,
  device kernel image is invalid)

Offending code

deep_gemm/include/deep_gemm/impls/sm90_tf32_hc_prenorm_gemm.cuh (~line 120):

    // Wait for primary kernel completion
    cudaGridDependencySynchronize();

There is no #if !defined(__CUDACC_RTC__) guard, and the kernel does not #include <cuda_device_runtime_api.h>. PDL device intrinsics (cudaGridDependencySynchronize, cudaTriggerProgrammaticLaunchCompletion) are provided by NVCC's implicit device-runtime injection, which NVRTC does not perform. (The same pattern likely affects sm100_tf32_hc_prenorm_gemm.cuh.)
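
One possible direction for a fix, as a minimal sketch rather than a tested patch: wrap the PDL calls in small helpers that emit the PTX instructions directly, so the kernel no longer depends on declarations that only NVCC injects. This assumes that cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion() correspond to `griddepcontrol.wait` and `griddepcontrol.launch_dependents` on sm_90 and newer, which I have not verified against the toolkit headers in this environment. The names PDL_DEVICE, PDL_HAS_GRIDDEPCONTROL, pdl_grid_dependency_sync and pdl_trigger_launch_completion are hypothetical and do not exist in DeepGEMM.

```cpp
// Hypothetical PDL shim (not part of DeepGEMM). Intended to compile under
// NVCC, NVRTC, and a plain host C++ compiler (where it degrades to no-ops).

#if defined(__CUDACC__) || defined(__CUDACC_RTC__)
#define PDL_DEVICE __device__ __forceinline__
#else
#define PDL_DEVICE inline  // host-only build, used just for syntax checking
#endif

// The PTX path is only taken in device passes that target sm_90 or newer.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
#define PDL_HAS_GRIDDEPCONTROL 1
#else
#define PDL_HAS_GRIDDEPCONTROL 0
#endif

// Assumed equivalent of cudaGridDependencySynchronize(): wait until the
// prerequisite (primary) grid has completed and its writes are visible.
PDL_DEVICE void pdl_grid_dependency_sync() {
#if PDL_HAS_GRIDDEPCONTROL
    asm volatile("griddepcontrol.wait;" ::: "memory");
#endif
}

// Assumed equivalent of cudaTriggerProgrammaticLaunchCompletion(): signal
// that dependent grids may begin launching.
PDL_DEVICE void pdl_trigger_launch_completion() {
#if PDL_HAS_GRIDDEPCONTROL
    asm volatile("griddepcontrol.launch_dependents;" ::: "memory");
#endif
}
```

With a shim like this, the call near line 120 of sm90_tf32_hc_prenorm_gemm.cuh would become pdl_grid_dependency_sync(). Inline PTX avoids both the missing header and a __CUDACC_RTC__ guard that would silently drop the synchronization under NVRTC, but the maintainers may prefer a different approach.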

Environment

  • DeepGEMM: 2.5.0 (891d57b)
  • Compiler path: NVRTC (DG_JIT_USE_NVRTC=1)
  • CUDA Toolkit: 12.9.1
  • GPU: NVIDIA H100 (SM90)
  • PyTorch: 2.11.0 (cu129), Python 3.12
  • Invoked via vLLM 0.22.1 (DeepSeek-V4 FP8)

Stacktrace

(APIServer pid=45) INFO 06-12 19:40:02 [config.py:803] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=45) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(APIServer pid=45) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(APIServer pid=45) INFO 06-12 19:40:20 [model.py:617] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=45) INFO 06-12 19:40:20 [model.py:1752] Using max model len 523264
(APIServer pid=45) INFO 06-12 19:40:21 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=45) INFO 06-12 19:40:21 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=45) INFO 06-12 19:40:21 [vllm.py:977] Asynchronous scheduling is enabled.
(APIServer pid=45) INFO 06-12 19:40:21 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=45) WARNING 06-12 19:40:27 [vllm.py:1396] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implemented the `reasoning_start_str` and `reasoning_end_str`.
(APIServer pid=45) WARNING 06-12 19:40:27 [vllm.py:1471] Turning off hybrid kv cache manager because connector LMCacheConnectorV1 does not subclass `SupportsHMA`. This will reduce performance on models with sliding window or Mamba attention. See kv_connector/v1/base.py for details.
(APIServer pid=45) INFO 06-12 19:40:27 [compilation.py:312] Enabled custom fusions: norm_quant, act_quant, allreduce_rms
(APIServer pid=45) WARNING 06-12 19:40:27 [system_utils.py:157] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=515) INFO 06-12 19:40:36 [core.py:112] Initializing a V1 LLM engine (v0.22.1) with config: model='deepseek-ai/deepseek-v4-flash', speculative_config=None, tokenizer='deepseek-ai/deepseek-v4-flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=523264, download_dir='/workspace/weights', load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-v4-flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 'vllm::gdn_attention_core_xpu', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=515) WARNING 06-12 19:40:36 [multiproc_executor.py:1029] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=515) INFO 06-12 19:40:36 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.3.29 (local), world_size=2, local_world_size=2
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=677) INFO 06-12 19:40:45 [parallel_state.py:1422] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48603 backend=nccl
(Worker pid=678) INFO 06-12 19:40:45 [parallel_state.py:1422] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:48603 backend=nccl
(Worker pid=677) INFO 06-12 19:40:45 [pynccl.py:113] vLLM is using nccl==2.28.9
(Worker pid=678) WARNING 06-12 19:40:45 [symm_mem.py:106] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(Worker pid=677) WARNING 06-12 19:40:45 [symm_mem.py:106] SymmMemCommunicator: symmetric memory multicast operations are not supported.
(Worker pid=677) INFO 06-12 19:40:45 [cuda_communicator.py:232] Using ['CUSTOM', 'PYNCCL'] all-reduce backends (in dispatch order) for group 'tp:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=677) INFO 06-12 19:40:46 [cuda_communicator.py:232] Using ['PYNCCL'] all-reduce backends (in dispatch order) for group 'ep:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=677) INFO 06-12 19:40:46 [parallel_state.py:1735] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=677) INFO 06-12 19:40:46 [topk_topp_sampler.py:70] FlashInfer top-p/top-k sampling disabled via VLLM_USE_FLASHINFER_SAMPLER=0; using PyTorch-native sampler.
(Worker_TP0 pid=677) INFO 06-12 19:40:46 [gpu_model_runner.py:5037] Starting to load model deepseek-ai/deepseek-v4-flash...
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [quant_config.py:73] DeepSeek V4 expert_dtype resolved to 'fp4'
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [__init__.py:533] Selected FlashInferFp8DeepGEMMDynamicBlockScaledKernel for Fp8LinearMethod
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [deep_gemm.py:113] DeepGEMM E8M0 enabled on current platform.
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [attention.py:688] Using DeepSeek's fp8_ds_mla KV cache format.
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [mxfp4.py:647] Using 'MARLIN' Mxfp4 MoE backend.
(Worker_TP0 pid=677) INFO 06-12 19:40:47 [attention.py:797] Using FP8 indexer cache for Lightning Indexer.
(APIServer pid=45) INFO 06-12 19:40:49 inference_api.py:328] Model deepseek-ai/deepseek-v4-flash download complete: 0.01 GiB on disk in 48.2s
(Worker_TP0 pid=677) INFO 06-12 19:41:31 [weight_utils.py:603] Time spent downloading weights for deepseek-ai/deepseek-v4-flash: 41.932006 seconds
(Worker_TP0 pid=677) INFO 06-12 19:41:31 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 148.66 GiB. Available RAM: 609.93 GiB.
(Worker_TP0 pid=677) INFO 06-12 19:41:31 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
(Worker_TP0 pid=677) 
Loading safetensors checkpoint shards: 100% Completed | 46/46 [00:24<00:00,  1.88it/s]
(Worker_TP0 pid=677) 
(Worker_TP0 pid=677) INFO 06-12 19:41:56 [default_loader.py:397] Loading weights took 24.45 seconds
(Worker_TP0 pid=677) INFO 06-12 19:41:56 [mxfp4.py:1695] Using MoEPrepareAndFinalizeNoDPEPModular
(Worker_TP0 pid=677) INFO 06-12 19:42:01 [gpu_model_runner.py:5132] Model loading took 74.08 GiB memory and 73.197066 seconds
(Worker_TP0 pid=677) INFO 06-12 19:42:01 [interface.py:496] Setting kv cache block size to 256 for DEEPSEEK_SPARSE_SWA backend.
(Worker_TP1 pid=678) INFO 06-12 19:42:03 [interface.py:496] Setting kv cache block size to 256 for DEEPSEEK_SPARSE_SWA backend.
(Worker_TP0 pid=677) INFO 06-12 19:42:09 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/cc2572f778/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=677) INFO 06-12 19:42:09 [backends.py:1148] Dynamo bytecode transform time: 5.21 s
(Worker_TP0 pid=677) INFO 06-12 19:42:09 [flashinfer_all_reduce.py:111] Auto-selected flashinfer allreduce backend: trtllm
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] Failed to initialize FlashInfer All Reduce workspace: Ninja build failed. Ninja output:
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] ninja: Entering directory `/root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm'
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] [1/4]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] FAILED: [code=127] /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] [2/4]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] FAILED: [code=127] /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] [3/4]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] FAILED: [code=127] /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74]  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile -MF /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/cub -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/libcudacxx/include -I/usr/local/lib/python3.12/site-packages/flashinfer/data/cccl/thrust -isystem /usr/local/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /usr/local/lib/python3.12/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o /root/.cache/flashinfer/0.6.11.post2/90a/cached_ops/trtllm_comm/csrc_trtllm_moe_allreduce_fusion.cuda.o 
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] ninja: build stopped: subcommand failed.
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:74] .
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [flashinfer_all_reduce.py:156] Failed to initialize FlashInfer Allreduce norm fusion workspace with backend=trtllm
(Worker_TP0 pid=677) WARNING 06-12 19:42:09 [allreduce_rms_fusion.py:820] Failed to initialize Flashinfer allreduce workspace. Flashinfer allreduce-norm fusion will be disabled.
(Worker_TP0 pid=677) WARNING 06-12 19:42:11 [allreduce_rms_fusion.py:893] AllReduce fusion pass is disabled.
(Worker_TP0 pid=677) INFO 06-12 19:42:11 [backends.py:378] Cache the graph of compile range (1, 8192) for later use
(Worker_TP0 pid=677) INFO 06-12 19:42:14 [backends.py:393] Compiling a graph for compile range (1, 8192) takes 4.09 s
(Worker_TP0 pid=677) INFO 06-12 19:42:23 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/913996eeb271f2db39df794c131abfe576b317e1205d345660c0e962ea19dd07/rank_0_0/model
(Worker_TP0 pid=677) INFO 06-12 19:42:23 [monitor.py:53] torch.compile took 19.04 s in total
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] WorkerProc hit an exception.
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] Traceback (most recent call last):
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     output = func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 396, in determine_available_memory
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     self.model_runner.profile_run()
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 6164, in profile_run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                                         ^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5824, in _dummy_run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     outputs = self.model(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]               ^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/models/deepseek_v4/nvidia/model.py", line 1458, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     hidden_states = self.model(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                     ^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 670, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     output = self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/models/deepseek_v4/nvidia/model.py", line 1202, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     def forward(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/caching.py", line 217, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.optimized_call(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "<string>", line 177, in execution_fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.runnable(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 380, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return range_entry.runnable(*args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._compiled_fn(*args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return compiled_fn(full_args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 580, in runtime_wrapper
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     all_outs = call_func_at_runtime_with_args(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     out = normalize_as_list(f(args))
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]                             ^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2298, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.compiled_fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return compiled_fn(runtime_args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     outs = compiled_fn(args)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self.current_callable(inputs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     out = model(new_inputs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]           ^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/root/.cache/vllm/torch_compile_cache/torch_aot_compile/913996eeb271f2db39df794c131abfe576b317e1205d345660c0e962ea19dd07/inductor_cache/qc/cqctjpbr4d32ob7d4v44ybigloj3mholu57h4tl6eymi7jus3ntu.py", line 379, in call
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     buf4 = torch.ops.vllm.mhc_pre_tilelang.default(buf3, arg4_1, arg5_1, arg6_1, 1e-06, 1e-06, 1e-06, 2.0, 20, 1, arg3_1, norm_eps=1e-06)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return disable_fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return fn(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 409, in __torch_dispatch__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     res = func(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]           ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 865, in __call__
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return self._op(*args, **kwargs)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/kernels/mhc/tilelang.py", line 195, in mhc_pre_tilelang
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     tf32_hc_prenorm_gemm(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]   File "/usr/local/lib/python3.12/site-packages/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]     return _tf32_hc_prenorm_gemm_impl(
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] RuntimeError: CUDA driver error (csrc/apis/../jit_kernels/impls/../../jit/handle.hpp:141): 200 (CUDA_ERROR_INVALID_IMAGE, device kernel image is invalid)
(Worker_TP0 pid=677) ERROR 06-12 19:42:25 [multiproc_executor.py:962] 
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165] EngineCore failed to start.
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165] Traceback (most recent call last):
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1139, in run_engine_core
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return func(*args, **kwargs)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 905, in __init__
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     super().__init__(
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 131, in __init__
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return func(*args, **kwargs)
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 253, in _initialize_kv_caches
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return future if non_block else future.result()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                                     ^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return super().result()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     return self.__get_result()
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     raise self._exception
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     response = self.aggregate(self.get_response())
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165]     raise RuntimeError(
(EngineCore pid=515) ERROR 06-12 19:42:25 [core.py:1165] RuntimeError: Worker failed with error 'CUDA driver error (csrc/apis/../jit_kernels/impls/../../jit/handle.hpp:141): 200 (CUDA_ERROR_INVALID_IMAGE, device kernel image is invalid)', please check the stack trace above for the root cause
(Worker_TP0 pid=677) WARNING 06-12 19:42:25 [multiproc_executor.py:884] WorkerProc was terminated
(Worker_TP1 pid=678) WARNING 06-12 19:42:26 [multiproc_executor.py:884] WorkerProc was terminated
NVRTC log: "kernel.cu": creating precompiled header file "kernel.pch"
/usr/local/lib/python3.12/site-packages/deep_gemm/include/deep_gemm/impls/sm90_tf32_hc_prenorm_gemm.cuh(120): error: identifier "cudaGridDependencySynchronize" is undefined
      cudaGridDependencySynchronize();
      ^

1 error detected in the compilation of "kernel.cu".
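
Possible workaround (an untested sketch, not the maintainers' fix): as far as I can tell, the CUDA headers implement `cudaGridDependencySynchronize()` as the PTX instruction `griddepcontrol.wait`, so the kernel could issue that instruction directly and avoid depending on the device-runtime header under NVRTC. The names `DG_HAS_PDL`, `DG_PDL_DEVICE` and `dg_grid_dependency_sync` below are hypothetical and do not exist in DeepGEMM.

```cpp
// Hypothetical helper, not part of DeepGEMM: issue the PDL wait via inline PTX
// so the call site does not need cudaGridDependencySynchronize() to be declared.

// PDL instructions are only available when compiling device code for sm_90+.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
#define DG_HAS_PDL 1
#else
#define DG_HAS_PDL 0
#endif

// Both NVCC and NVRTC define __CUDACC__; a plain host compiler does not,
// in which case the helper degrades to an ordinary inline no-op.
#if defined(__CUDACC__)
#define DG_PDL_DEVICE __device__ __forceinline__
#else
#define DG_PDL_DEVICE inline
#endif

// Wait for the primary kernel in a programmatic dependent launch.
DG_PDL_DEVICE void dg_grid_dependency_sync() {
#if DG_HAS_PDL
    // Assumption: this is the instruction behind cudaGridDependencySynchronize().
    asm volatile("griddepcontrol.wait;" ::: "memory");
#endif
}
```

The call at `sm90_tf32_hc_prenorm_gemm.cuh(120)` would then become `dg_grid_dependency_sync();`. The same approach should apply to `cudaTriggerProgrammaticLaunchCompletion()` with `griddepcontrol.launch_dependents`, but I have not verified either change against an NVRTC build.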
