feat: add ViT encoder ACL Graph capture for Qwen3-VL.#1866
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces VisionEncoderAclGraphManager to support ACL graph execution for the vision encoder in Qwen3-VL, along with corresponding configuration options and integration into the VLM executor and model. The review feedback identifies several critical issues, including a redundant dynamic_cast, potential race conditions from incorrect stream synchronization during graph capture, and graph replay bugs caused by changing tensor memory addresses with set_data. Additionally, the reviewer recommends refactoring the hardcoded deepstack outputs to a dynamic vector and addresses multiple style guide violations regarding braces, final classes, relative includes, auto usage, constant argument annotations, and fixed-width integers.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| auto* vlm_model = dynamic_cast<CausalVLM*>(model_); | ||
| if (vlm_model) { | ||
| vlm_model->init_encoder_graph_manager(args, device); |
There was a problem hiding this comment.
| bool need_restore = false; | ||
| if (c10_npu::getCurrentNPUStream(device_index_) == | ||
| c10_npu::getDefaultNPUStream(device_index_)) { | ||
| c10_npu::setCurrentNPUStream(*capture_stream_); | ||
| aclrtSynchronizeStream(capture_stream_->stream()); | ||
| need_restore = true; | ||
| } |
There was a problem hiding this comment.
If need_restore is false (meaning the current stream was already a non-default stream), the capture stream is not switched to capture_stream_. However, at the end of capture, aclrtSynchronizeStream(capture_stream_->stream()) is still called. This synchronizes the wrong stream, which can lead to race conditions or undefined behavior. We should record the stream actually used for capture and synchronize it.
| bool need_restore = false; | |
| if (c10_npu::getCurrentNPUStream(device_index_) == | |
| c10_npu::getDefaultNPUStream(device_index_)) { | |
| c10_npu::setCurrentNPUStream(*capture_stream_); | |
| aclrtSynchronizeStream(capture_stream_->stream()); | |
| need_restore = true; | |
| } | |
| bool need_restore = false; | |
| if (c10_npu::getCurrentNPUStream(device_index_) == | |
| c10_npu::getDefaultNPUStream(device_index_)) { | |
| c10_npu::setCurrentNPUStream(*capture_stream_); | |
| aclrtSynchronizeStream(capture_stream_->stream()); | |
| need_restore = true; | |
| } | |
| aclrtStream capture_raw_stream = c10_npu::getCurrentNPUStream(device_index_).stream(); |
| } | ||
|
|
||
| // Verify with test replay | ||
| aclrtSynchronizeStream(capture_stream_->stream()); |
There was a problem hiding this comment.
| auto cu_tensor = torch::tensor( | ||
| param.cu_seqlens_vec, | ||
| torch::TensorOptions().device(device_).dtype(torch::kInt32)); | ||
| param.cu_seqlens.set_data(cu_tensor); |
There was a problem hiding this comment.
Using param.cu_seqlens.set_data(cu_tensor) changes the underlying memory address of the tensor. Since the ACL/NPU graph captures the physical memory addresses of tensors at capture time, changing the address during replay means the graph will continue to read from the old address (which may contain stale data or have been deallocated), leading to incorrect results or crashes. Instead, you should copy the data into the existing tensor to preserve its memory address when the shapes match.
auto cu_tensor = torch::tensor(
param.cu_seqlens_vec,
torch::TensorOptions().device(device_).dtype(torch::kInt32));
if (param.cu_seqlens.sizes() == cu_tensor.sizes()) {
param.cu_seqlens.copy_(cu_tensor);
} else {
param.cu_seqlens.set_data(cu_tensor);
}| torch::Tensor deepstack_out_0; // [budget/merge_sq, d_model] | ||
| torch::Tensor deepstack_out_1; // [budget/merge_sq, d_model] | ||
| torch::Tensor deepstack_out_2; // [budget/merge_sq, d_model] |
There was a problem hiding this comment.
Hardcoding deepstack_out_0, deepstack_out_1, and deepstack_out_2 makes the class fragile and limited to exactly 3 deepstacks. If a model has a different number of deepstacks, this will waste memory or fail. Refactor this to use a std::vector<torch::Tensor> to support any number of deepstacks dynamically.
| torch::Tensor deepstack_out_0; // [budget/merge_sq, d_model] | |
| torch::Tensor deepstack_out_1; // [budget/merge_sq, d_model] | |
| torch::Tensor deepstack_out_2; // [budget/merge_sq, d_model] | |
| std::vector<torch::Tensor> deepstack_outs; |
| device_index_(device.index()), | ||
| dtype_(args.dtype() == "bfloat16" ? torch::kBFloat16 : torch::kFloat16) { | ||
| max_budget_ = budgets_.empty() ? 0 : budgets_.back(); | ||
| capture_stream_.emplace(c10_npu::getStreamFromPool(true, device_index_)); |
There was a problem hiding this comment.
Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors.
| capture_stream_.emplace(c10_npu::getStreamFromPool(true, device_index_)); | |
| capture_stream_.emplace(c10_npu::getStreamFromPool(/*isHighPriority=*/true, device_index_)); |
References
- Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors. (link)
| { | ||
| LOG(INFO) << "[EncoderGraph] Warm-up run with budget=" << budget; | ||
| torch::Tensor warm_hidden = param.hidden_states.clone(); | ||
| for (int i = 0; i < static_cast<int>(layers.size()); ++i) { |
There was a problem hiding this comment.
Use fixed-width integers (int32_t, int64_t) instead of plain int.
| for (int i = 0; i < static_cast<int>(layers.size()); ++i) { | |
| for (int32_t i = 0; i < static_cast<int32_t>(layers.size()); ++i) { |
References
- Use fixed-width integers (
int32_t,int64_t) instead of plainint. (link)
|
|
||
| // Run 27-layer encoder loop + deepstack mergers | ||
| torch::Tensor hidden = param.hidden_states; | ||
| for (int i = 0; i < static_cast<int>(layers.size()); ++i) { |
There was a problem hiding this comment.
Use fixed-width integers (int32_t, int64_t) instead of plain int.
| for (int i = 0; i < static_cast<int>(layers.size()); ++i) { | |
| for (int32_t i = 0; i < static_cast<int32_t>(layers.size()); ++i) { |
References
- Use fixed-width integers (
int32_t,int64_t) instead of plainint. (link)
| ? "true" | ||
| : "false") | ||
| : "N/A"); | ||
| for (int idx = 0; idx < static_cast<int>(layers_.size()); ++idx) { |
There was a problem hiding this comment.
Use fixed-width integers (int32_t, int64_t) instead of plain int.
| for (int idx = 0; idx < static_cast<int>(layers_.size()); ++idx) { | |
| for (int32_t idx = 0; idx < static_cast<int32_t>(layers_.size()); ++idx) { |
References
- Use fixed-width integers (
int32_t,int64_t) instead of plainint. (link)
| deepstack_visual_indexes_.end(), | ||
| idx); | ||
| if (it != deepstack_visual_indexes_.end()) { | ||
| int index = std::distance(deepstack_visual_indexes_.begin(), it); |
There was a problem hiding this comment.
Use fixed-width integers (int32_t, int64_t) instead of plain int, and use static_cast for type conversions.
| int index = std::distance(deepstack_visual_indexes_.begin(), it); | |
| int32_t index = static_cast<int32_t>(std::distance(deepstack_visual_indexes_.begin(), it)); |
References
- Use fixed-width integers (
int32_t,int64_t) instead of plainint. (link)
Capture the vision encoder loop as an ACL Graph (NPUGraph) to cut operator dispatch latency and lower TTFT for VLM inference. The graph manager (core/runtime/vision_encoder_acl_graph_manager) is model-agnostic: it captures/replays purely through the VisionEncoderGraphAdapter interface in xllm::npu, so core/runtime never depends on a concrete model type. Qwen3_VisionTransformerImpl implements the adapter and registers itself via set_adapter(this). - Bucket-based lazy capture/replay with persistent input/output buffers - Stable-storage cu_seqlens: in-place copy_ into a pre-allocated device tensor (no set_data storage swap) so the captured graph reads a valid address on every replay - Segment-count guard: a request whose image count differs from the captured graph falls back to eager instead of corrupting attention - capture() wrapped in try/catch with stream restore + eager fallback on failure - head_dim sourced from args.mm_head_dim() (not derived) to support GQA - Config: --enable_encoder_graph, --encoder_graph_budgets Co-Authored-By: Claude <noreply@anthropic.com>
背景
VLM 推理的首 token 延迟(TTFT)中,vision encoder 的多层前向带来显著的算子下发(op dispatch)开销。本变更将 ViT encoder 循环捕获为 ACL
Graph(NPUGraph),通过一次捕获、多次重放消除逐层 dispatch,降低 TTFT。
需求
为 Qwen3-VL 的 vision encoder 提供 ACL Graph 加速路径,并满足:
实现方案
公共 graph 机制(core/runtime)
参考 decoder 侧
acl_graph_executor_impl的范式,在xllm::npu下提供模型无关的 encoder graph 基础设施:VisionEncoderGraphAdapter:num_encoder_layers()/deepstack_indexes()/forward_encoder_layer()/forward_deepstack_merger()VisionEncoderAclGraphManager只通过 adapter 捕获与重放,不引用任何具体层类型 ——core/runtime不再反向依赖models/引用稳定设备地址
模型侧接线(qwen3_vl)
Qwen3_VisionTransformerImpl多继承VisionEncoderGraphAdapter,实现 4 个方法(直接转发到既有layers_/deepstack_merger_layers_)init_encoder_graph_manager创建 manager 并set_adapter(this)forward中can_replay→replay走 graph 路径,否则 / 重放失败走原 eager 循环配置与开关
--enable_encoder_graphfalse--encoder_graph_budgets1024,2048,4096,8192配置走
ExecutionConfig标准 PROPERTY/DEFINE/from_flags/from_json 全套,enable_encoder_graph同时并入model_context的tiling-copy-stream 守卫与
npu_base_layer的 ATB stream 守卫(与既有enable_graph同一约束:捕获态不能用默认流 / 不可逆 tiling copystream)。