feat: add ViT encoder ACL Graph capture for Qwen3-VL. by sunbaosong · Pull Request #1866 · jd-opensource/xllm

sunbaosong · 2026-07-01T08:30:09Z

背景

VLM 推理的首 token 延迟（TTFT）中，vision encoder 的多层前向带来显著的算子下发（op dispatch）开销。本变更将 ViT encoder 循环捕获为 ACL
Graph（NPUGraph），通过一次捕获、多次重放消除逐层 dispatch，降低 TTFT。

需求

为 Qwen3-VL 的 vision encoder 提供 ACL Graph 加速路径，并满足：

模型无关：graph 机制不耦合具体模型类型，可被其它 VLM 复用
可配置：开关 + bucket 列表可配
可降级：捕获失败 / 形状不匹配时自动回退 eager，不影响正确性
不破坏现有构建：非 NPU 平台不被波及

实现方案

公共 graph 机制（core/runtime）

参考 decoder 侧 acl_graph_executor_impl 的范式，在 xllm::npu 下提供模型无关的 encoder graph 基础设施：

新增抽象接口 VisionEncoderGraphAdapter：num_encoder_layers() / deepstack_indexes() / forward_encoder_layer() /
forward_deepstack_merger()
VisionEncoderAclGraphManager 只通过 adapter 捕获与重放，不引用任何具体层类型 —— core/runtime 不再反向依赖 models/
Bucket 化懒捕获：首次请求命中某 bucket 时捕获，之后重放；persistent input/output buffer 预分配在 bucket 尺寸，保证 capture/replay
引用稳定设备地址

模型侧接线（qwen3_vl）

Qwen3_VisionTransformerImpl 多继承 VisionEncoderGraphAdapter，实现 4 个方法（直接转发到既有 layers_ /
deepstack_merger_layers_）
init_encoder_graph_manager 创建 manager 并 set_adapter(this)
forward 中 can_replay → replay 走 graph 路径，否则 / 重放失败走原 eager 循环

配置与开关

flag	默认值	说明
`--enable_encoder_graph`	`false`	是否启用 encoder ACL graph
`--encoder_graph_budgets`	`1024,2048,4096,8192`	逗号分隔的 token bucket 列表

配置走 ExecutionConfig 标准 PROPERTY/DEFINE/from_flags/from_json 全套，enable_encoder_graph 同时并入 model_context 的
tiling-copy-stream 守卫与 npu_base_layer 的 ATB stream 守卫（与既有 enable_graph 同一约束：捕获态不能用默认流 / 不可逆 tiling copy
stream）。

gemini-code-assist

Code Review

This pull request introduces VisionEncoderAclGraphManager to support ACL graph execution for the vision encoder in Qwen3-VL, along with corresponding configuration options and integration into the VLM executor and model. The review feedback identifies several critical issues, including a redundant dynamic_cast, potential race conditions from incorrect stream synchronization during graph capture, and graph replay bugs caused by changing tensor memory addresses with set_data. Additionally, the reviewer recommends refactoring the hardcoded deepstack outputs to a dynamic vector and addresses multiple style guide violations regarding braces, final classes, relative includes, auto usage, constant argument annotations, and fixed-width integers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-07-01T08:35:33Z

+    auto* vlm_model = dynamic_cast<CausalVLM*>(model_);
+    if (vlm_model) {
+      vlm_model->init_encoder_graph_manager(args, device);


The member variable model_ is already declared as CausalVLM* in vlm_executor_impl.h. Performing a dynamic_cast<CausalVLM*> on it is redundant and unnecessary. You can directly check if model_ is non-null.

if (model_) { model_->init_encoder_graph_manager(args, device);

gemini-code-assist · 2026-07-01T08:35:33Z

+  bool need_restore = false;
+  if (c10_npu::getCurrentNPUStream(device_index_) ==
+      c10_npu::getDefaultNPUStream(device_index_)) {
+    c10_npu::setCurrentNPUStream(*capture_stream_);
+    aclrtSynchronizeStream(capture_stream_->stream());
+    need_restore = true;
+  }


If need_restore is false (meaning the current stream was already a non-default stream), the capture stream is not switched to capture_stream_. However, at the end of capture, aclrtSynchronizeStream(capture_stream_->stream()) is still called. This synchronizes the wrong stream, which can lead to race conditions or undefined behavior. We should record the stream actually used for capture and synchronize it.

Suggested change

bool need_restore = false;

if (c10_npu::getCurrentNPUStream(device_index_) ==

c10_npu::getDefaultNPUStream(device_index_)) {

c10_npu::setCurrentNPUStream(*capture_stream_);

aclrtSynchronizeStream(capture_stream_->stream());

need_restore = true;

}

bool need_restore = false;

if (c10_npu::getCurrentNPUStream(device_index_) ==

c10_npu::getDefaultNPUStream(device_index_)) {

c10_npu::setCurrentNPUStream(*capture_stream_);

aclrtSynchronizeStream(capture_stream_->stream());

need_restore = true;

}

aclrtStream capture_raw_stream = c10_npu::getCurrentNPUStream(device_index_).stream();

gemini-code-assist · 2026-07-01T08:35:34Z

+  }
+
+  // Verify with test replay
+  aclrtSynchronizeStream(capture_stream_->stream());


Synchronize the stream that was actually used for the capture instead of hardcoding capture_stream_->stream(), which might not have been used if need_restore was false.

Suggested change

aclrtSynchronizeStream(capture_stream_->stream());

aclrtSynchronizeStream(capture_raw_stream);

gemini-code-assist · 2026-07-01T08:35:34Z

+  auto cu_tensor = torch::tensor(
+      param.cu_seqlens_vec,
+      torch::TensorOptions().device(device_).dtype(torch::kInt32));
+  param.cu_seqlens.set_data(cu_tensor);


Using param.cu_seqlens.set_data(cu_tensor) changes the underlying memory address of the tensor. Since the ACL/NPU graph captures the physical memory addresses of tensors at capture time, changing the address during replay means the graph will continue to read from the old address (which may contain stale data or have been deallocated), leading to incorrect results or crashes. Instead, you should copy the data into the existing tensor to preserve its memory address when the shapes match.

auto cu_tensor = torch::tensor( param.cu_seqlens_vec, torch::TensorOptions().device(device_).dtype(torch::kInt32)); if (param.cu_seqlens.sizes() == cu_tensor.sizes()) { param.cu_seqlens.copy_(cu_tensor); } else { param.cu_seqlens.set_data(cu_tensor); }

gemini-code-assist · 2026-07-01T08:35:34Z

+    torch::Tensor deepstack_out_0;  // [budget/merge_sq, d_model]
+    torch::Tensor deepstack_out_1;  // [budget/merge_sq, d_model]
+    torch::Tensor deepstack_out_2;  // [budget/merge_sq, d_model]


Hardcoding deepstack_out_0, deepstack_out_1, and deepstack_out_2 makes the class fragile and limited to exactly 3 deepstacks. If a model has a different number of deepstacks, this will waste memory or fail. Refactor this to use a std::vector<torch::Tensor> to support any number of deepstacks dynamically.

Suggested change

torch::Tensor deepstack_out_0; // [budget/merge_sq, d_model]

torch::Tensor deepstack_out_1; // [budget/merge_sq, d_model]

torch::Tensor deepstack_out_2; // [budget/merge_sq, d_model]

std::vector<torch::Tensor> deepstack_outs;

gemini-code-assist · 2026-07-01T08:35:34Z

+      device_index_(device.index()),
+      dtype_(args.dtype() == "bfloat16" ? torch::kBFloat16 : torch::kFloat16) {
+  max_budget_ = budgets_.empty() ? 0 : budgets_.back();
+  capture_stream_.emplace(c10_npu::getStreamFromPool(true, device_index_));


Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors.

Suggested change

capture_stream_.emplace(c10_npu::getStreamFromPool(true, device_index_));

capture_stream_.emplace(c10_npu::getStreamFromPool(/*isHighPriority=*/true, device_index_));

References

Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors. ^(link)

gemini-code-assist · 2026-07-01T08:35:34Z

+  {
+    LOG(INFO) << "[EncoderGraph] Warm-up run with budget=" << budget;
+    torch::Tensor warm_hidden = param.hidden_states.clone();
+    for (int i = 0; i < static_cast<int>(layers.size()); ++i) {


Use fixed-width integers (int32_t, int64_t) instead of plain int.

Suggested change

for (int i = 0; i < static_cast<int>(layers.size()); ++i) {

for (int32_t i = 0; i < static_cast<int32_t>(layers.size()); ++i) {

References

Use fixed-width integers (int32_t, int64_t) instead of plain int. ^(link)

gemini-code-assist · 2026-07-01T08:35:34Z

+
+  // Run 27-layer encoder loop + deepstack mergers
+  torch::Tensor hidden = param.hidden_states;
+  for (int i = 0; i < static_cast<int>(layers.size()); ++i) {


Use fixed-width integers (int32_t, int64_t) instead of plain int.

Suggested change

for (int i = 0; i < static_cast<int>(layers.size()); ++i) {

for (int32_t i = 0; i < static_cast<int32_t>(layers.size()); ++i) {

References

Use fixed-width integers (int32_t, int64_t) instead of plain int. ^(link)

gemini-code-assist · 2026-07-01T08:35:34Z

+                               ? "true"
+                               : "false")
+                        : "N/A");
+      for (int idx = 0; idx < static_cast<int>(layers_.size()); ++idx) {


Use fixed-width integers (int32_t, int64_t) instead of plain int.

Suggested change

for (int idx = 0; idx < static_cast<int>(layers_.size()); ++idx) {

for (int32_t idx = 0; idx < static_cast<int32_t>(layers_.size()); ++idx) {

References

Use fixed-width integers (int32_t, int64_t) instead of plain int. ^(link)

gemini-code-assist · 2026-07-01T08:35:35Z

+                            deepstack_visual_indexes_.end(),
+                            idx);
+        if (it != deepstack_visual_indexes_.end()) {
+          int index = std::distance(deepstack_visual_indexes_.begin(), it);


Use fixed-width integers (int32_t, int64_t) instead of plain int, and use static_cast for type conversions.

Suggested change

int index = std::distance(deepstack_visual_indexes_.begin(), it);

int32_t index = static_cast<int32_t>(std::distance(deepstack_visual_indexes_.begin(), it));

References

Use fixed-width integers (int32_t, int64_t) instead of plain int. ^(link)

Capture the vision encoder loop as an ACL Graph (NPUGraph) to cut operator dispatch latency and lower TTFT for VLM inference. The graph manager (core/runtime/vision_encoder_acl_graph_manager) is model-agnostic: it captures/replays purely through the VisionEncoderGraphAdapter interface in xllm::npu, so core/runtime never depends on a concrete model type. Qwen3_VisionTransformerImpl implements the adapter and registers itself via set_adapter(this). - Bucket-based lazy capture/replay with persistent input/output buffers - Stable-storage cu_seqlens: in-place copy_ into a pre-allocated device tensor (no set_data storage swap) so the captured graph reads a valid address on every replay - Segment-count guard: a request whose image count differs from the captured graph falls back to eager instead of corrupting attention - capture() wrapped in try/catch with stream restore + eager fallback on failure - head_dim sourced from args.mm_head_dim() (not derived) to support GQA - Config: --enable_encoder_graph, --encoder_graph_budgets Co-Authored-By: Claude <noreply@anthropic.com>

sunbaosong requested review from Clement-Wang26, DongheJin, DragonFive, JimHsiung, Kang-Meng, RobbieLeung, XuZhang99, liujinguang0125, liutongxuan, walsonyang, xiao-yu-chen, yingxudeng, yq33victor and zhang-minchao as code owners July 1, 2026 08:30

gemini-code-assist Bot reviewed Jul 1, 2026

View reviewed changes

XuZhang99 changed the title ~~feat: add ViT encoder ACL Graph capture for Qwen3-VL~~ feat: add ViT encoder ACL Graph capture for Qwen3-VL. Jul 1, 2026

sunbaosong force-pushed the vit-graph branch from 98ec508 to a5a8da4 Compare July 1, 2026 09:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add ViT encoder ACL Graph capture for Qwen3-VL.#1866

feat: add ViT encoder ACL Graph capture for Qwen3-VL.#1866
sunbaosong wants to merge 1 commit into
jd-opensource:mainfrom
sunbaosong:vit-graph

sunbaosong commented Jul 1, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	aclrtSynchronizeStream(capture_stream_->stream());
	aclrtSynchronizeStream(capture_raw_stream);

	capture_stream_.emplace(c10_npu::getStreamFromPool(true, device_index_));
	capture_stream_.emplace(c10_npu::getStreamFromPool(/isHighPriority=/true, device_index_));

	for (int i = 0; i < static_cast<int>(layers.size()); ++i) {
	for (int32_t i = 0; i < static_cast<int32_t>(layers.size()); ++i) {

	for (int idx = 0; idx < static_cast<int>(layers_.size()); ++idx) {
	for (int32_t idx = 0; idx < static_cast<int32_t>(layers_.size()); ++idx) {

	int index = std::distance(deepstack_visual_indexes_.begin(), it);
	int32_t index = static_cast<int32_t>(std::distance(deepstack_visual_indexes_.begin(), it));

Uh oh!

Conversation

sunbaosong commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

背景

需求

实现方案

公共 graph 机制（core/runtime）

模型侧接线（qwen3_vl）

配置与开关

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sunbaosong commented Jul 1, 2026 •

edited

Loading