Skip to content

feat: add ViT encoder ACL Graph capture for Qwen3-VL.#1866

Open
sunbaosong wants to merge 1 commit into
jd-opensource:mainfrom
sunbaosong:vit-graph
Open

feat: add ViT encoder ACL Graph capture for Qwen3-VL.#1866
sunbaosong wants to merge 1 commit into
jd-opensource:mainfrom
sunbaosong:vit-graph

Conversation

@sunbaosong

@sunbaosong sunbaosong commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

背景

VLM 推理的首 token 延迟(TTFT)中,vision encoder 的多层前向带来显著的算子下发(op dispatch)开销。本变更将 ViT encoder 循环捕获为 ACL
Graph(NPUGraph),通过一次捕获、多次重放消除逐层 dispatch,降低 TTFT。

需求

为 Qwen3-VL 的 vision encoder 提供 ACL Graph 加速路径,并满足:

  1. 模型无关:graph 机制不耦合具体模型类型,可被其它 VLM 复用
  2. 可配置:开关 + bucket 列表可配
  3. 可降级:捕获失败 / 形状不匹配时自动回退 eager,不影响正确性
  4. 不破坏现有构建:非 NPU 平台不被波及

实现方案

公共 graph 机制(core/runtime)

参考 decoder 侧 acl_graph_executor_impl 的范式,在 xllm::npu 下提供模型无关的 encoder graph 基础设施:

  • 新增抽象接口 VisionEncoderGraphAdapternum_encoder_layers() / deepstack_indexes() / forward_encoder_layer() /
    forward_deepstack_merger()
  • VisionEncoderAclGraphManager 只通过 adapter 捕获与重放,不引用任何具体层类型 —— core/runtime 不再反向依赖 models/
  • Bucket 化懒捕获:首次请求命中某 bucket 时捕获,之后重放;persistent input/output buffer 预分配在 bucket 尺寸,保证 capture/replay
    引用稳定设备地址

模型侧接线(qwen3_vl)

  • Qwen3_VisionTransformerImpl 多继承 VisionEncoderGraphAdapter,实现 4 个方法(直接转发到既有 layers_ /
    deepstack_merger_layers_
  • init_encoder_graph_manager 创建 manager 并 set_adapter(this)
  • forwardcan_replayreplay 走 graph 路径,否则 / 重放失败走原 eager 循环

配置与开关

flag 默认值 说明
--enable_encoder_graph false 是否启用 encoder ACL graph
--encoder_graph_budgets 1024,2048,4096,8192 逗号分隔的 token bucket 列表

配置走 ExecutionConfig 标准 PROPERTY/DEFINE/from_flags/from_json 全套,enable_encoder_graph 同时并入 model_context
tiling-copy-stream 守卫与 npu_base_layer 的 ATB stream 守卫(与既有 enable_graph 同一约束:捕获态不能用默认流 / 不可逆 tiling copy
stream)。

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces VisionEncoderAclGraphManager to support ACL graph execution for the vision encoder in Qwen3-VL, along with corresponding configuration options and integration into the VLM executor and model. The review feedback identifies several critical issues, including a redundant dynamic_cast, potential race conditions from incorrect stream synchronization during graph capture, and graph replay bugs caused by changing tensor memory addresses with set_data. Additionally, the reviewer recommends refactoring the hardcoded deepstack outputs to a dynamic vector and addresses multiple style guide violations regarding braces, final classes, relative includes, auto usage, constant argument annotations, and fixed-width integers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +47 to +49
auto* vlm_model = dynamic_cast<CausalVLM*>(model_);
if (vlm_model) {
vlm_model->init_encoder_graph_manager(args, device);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The member variable model_ is already declared as CausalVLM* in vlm_executor_impl.h. Performing a dynamic_cast<CausalVLM*> on it is redundant and unnecessary. You can directly check if model_ is non-null.

    if (model_) {
      model_->init_encoder_graph_manager(args, device);

Comment on lines +168 to +174
bool need_restore = false;
if (c10_npu::getCurrentNPUStream(device_index_) ==
c10_npu::getDefaultNPUStream(device_index_)) {
c10_npu::setCurrentNPUStream(*capture_stream_);
aclrtSynchronizeStream(capture_stream_->stream());
need_restore = true;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If need_restore is false (meaning the current stream was already a non-default stream), the capture stream is not switched to capture_stream_. However, at the end of capture, aclrtSynchronizeStream(capture_stream_->stream()) is still called. This synchronizes the wrong stream, which can lead to race conditions or undefined behavior. We should record the stream actually used for capture and synchronize it.

Suggested change
bool need_restore = false;
if (c10_npu::getCurrentNPUStream(device_index_) ==
c10_npu::getDefaultNPUStream(device_index_)) {
c10_npu::setCurrentNPUStream(*capture_stream_);
aclrtSynchronizeStream(capture_stream_->stream());
need_restore = true;
}
bool need_restore = false;
if (c10_npu::getCurrentNPUStream(device_index_) ==
c10_npu::getDefaultNPUStream(device_index_)) {
c10_npu::setCurrentNPUStream(*capture_stream_);
aclrtSynchronizeStream(capture_stream_->stream());
need_restore = true;
}
aclrtStream capture_raw_stream = c10_npu::getCurrentNPUStream(device_index_).stream();

}

// Verify with test replay
aclrtSynchronizeStream(capture_stream_->stream());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Synchronize the stream that was actually used for the capture instead of hardcoding capture_stream_->stream(), which might not have been used if need_restore was false.

Suggested change
aclrtSynchronizeStream(capture_stream_->stream());
aclrtSynchronizeStream(capture_raw_stream);

Comment on lines +104 to +107
auto cu_tensor = torch::tensor(
param.cu_seqlens_vec,
torch::TensorOptions().device(device_).dtype(torch::kInt32));
param.cu_seqlens.set_data(cu_tensor);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using param.cu_seqlens.set_data(cu_tensor) changes the underlying memory address of the tensor. Since the ACL/NPU graph captures the physical memory addresses of tensors at capture time, changing the address during replay means the graph will continue to read from the old address (which may contain stale data or have been deallocated), leading to incorrect results or crashes. Instead, you should copy the data into the existing tensor to preserve its memory address when the shapes match.

  auto cu_tensor = torch::tensor(
      param.cu_seqlens_vec,
      torch::TensorOptions().device(device_).dtype(torch::kInt32));
  if (param.cu_seqlens.sizes() == cu_tensor.sizes()) {
    param.cu_seqlens.copy_(cu_tensor);
  } else {
    param.cu_seqlens.set_data(cu_tensor);
  }

Comment on lines +106 to +108
torch::Tensor deepstack_out_0; // [budget/merge_sq, d_model]
torch::Tensor deepstack_out_1; // [budget/merge_sq, d_model]
torch::Tensor deepstack_out_2; // [budget/merge_sq, d_model]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding deepstack_out_0, deepstack_out_1, and deepstack_out_2 makes the class fragile and limited to exactly 3 deepstacks. If a model has a different number of deepstacks, this will waste memory or fail. Refactor this to use a std::vector<torch::Tensor> to support any number of deepstacks dynamically.

Suggested change
torch::Tensor deepstack_out_0; // [budget/merge_sq, d_model]
torch::Tensor deepstack_out_1; // [budget/merge_sq, d_model]
torch::Tensor deepstack_out_2; // [budget/merge_sq, d_model]
std::vector<torch::Tensor> deepstack_outs;

device_index_(device.index()),
dtype_(args.dtype() == "bfloat16" ? torch::kBFloat16 : torch::kFloat16) {
max_budget_ = budgets_.empty() ? 0 : budgets_.back();
capture_stream_.emplace(c10_npu::getStreamFromPool(true, device_index_));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors.

Suggested change
capture_stream_.emplace(c10_npu::getStreamFromPool(true, device_index_));
capture_stream_.emplace(c10_npu::getStreamFromPool(/*isHighPriority=*/true, device_index_));
References
  1. Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors. (link)

{
LOG(INFO) << "[EncoderGraph] Warm-up run with budget=" << budget;
torch::Tensor warm_hidden = param.hidden_states.clone();
for (int i = 0; i < static_cast<int>(layers.size()); ++i) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Use fixed-width integers (int32_t, int64_t) instead of plain int.

Suggested change
for (int i = 0; i < static_cast<int>(layers.size()); ++i) {
for (int32_t i = 0; i < static_cast<int32_t>(layers.size()); ++i) {
References
  1. Use fixed-width integers (int32_t, int64_t) instead of plain int. (link)


// Run 27-layer encoder loop + deepstack mergers
torch::Tensor hidden = param.hidden_states;
for (int i = 0; i < static_cast<int>(layers.size()); ++i) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Use fixed-width integers (int32_t, int64_t) instead of plain int.

Suggested change
for (int i = 0; i < static_cast<int>(layers.size()); ++i) {
for (int32_t i = 0; i < static_cast<int32_t>(layers.size()); ++i) {
References
  1. Use fixed-width integers (int32_t, int64_t) instead of plain int. (link)

? "true"
: "false")
: "N/A");
for (int idx = 0; idx < static_cast<int>(layers_.size()); ++idx) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Use fixed-width integers (int32_t, int64_t) instead of plain int.

Suggested change
for (int idx = 0; idx < static_cast<int>(layers_.size()); ++idx) {
for (int32_t idx = 0; idx < static_cast<int32_t>(layers_.size()); ++idx) {
References
  1. Use fixed-width integers (int32_t, int64_t) instead of plain int. (link)

deepstack_visual_indexes_.end(),
idx);
if (it != deepstack_visual_indexes_.end()) {
int index = std::distance(deepstack_visual_indexes_.begin(), it);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Use fixed-width integers (int32_t, int64_t) instead of plain int, and use static_cast for type conversions.

Suggested change
int index = std::distance(deepstack_visual_indexes_.begin(), it);
int32_t index = static_cast<int32_t>(std::distance(deepstack_visual_indexes_.begin(), it));
References
  1. Use fixed-width integers (int32_t, int64_t) instead of plain int. (link)

@XuZhang99 XuZhang99 changed the title feat: add ViT encoder ACL Graph capture for Qwen3-VL feat: add ViT encoder ACL Graph capture for Qwen3-VL. Jul 1, 2026
Capture the vision encoder loop as an ACL Graph (NPUGraph) to cut operator
dispatch latency and lower TTFT for VLM inference.

The graph manager (core/runtime/vision_encoder_acl_graph_manager) is
model-agnostic: it captures/replays purely through the VisionEncoderGraphAdapter
interface in xllm::npu, so core/runtime never depends on a concrete model type.
Qwen3_VisionTransformerImpl implements the adapter and registers itself via
set_adapter(this).

- Bucket-based lazy capture/replay with persistent input/output buffers
- Stable-storage cu_seqlens: in-place copy_ into a pre-allocated device tensor
  (no set_data storage swap) so the captured graph reads a valid address on
  every replay
- Segment-count guard: a request whose image count differs from the captured
  graph falls back to eager instead of corrupting attention
- capture() wrapped in try/catch with stream restore + eager fallback on failure
- head_dim sourced from args.mm_head_dim() (not derived) to support GQA
- Config: --enable_encoder_graph, --encoder_graph_budgets

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant