[FA] reuse max_seq_len for prefill_max_seq_len#286

Open
sxvvv wants to merge 1 commit into MetaX-MACA:master from sxvvv:opt/flash-attn-prefill-max-seq-len-v2

@sxvvv sxvvv commented Jun 13, 2026

Purpose

Follow-up to #260. When building FlashAttention metadata for prefill, the
builder slices common_attn_metadata.seq_lens_cpu[num_decodes:num_reqs] and
takes its .max() to fill prefill_max_seq_len:

prefill_seq_lens_cpu = common_attn_metadata.seq_lens_cpu[num_decodes:num_reqs]
prefill_max_seq_len = int(prefill_seq_lens_cpu.max().item())

prefill_max_seq_len is consumed only as the max_seqlen_k argument of
flash_attn_varlen_func on the prefill-decode split path, i.e. as a launch-time
upper bound. max_seq_len is already computed at the top of build() as the
max over all rows, so it is a valid bound for the prefill rows too. It equals
the prefill-only max whenever the longest row is a prefill row, and is looser
only when a decode row has the longest context. Reusing it lets us drop the
per-build seq_lens_cpu slice entirely:

prefill_max_seq_len = max_seq_len

This also removes the last seq_lens_cpu access in this builder. Upstream has
deprecated that property in favour of using the device seq_lens directly, so
dropping it keeps the backend aligned with upstream.
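
A minimal sketch of the change, using plain Python lists in place of the tensors. It assumes, as the slice above implies, that decode rows occupy the first num_decodes entries of the batch. The function names and the example batch are hypothetical and are not part of the backend.

```python
def prefill_max_before(seq_lens, num_decodes, num_reqs):
    """Old approach: slice out the prefill rows and take their max."""
    prefill_seq_lens = seq_lens[num_decodes:num_reqs]
    return max(prefill_seq_lens)


def prefill_max_after(max_seq_len):
    """New approach: reuse the max already computed over all rows."""
    return max_seq_len


# Hypothetical batch: two decode rows first, then three prefill rows.
seq_lens = [40, 17, 128, 64, 96]
num_decodes, num_reqs = 2, len(seq_lens)
max_seq_len = max(seq_lens)  # stands in for the value computed at the top of build()

old_bound = prefill_max_before(seq_lens, num_decodes, num_reqs)
new_bound = prefill_max_after(max_seq_len)

# The reused value never undershoots the prefill-only max.
assert new_bound >= old_bound
```

In this batch the longest row is a prefill row, so both approaches yield the same value.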

Test Plan

The change is intended to be behaviour-preserving, so the goal is to show the
value handed to the kernel is unchanged in effect:

  • max_seq_len = max(seq_lens) over all requests ≥ max(seq_lens[prefill]).
  • prefill_max_seq_len flows only into max_seqlen_k, which FlashAttention
    treats as an upper bound on the K sequence length.
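
The first point can be spot-checked without a MACA runtime. The following is a hedged sketch of a randomized check on plain Python lists, not a test from the repository; the helper name, the batch sizes, and the length range are all illustrative assumptions.

```python
import random


def bound_holds(seq_lens, num_decodes):
    """True if the max over all rows bounds the max over the prefill rows."""
    prefill = seq_lens[num_decodes:]
    return max(seq_lens) >= max(prefill)


rng = random.Random(0)  # fixed seed so the check is reproducible
results = []
for _ in range(1000):
    num_reqs = rng.randint(1, 16)
    num_prefills = rng.randint(1, num_reqs)  # at least one prefill row
    num_decodes = num_reqs - num_prefills
    seq_lens = [rng.randint(1, 4096) for _ in range(num_reqs)]
    results.append(bound_holds(seq_lens, num_decodes))

all_hold = all(results)

# Edge case: the longest row is a decode row, so the bound holds but is looser.
decode_heavy = [500, 10, 20]
loose_gap = max(decode_heavy) - max(decode_heavy[1:])
```

The edge case at the end shows the only situation in which the reused value differs from the old one: a decode row that is longer than every prefill row.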

Test Result

max_seqlen_k is >= the previous value and remains an upper bound, so kernel
outputs are unchanged. The removed line was the only consumer of
prefill_seq_lens_cpu; prefill_seq_lens (device) is still used to build
cu_prefix_kv_lens. python -m py_compile passes.
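
As a toy illustration of why a looser bound leaves results unchanged, the sketch below models a variable-length operation in which cumulative lengths delimit each row and the max length is only a capacity bound. This is an assumption-laden model in plain Python, not FlashAttention's implementation; toy_varlen_sum and cumulative_lens are hypothetical helpers.

```python
from itertools import accumulate


def cumulative_lens(seq_lens):
    """Cumulative lengths with a leading zero, e.g. [3, 1, 2] -> [0, 3, 4, 6]."""
    return [0] + list(accumulate(seq_lens))


def toy_varlen_sum(values, cu_seqlens, max_seqlen):
    """Sum each packed row; max_seqlen only sizes the padded scratch buffer."""
    out = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        length = end - start
        if length > max_seqlen:
            raise ValueError("max_seqlen is not an upper bound")
        # Pad to the bound, then read back only the valid entries.
        padded = values[start:end] + [0] * (max_seqlen - length)
        out.append(sum(padded[:length]))
    return out


seq_lens = [3, 1, 2]
values = [1, 2, 3, 10, 5, 6]  # rows packed back to back
cu = cumulative_lens(seq_lens)

tight = toy_varlen_sum(values, cu, max_seqlen=3)  # exact max
loose = toy_varlen_sum(values, cu, max_seqlen=8)  # looser upper bound
assert tight == loose
```

In this model the row boundaries come entirely from the cumulative lengths, so any bound at least as large as the longest row gives the same output, while a bound that is too small is rejected.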

I don't have a MACA runtime matching current master to run the full backend
suite end to end, so please review this as a behaviour-preserving
simplification rather than a benchmarked perf change; happy to add numbers if
a maintainer can point me at the right harness.

(Optional) Documentation Update

None.

prefill_max_seq_len only feeds max_seqlen_k, which is a launch-time upper
bound. max_seq_len already bounds every prefill row, so slicing
seq_lens_cpu just to take its max is unnecessary. Reuse the precomputed
scalar and drop the seq_lens_cpu access, which upstream has deprecated in
favor of using device seq_lens directly.

Signed-off-by: sxvvv <58096390+sxvvv@users.noreply.github.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request simplifies the calculation of prefill_max_seq_len in vllm_metax/v1/attention/backends/flash_attn.py by reusing max_seq_len directly instead of slicing and taking the maximum of the deprecated seq_lens_cpu tensor. There are no review comments, so I have no feedback to provide.
