[FA] reuse max_seq_len for prefill_max_seq_len by sxvvv · Pull Request #286 · MetaX-MACA/vLLM-metax

sxvvv · 2026-06-13T02:07:11Z

Purpose

Follow-up to #260. When building FlashAttention metadata for prefill, the
builder slices common_attn_metadata.seq_lens_cpu[num_decodes:num_reqs] and
takes its .max() to fill prefill_max_seq_len:

prefill_seq_lens_cpu = common_attn_metadata.seq_lens_cpu[num_decodes:num_reqs]
prefill_max_seq_len = int(prefill_seq_lens_cpu.max().item())

prefill_max_seq_len is consumed only as the max_seqlen_k argument of
flash_attn_varlen_func on the prefill-decode split path, i.e. as a launch-time
upper bound. max_seq_len is already computed at the top of build() as the
max over all rows, so it is a valid bound for the prefill rows too. It equals
the old value whenever the longest row is a prefill, and is larger (but still
a valid bound) when a decode row holds the longest sequence. Reusing it lets
us drop the per-build seq_lens_cpu slice entirely:

prefill_max_seq_len = max_seq_len

This also removes the last seq_lens_cpu access in this builder. Upstream has
deprecated that property in favour of using the device seq_lens directly, so
dropping it keeps the backend aligned with upstream.
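
As a minimal sketch of the before/after computation, the snippet below uses a plain Python list in place of the seq_lens_cpu tensor. The helper names prefill_max_old and prefill_max_new and the batch values are hypothetical and do not appear in the backend; only the slice bounds and variable names follow the description above.

```python
def prefill_max_old(seq_lens, num_decodes, num_reqs):
    """Previous approach: slice the prefill rows and take their max."""
    return max(seq_lens[num_decodes:num_reqs])


def prefill_max_new(max_seq_len):
    """Proposed approach: reuse the max already computed over all rows."""
    return max_seq_len


# Hypothetical batch with decode rows first: 2 decodes, then 3 prefills.
seq_lens = [900, 40, 128, 512, 64]
num_decodes, num_reqs = 2, len(seq_lens)
max_seq_len = max(seq_lens)  # stands in for the value computed in build()

old = prefill_max_old(seq_lens, num_decodes, num_reqs)
new = prefill_max_new(max_seq_len)
print(old, new)  # 512 900
```

In this example the longest row is a decode row, so the reused value is larger than the sliced maximum while still bounding every prefill row.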

Test Plan

The change is intended to be behaviour-preserving, so the goal is to show the
value handed to the kernel is unchanged in effect:

max_seq_len = max(seq_lens) over all requests ≥ max(seq_lens[prefill]).
prefill_max_seq_len flows only into max_seqlen_k, which FlashAttention
treats as an upper bound on the K sequence length.
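
The inequality can also be spot-checked over random batches. This is only an illustrative sketch using Python lists; the batch sizes, length range, and seed are arbitrary assumptions, and it does not exercise the real builder.

```python
import random

rng = random.Random(0)  # fixed seed so the check is repeatable
for _ in range(1000):
    num_reqs = rng.randint(1, 16)
    num_decodes = rng.randint(0, num_reqs - 1)  # keep at least one prefill row
    seq_lens = [rng.randint(1, 4096) for _ in range(num_reqs)]

    max_seq_len = max(seq_lens)                        # max over all rows
    prefill_max = max(seq_lens[num_decodes:num_reqs])  # max over prefill rows
    assert max_seq_len >= prefill_max

checked = True
```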

Test Result

max_seqlen_k is >= the previous value and remains an upper bound, so kernel
outputs are unchanged. The removed line was the only consumer of
prefill_seq_lens_cpu; prefill_seq_lens (device) is still used to build
cu_prefix_kv_lens. python -m py_compile passes.
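
To illustrate why a looser upper bound need not change results, here is a toy single-query attention in pure Python. It is not how FlashAttention is implemented; toy_attention_row is a hypothetical helper in which max_seqlen_k only sizes a padded buffer, while the true key length decides which positions are masked.

```python
import math


def toy_attention_row(q, keys, values, max_seqlen_k):
    """Single-query softmax attention over a padded K/V buffer.

    max_seqlen_k only sizes the padded buffer; positions at or beyond
    len(keys) are masked out, so they contribute nothing to the result.
    """
    seqlen_k = len(keys)
    if max_seqlen_k < seqlen_k:
        raise ValueError("max_seqlen_k must be an upper bound on seqlen_k")
    dim = len(q)
    pad = max_seqlen_k - seqlen_k
    k_buf = keys + [[0.0] * dim] * pad
    v_buf = values + [[0.0] * len(values[0])] * pad
    scale = 1.0 / math.sqrt(dim)
    # Padded positions get a score of -inf, i.e. zero softmax weight.
    scores = [
        scale * sum(a * b for a, b in zip(q, k)) if i < seqlen_k else -math.inf
        for i, k in enumerate(k_buf)
    ]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, v_buf)) / total
        for j in range(len(values[0]))
    ]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]

tight = toy_attention_row(q, keys, values, max_seqlen_k=3)
loose = toy_attention_row(q, keys, values, max_seqlen_k=8)
same = all(math.isclose(a, b) for a, b in zip(tight, loose))
```

Under these assumptions, tight and loose agree, which mirrors the argument that any valid upper bound yields the same output.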

I don't have a MACA runtime matching current master to run the full backend
suite end to end, so this should be reviewed as a behaviour-preserving
simplification rather than a benchmarked perf change; happy to add numbers if
a maintainer can point me at the right harness.

(Optional) Documentation Update

None.

prefill_max_seq_len only feeds max_seqlen_k, which is a launch-time upper bound. max_seq_len already bounds every prefill row, so slicing seq_lens_cpu just to take its max is unnecessary. Reuse the precomputed scalar and drop the seq_lens_cpu access, which upstream has deprecated in favor of using device seq_lens directly.

Signed-off-by: sxvvv <58096390+sxvvv@users.noreply.github.com>

gemini-code-assist

Code Review

This pull request simplifies the calculation of prefill_max_seq_len in vllm_metax/v1/attention/backends/flash_attn.py by reusing max_seq_len directly instead of slicing and taking the maximum of the deprecated seq_lens_cpu tensor. There are no review comments, so I have no feedback to provide.


gemini-code-assist Bot reviewed Jun 13, 2026

sxvvv wants to merge 1 commit into MetaX-MACA:master from sxvvv:opt/flash-attn-prefill-max-seq-len-v2
