[FA] reuse max_seq_len for prefill_max_seq_len#286
Open
sxvvv wants to merge 1 commit into
Open
Conversation
prefill_max_seq_len only feeds max_seqlen_k, which is a launch-time upper bound. max_seq_len already bounds every prefill row, so slicing seq_lens_cpu just to take its max is unnecessary. Reuse the precomputed scalar and drop the seq_lens_cpu access, which upstream has deprecated in favor of using device seq_lens directly. Signed-off-by: sxvvv <58096390+sxvvv@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Code Review
This pull request simplifies the calculation of prefill_max_seq_len in vllm_metax/v1/attention/backends/flash_attn.py by reusing max_seq_len directly instead of slicing and taking the maximum of the deprecated seq_lens_cpu tensor. There are no review comments, so I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Follow-up to #260. When building FlashAttention metadata for prefill, the
builder slices
common_attn_metadata.seq_lens_cpu[num_decodes:num_reqs]andtakes its
.max()to fillprefill_max_seq_len:prefill_max_seq_lenis only consumed as themax_seqlen_kargument offlash_attn_varlen_funcon the prefill-decode split path, i.e. a launch-timeupper bound.
max_seq_lenis already computed at the top ofbuild()as themax over all rows, so it is a valid (and tight, since decode rows are short)
bound for the prefill rows too. Reusing it lets us drop the per-build
seq_lens_cpuslice entirely:This also removes the last
seq_lens_cpuaccess in this builder. Upstream hasdeprecated that property in favour of using the device
seq_lensdirectly, sodropping it keeps the backend aligned with upstream.
Test Plan
The change is intended to be behaviour-preserving, so the goal is to show the
value handed to the kernel is unchanged in effect:
max_seq_len = max(seq_lens)over all requests ≥max(seq_lens[prefill]).prefill_max_seq_lenflows only intomax_seqlen_k, which FlashAttentiontreats as an upper bound on the K sequence length.
Test Result
max_seqlen_kis>=the previous value and remains an upper bound, so kerneloutputs are unchanged. The removed line was the only consumer of
prefill_seq_lens_cpu;prefill_seq_lens(device) is still used to buildcu_prefix_kv_lens.python -m py_compilepasses.I don't have a MACA runtime matching current
masterto run the full backendsuite end to end, so this is reviewed as a behaviour-preserving simplification
rather than a benchmarked perf change; happy to add numbers if a maintainer can
point me at the right harness.
(Optional) Documentation Update
None.