[KVCache] Support AttentionStore slice write blocks #7614

Open
jackyYang6 wants to merge 1 commit into PaddlePaddle:develop from jackyYang6:as/as_slice_write

@jackyYang6
Contributor

Motivation

This PR adds slice write support for AttentionStore to avoid writing a large number of KV cache blocks in a single SDK call.

It makes large cache write-back more controllable by splitting writes into multiple slices with configurable per-slice and total timeout limits.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

  • Update fastdeploy/cache_manager/transfer_factory/mooncake_store/attention_store.py write path to split a single large write request into multiple slices.
  • Add configurable write controls:
    • AS_WRITE_TOTAL_TIMEOUT
    • AS_WRITE_SLICE_BLOCK_NUM
    • AS_WRITE_SLICE_TIMEOUT
  • For each slice, compute the corresponding slice_write_block_idx and truncated token_ids, then call AttentionStoreSDK.write(...) incrementally.
  • Add logging for slice begin/end, incomplete writes, and total write summary.
  • Stop subsequent slices when a slice write is incomplete, so prefix cache continuity is preserved.
  • No unit tests are added in this PR because this change is in the AttentionStore SDK integration path and requires an AttentionStore runtime environment for end-to-end validation.
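
The slicing behavior described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `write_fn` is a hypothetical stand-in for `AttentionStoreSDK.write(...)` whose real signature is not shown here, `block_size` (tokens per block) is an assumed parameter, and the rule for truncating `token_ids` is an assumption.

```python
import time
from typing import Callable, Sequence

# Hypothetical stand-in for AttentionStoreSDK.write: receives the truncated
# token_ids, the block ids of one slice, the index of the slice's first block,
# and a timeout in seconds; returns how many blocks were actually written.
WriteFn = Callable[[Sequence[int], Sequence[int], int, float], int]


def write_in_slices(
    write_fn: WriteFn,
    token_ids: Sequence[int],
    gpu_block_ids: Sequence[int],
    block_size: int,
    slice_block_num: int = 500,
    slice_timeout: float = 10.0,
    total_timeout: float = 30.0,
) -> int:
    """Write blocks slice by slice and return the number of blocks written."""
    if slice_block_num <= 0:
        raise ValueError("slice_block_num must be positive")
    total_blocks = len(gpu_block_ids)
    deadline = time.monotonic() + total_timeout
    written = 0
    for start in range(0, total_blocks, slice_block_num):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # total time budget exhausted
        end = min(start + slice_block_num, total_blocks)
        # Assumed truncation rule: keep the tokens that cover blocks [0, end).
        slice_tokens = token_ids[: end * block_size]
        n = write_fn(slice_tokens, gpu_block_ids[start:end], start,
                     min(slice_timeout, remaining))
        written += n
        if n < end - start:
            break  # incomplete slice: stop so the cached prefix stays contiguous
    return written
```

Stopping at the first incomplete slice means the stored blocks always form an unbroken prefix, which is the continuity property the list above refers to.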

Usage or Command

Optional environment variables for AttentionStore slice write:

export AS_WRITE_TOTAL_TIMEOUT=30
export AS_WRITE_SLICE_BLOCK_NUM=500
export AS_WRITE_SLICE_TIMEOUT=10
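
One possible way to read these variables defensively is sketched below. The helper `read_positive` is hypothetical and not part of this PR. The defaults of 500 and 10 match the snippet quoted in the review; using 30 as the default for `AS_WRITE_TOTAL_TIMEOUT` is an assumption taken from the example values above.

```python
import os


def read_positive(name, default, cast=float):
    """Read a positive number from the environment, falling back to default.

    Hypothetical helper: unset, non-numeric, zero, or negative values all
    fall back to the default instead of propagating into the write path.
    """
    raw = os.getenv(name)
    if raw is None:
        return cast(default)
    try:
        value = cast(raw)
    except ValueError:
        return cast(default)
    return value if value > 0 else cast(default)


# Defaults: 500 and 10 follow the reviewed snippet; 30 is an assumed default.
total_timeout = read_positive("AS_WRITE_TOTAL_TIMEOUT", 30.0)
slice_block_num = read_positive("AS_WRITE_SLICE_BLOCK_NUM", 500, int)
slice_timeout = read_positive("AS_WRITE_SLICE_TIMEOUT", 10.0)
```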

Accuracy Tests

N/A. This PR does not affect model forward outputs or kernel numerical behavior. It only changes the AttentionStore KV cache write-back strategy.

Checklist

  • Add at least a tag in the PR title.
    • Suggested title: [KVCache] Support AttentionStore slice write blocks
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag. (N/A for current develop PR)

@paddle-bot

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@jackyYang6 jackyYang6 changed the title [KVCache] Support AttentionStore sclice write blocks [KVCache] Support AttentionStore slice write blocks Apr 24, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 37 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ee81b57). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...transfer_factory/mooncake_store/attention_store.py | 0.00% | 37 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7614   +/-   ##
==========================================
  Coverage           ?   71.66%           
==========================================
  Files              ?      419           
  Lines              ?    57882           
  Branches           ?     9080           
==========================================
  Hits               ?    41479           
  Misses             ?    13577           
  Partials           ?     2826           

| Flag | Coverage Δ |
|---|---|
| GPU | 71.66% <0.00%> (?) |

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-24 19:53:19

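📋 Review Summary

PR overview: Adds slice write support to AttentionStore's KVCache write path, splitting large batches of block writes into multiple slices to make them more controllable.
Scope of changes: cache_manager/transfer_factory/mooncake_store/attention_store.py (write() method rewritten)
Affected area tag: KVCache

Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | attention_store.py:167 | With AS_WRITE_SLICE_BLOCK_NUM=0, range() raises ValueError and the write operation crashes |
| 🟡 Suggestion | attention_store.py:170 | The two concatenated f-strings in the [WRITE BEGIN] log lack a separating space, which hurts log readability |
| 🟡 Suggestion | attention_store.py:233 | The [WRITE END] log level was raised from debug to info, which may sharply increase log volume under high concurrency |
| ❓ Question | attention_store.py:166~168 | os.getenv is read on every write() call; on a high-frequency write path, consider caching the values in __init__ |

Overall assessment

The slice write logic is well designed, and the approach to preserving prefix cache continuity (breaking early when a write is incomplete) is correct. The missing defensive check for slice_block_num=0 must be fixed (P0 crash risk), and it is recommended to move the [WRITE END] log level back to debug to avoid excessive log volume in production.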

slice_block_num = int(os.getenv("AS_WRITE_SLICE_BLOCK_NUM", "500"))
slice_timeout = float(os.getenv("AS_WRITE_SLICE_TIMEOUT", "10"))
logger.debug(
    f"[WRITE BEGIN] task_id: {task_id} token_ids: {token_ids} gpu_block_ids: {gpu_block_ids}"

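🔴 Bug: AS_WRITE_SLICE_BLOCK_NUM=0 causes range() to raise ValueError and crash

range(0, total_blocks, 0) raises ValueError: range() arg 3 must not be zero, which crashes the whole write operation and affects the stability of the prefix cache service. Likewise, a negative value silently skips all writes (when total_blocks > 0, range() yields nothing, so nothing is written).

Suggested guard immediately after reading the value: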

slice_block_num = int(os.getenv("AS_WRITE_SLICE_BLOCK_NUM", "500"))
if slice_block_num <= 0:
    logger.warning(f"AS_WRITE_SLICE_BLOCK_NUM={slice_block_num} is invalid, reset to 500")
    slice_block_num = 500
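
The two failure modes described above can be reproduced in isolation. This is a small standalone sketch; `slice_starts` is a hypothetical helper that only mirrors the `range(0, total_blocks, slice_block_num)` pattern from the review, not a function in the PR.

```python
def slice_starts(total_blocks, slice_block_num):
    """Return the start index of each slice, as the write loop would see them."""
    return list(range(0, total_blocks, slice_block_num))


# A zero step makes range() raise immediately.
try:
    slice_starts(1200, 0)
    zero_step_error = None
except ValueError as exc:
    zero_step_error = str(exc)

# A negative step with a positive stop yields no indices, so nothing is written.
negative_step_starts = slice_starts(1200, -500)

# A valid step covers all blocks in three slices.
valid_starts = slice_starts(1200, 500)
```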

    f"[WRITE BEGIN] task_id: {task_id} token_ids: {token_ids} gpu_block_ids: {gpu_block_ids}"
    f"start_write_block_idx: {start_write_block_idx} timeout: {total_timeout}"
)
total_blocks = len(gpu_block_ids)

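🟡 Suggestion: the two f-strings in the [WRITE BEGIN] log are concatenated without a separating space

The two adjacent f-strings are joined directly, with no space between the trailing {gpu_block_ids} of the first and the leading start_write_block_idx: of the second. The log output will read ...{gpu_block_ids}start_write_block_idx: ..., which is hard to read and inconsistent with the formatting of the other fields.

Suggested fix: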

f"[WRITE BEGIN] task_id: {task_id} token_ids: {token_ids} gpu_block_ids: {gpu_block_ids} "
f"start_write_block_idx: {start_write_block_idx} timeout: {total_timeout}"
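
For reference, Python joins adjacent string literals with nothing in between, so the separator has to be written explicitly. A minimal standalone illustration with made-up values:

```python
gpu_block_ids = [3, 4]
start_write_block_idx = 0

# Adjacent literals are concatenated as-is: no space is inserted.
without_space = (
    f"gpu_block_ids: {gpu_block_ids}"
    f"start_write_block_idx: {start_write_block_idx}"
)

# A trailing space on the first literal keeps the fields separated.
with_space = (
    f"gpu_block_ids: {gpu_block_ids} "
    f"start_write_block_idx: {start_write_block_idx}"
)
```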

)

total_cost = time.time() - overall_start
logger.info(

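🟡 Suggestion: the [WRITE END] log level was raised from debug to info, which may produce a large volume of logs in high-concurrency inference scenarios

The original code logged [WRITE END] with logger.debug; this PR changes it to logger.info, so it is printed on every write operation. In large-scale inference, KVCache writes are frequent, so info-level log volume will grow significantly and affect disk I/O and logging system performance.

Suggest reverting to logger.debug, or using logger.warning only when the total write is incomplete (total_written < total_blocks).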
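
The second option could look roughly like the sketch below. It is only an illustration: `log_write_end` and the logger name are hypothetical, and the message fields are assumed rather than copied from the PR.

```python
import logging

logger = logging.getLogger("attention_store_sketch")  # hypothetical logger name


def log_write_end(task_id, total_written, total_blocks, total_cost):
    """Log the write summary at WARNING only when the write was incomplete."""
    incomplete = total_written < total_blocks
    level = logging.WARNING if incomplete else logging.DEBUG
    # Lazy %-style arguments avoid formatting cost when the level is disabled.
    logger.log(
        level,
        "[WRITE END] task_id: %s written: %d/%d cost: %.3fs",
        task_id, total_written, total_blocks, total_cost,
    )
    return level
```

With this shape, complete writes stay silent at the usual info level while partial writes remain visible in production logs.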


3 participants