[KVCache] Support only flush FD GPU/CPU Cache index by AttentionStore#7609

Open
jackyYang6 wants to merge 1 commit into PaddlePaddle:develop from jackyYang6:kvcache/as_only_flush

Conversation

@jackyYang6
Contributor

Motivation

This PR improves the FD_AS_ONLY_FLUSH flow for AttentionStore so FastDeploy can flush KV cache index state for both request-finish and CPU-eviction scenarios.

It adds the required flush metadata to support more accurate AttentionStore index updates.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

  • Extend WriteStorageTask with:
    • flush_cache_exists to indicate whether cache still exists on the current node in FD_AS_ONLY_FLUSH mode.
    • start_write_block_idx to support partial flush/write from a specified block index.
  • Update cache_transfer_manager AS-only flush path to call AttentionStore.flush_token_index(...) with both start_write_block_idx and reside_in_gpu.
  • Update prefix_cache_manager.free_cpu_block_ids(...) to emit flush-only tasks when CPU cache blocks are evicted in FD_AS_ONLY_FLUSH + attention_store mode.
  • Add FD_AS_ONLY_FLUSH environment variable entry in fastdeploy/envs.py.
  • Add unit test coverage for CPU eviction flush behavior, including:
    • flush_cache_exists=False
    • empty gpu_block_ids in flush-only mode
    • correct start_write_block_idx=depth-1
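The extended task structure described above can be sketched as a dataclass. This is a hypothetical illustration based on the PR description, not the actual FastDeploy definition; field defaults and any fields beyond those listed are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WriteStorageTask:
    task_id: str
    keys: List[int]
    token_ids: List[int]
    gpu_block_ids: List[int] = field(default_factory=list)
    # New in this PR: whether the cache still exists on the current node in
    # FD_AS_ONLY_FLUSH mode. None means "not applicable" (e.g. the
    # request-finish path), per the review discussion below.
    flush_cache_exists: Optional[bool] = None
    # New in this PR: the first block index to flush/write, enabling partial
    # flushes instead of always starting from block 0.
    start_write_block_idx: int = 0
```

With these defaults, existing call sites that do not set the new fields keep their previous behavior, which matches the backward-compatibility note in the review below.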

Usage or Command

For FD_AS_ONLY_FLUSH mode with AttentionStore:

export FD_AS_ONLY_FLUSH=1

Reference test command:

python3 -m pytest tests/cache_manager/test_prefix_cache_manager.py -q
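The gating implied by this mode combines the environment switch with the attention_store backend check. A minimal sketch of that condition follows; only the env var name and the backend string come from the PR, the helper function itself is hypothetical.

```python
import os

def as_only_flush_enabled(kvcache_storage_backend: str) -> bool:
    # Both conditions appear in the PR's eviction path: the FD_AS_ONLY_FLUSH
    # env switch must be on AND the storage backend must be attention_store.
    return (
        os.getenv("FD_AS_ONLY_FLUSH", "0") == "1"
        and kvcache_storage_backend == "attention_store"
    )
```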

Accuracy Tests

N/A. This PR does not change model forward results or kernel numerical behavior. It only updates KV cache index flush metadata and adds unit tests for cache manager behavior.

Checklist

  • Add at least one tag to the PR title.
    • Suggested title: [KVCache] Support flush FD GPU/CPU Cache index by AttentionStore
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag. (N/A for current develop PR)

@paddle-bot

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-24 14:06:57

📋 Review Summary

PR overview: Adds flush-only KV cache index updates for AttentionStore's FD_AS_ONLY_FLUSH mode, covering both the request-finish and CPU-eviction scenarios.
Scope of changes: cache_manager/ (cache_tasks, cache_transfer_manager, prefix_cache_manager), envs.py, unit tests
Impact tag: [KVCache]


Issues

Severity | File | Summary
❓ Question | prefix_cache_manager.py:1556 | The semantics of node.input_ids are unclear; if it holds partial token ids, the flush index may be wrong
🟡 Suggestion | cache_transfer_manager.py:975 | When flush_cache_exists=None the code silently falls back to reside_in_gpu=True; add a warning log to guard against misuse
🟡 Suggestion | prefix_cache_manager.py:1590 | Add a comment explaining why is_sync=False, so maintainers do not change it by mistake

Overall Assessment

The overall design is clear, the field defaults are sensible and backward compatible, and the tests cover the main CPU-eviction scenarios. The key open question is the semantics of node.input_ids, which the author needs to confirm: if it holds partial token ids, AttentionStore index updates will be wrong. Recommend clarifying this before merging.

except Exception as e:
    logger.warning(f"[AS_ONLY_FLUSH] Failed to flush token index for task {task.task_id}, error: {e}")
result = (CacheStatus.GPU2STORAGE, task.task_id, task.keys, [])
self.cache_task_queue.swap_to_storage_barrier.wait()

🟡 Suggestion: the reside_in_gpu semantics in _flush_only_storage_task are questionable

When task.flush_cache_exists is None, the code falls back to reside_in_gpu=True. The comment says this scenario is "request finish", where the cache is indeed still on the GPU, so that looks reasonable. But None means "not applicable": if a caller outside the request-finish path forgets to set the field, True is passed silently here and produces an incorrect index state.

Consider adding a warning log, making it explicit that None is only allowed on the request-finish (non-flush-only) path:

if task.flush_cache_exists is None:
    logger.warning("[AS_ONLY_FLUSH] flush_cache_exists is None, treating as cache present (reside_in_gpu=True)")
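A self-contained version of the suggested guard, for illustration; the function name and the plain `logging` logger are assumptions, only the warning text and fallback come from the suggestion.

```python
import logging

logger = logging.getLogger("cache_transfer_manager")

def resolve_reside_in_gpu(flush_cache_exists):
    """Map the task's flush_cache_exists field to the reside_in_gpu argument."""
    if flush_cache_exists is None:
        # None should only occur on the request-finish path, where the cache
        # is still resident on GPU; warn so any misuse elsewhere is visible.
        logger.warning(
            "[AS_ONLY_FLUSH] flush_cache_exists is None, "
            "treating as cache present (reside_in_gpu=True)"
        )
        return True
    return flush_cache_exists
```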


self.recycle_cpu_blocks(node.block_id)
hash_value_block_ids_map[node.input_hash_value].extend(reversed(tmp_block_ids))
if envs.FD_AS_ONLY_FLUSH and self.kvcache_storage_backend == "attention_store":

❓ Question: does node.input_ids hold only the current node's token ids, or the full token ids for the whole path?

flush_token_index needs the complete token_ids to locate the index correctly. If input_ids is only the token slice added by the current node, the token_ids passed in will be incomplete and the AttentionStore index update will be wrong.

Please confirm the semantics of BlockNode.input_ids; if it holds partial token ids, the full path from root to the current node must be concatenated before use.
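To illustrate the concern: if each BlockNode only stored the tokens it adds, the full sequence would have to be rebuilt by walking parent links back to the root. The node structure below is invented for the sketch and does not reflect the real FastDeploy BlockNode.

```python
class BlockNode:
    def __init__(self, input_ids, parent=None):
        self.input_ids = input_ids  # tokens added by *this* node only (assumed)
        self.parent = parent

def full_token_ids(node):
    """Concatenate token ids along the root-to-node path."""
    parts = []
    while node is not None:
        parts.append(node.input_ids)
        node = node.parent
    # parts is leaf-to-root; reverse to get root-to-leaf order.
    return [tok for chunk in reversed(parts) for tok in chunk]
```

If input_ids already holds the full path, this reconstruction is unnecessary and the current code is correct; that is exactly what the review asks the author to confirm.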

task_id=str(uuid.uuid4()),
keys=[input_hash_value],
token_ids=token_ids,
gpu_block_ids=[],

🟡 Suggestion: add a comment explaining why is_sync=False, so maintainers do not change it by mistake

is_sync=False is used here because the flush does not need to wait for a result, unlike the flush on request completion (which is usually synchronous). Add a comment at the call site explaining why the task is asynchronous, so a later maintainer does not switch it to synchronous and introduce a deadlock or performance regression.

Also, tasks with is_sync=False do not register an Event in task_write_back_event; please confirm that downstream consumers (e.g. whatever consumes the transfer_done signal) correctly handle the case where only rank==0 emits the signal.
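The sync/async distinction described above can be sketched as follows; everything except the is_sync flag and the task_write_back_event name is hypothetical and simplified.

```python
import threading

class TaskQueue:
    def __init__(self):
        # Maps task_id -> Event for synchronous tasks awaiting write-back.
        self.task_write_back_event = {}

    def submit(self, task_id, is_sync):
        if is_sync:
            # Request-finish path: register an Event so the caller can block
            # until the flush/write completes.
            ev = threading.Event()
            self.task_write_back_event[task_id] = ev
            return ev
        # is_sync=False: flush-only eviction path. No Event is registered,
        # so nothing downstream must ever wait on this task_id.
        return None
```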

@codecov-commenter

Codecov Report

❌ Patch coverage is 48.38710% with 16 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4c8f7df). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
fastdeploy/cache_manager/cache_transfer_manager.py | 6.66% | 13 Missing and 1 partial ⚠️
fastdeploy/cache_manager/prefix_cache_manager.py | 84.61% | 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7609   +/-   ##
==========================================
  Coverage           ?   71.71%           
==========================================
  Files              ?      419           
  Lines              ?    57819           
  Branches           ?     9073           
==========================================
  Hits               ?    41465           
  Misses             ?    13527           
  Partials           ?     2827           
Flag | Coverage Δ
GPU | 71.71% <48.38%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
