Don't schedule prefill unless there is enough KV cache left to fulfil decode stage #5467

Open
AleksKnezevic wants to merge 3 commits into main from aknezevic/prefill_schedule

AleksKnezevic commented Jun 30, 2026

Contributor

Ticket

Link to Github Issue

Problem description

When generating sequences with a high output sequence length (OSL), a prefill that is scheduled while KV cache availability is low gets serviced and then immediately discarded, because not enough cache remains to fulfill the decode stage. The prefill work is wasted.

What's changed

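Added a minimum KV cache headroom (a watermark) that must be available before a prefill is scheduled. It defaults to 25% and is controllable via the TT_XLA_PREFILL_KV_WATERMARK_PERCENT environment variable.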
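
For illustration, the admission rule could look roughly like the sketch below. This is not the code from this PR: the function name, the block-based accounting, and the choice to apply the watermark to the cache left over after the prefill allocation are all assumptions made for the example. Only the 25% default comes from the description above.

```python
import math


def can_schedule_prefill(
    free_blocks: int,
    total_blocks: int,
    prefill_blocks: int,
    watermark: float = 0.25,  # 25% default, as stated in the PR description
) -> bool:
    """Hypothetical admission check, not the plugin's actual scheduler code.

    Admit a prefill only if, after allocating its KV cache blocks, at least
    `watermark` of the total cache is still free for the decode stage.
    """
    if not 0.0 <= watermark <= 1.0:
        raise ValueError("watermark must be a fraction in [0, 1]")
    # Number of blocks that must stay free after the prefill is admitted.
    reserved = math.ceil(watermark * total_blocks)
    return free_blocks - prefill_blocks >= reserved
```

With 100 total blocks and the default watermark, a prefill that needs 10 blocks would be admitted when 35 blocks are free, but deferred when only 30 are free.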

Checklist

  • New/Existing tests provide coverage for changes

kmabeeTT left a comment

Contributor

Awesome, looks good. This should let me undo the max_concurrency=32 -> 8 drop I had to make as a workaround in tt-inference-server, which was needed so that Qwen3-8B running the r1_gpqa_diamond eval takes 80 minutes instead of timing out for this effort (tenstorrent/tt-inference-server#4131 (comment)).

Comment thread integrations/vllm_plugin/vllm_tt/platform.py
# See TTConfig.prefill_kv_watermark.
if additional_config.get("prefill_kv_watermark") is None:
    additional_config["prefill_kv_watermark"] = TTConfig.prefill_kv_watermark
env_wm = os.environ.get("TT_XLA_PREFILL_KV_WATERMARK_PERCENT")

Contributor

We mostly name environment variables with the TTXLA_ prefix (TTXLA_***).

Suggested change
- env_wm = os.environ.get("TT_XLA_PREFILL_KV_WATERMARK_PERCENT")
+ env_wm = os.environ.get("TTXLA_PREFILL_KV_WATERMARK_PERCENT")
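
Whichever variable name is chosen, the override still has to be converted from a percent string into the fraction stored in the config. A rough sketch of that resolution step follows. The helper name `resolve_watermark`, the precedence of the environment over an explicit config value, and the range check are assumptions for illustration and are not taken from the diff. The variable name is passed in as a parameter so the sketch does not depend on the naming decision.

```python
import os
from typing import Mapping, Optional

DEFAULT_WATERMARK = 0.25  # 25% default, per the PR description


def resolve_watermark(
    env_name: str,
    config_value: Optional[float] = None,
    environ: Mapping[str, str] = os.environ,
) -> float:
    """Hypothetical helper: return the watermark as a fraction in [0, 1].

    Assumed precedence: environment override, then explicit config value,
    then the default.
    """
    raw = environ.get(env_name)
    if raw is not None:
        percent = float(raw)  # raises ValueError for non-numeric input
        if not 0.0 <= percent <= 100.0:
            raise ValueError(f"{env_name} must be between 0 and 100, got {raw!r}")
        return percent / 100.0
    if config_value is not None:
        return config_value
    return DEFAULT_WATERMARK
```

Failing loudly on a malformed or out-of-range value is a design choice of this sketch. Silently falling back to the default would also be defensible, but it can hide a mistyped setting.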

@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 33.82%. Comparing base (9b0c875) to head (11f9734).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5467      +/-   ##
==========================================
- Coverage   33.84%   33.82%   -0.03%     
==========================================
  Files          37       37              
  Lines        4990     4990              
==========================================
- Hits         1689     1688       -1     
- Misses       3301     3302       +1     




4 participants