Don't schedule prefill unless there is enough kv cache left to fulfil decode stage by AleksKnezevic · Pull Request #5467 · tenstorrent/tt-xla

AleksKnezevic · 2026-06-30T15:33:22Z

Ticket

Problem description

When generating sequences with a high output sequence length (OSL), a prefill that is scheduled while little KV cache is available gets serviced and then immediately discarded, because not enough cache remains to fulfil the decode stage.

What's changed

Added a minimum KV cache headroom that must be available before a prefill is scheduled. It defaults to 25% and is controllable via TT_XLA_PREFILL_KV_WATERMARK_PERCENT.
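For reviewers, a minimal sketch of the gating idea described above. It is not the code in this PR: the function name, the block-based accounting, and the reading of the 25% as a fraction of the total KV cache that must remain free after admitting the prompt are all assumptions made for illustration.

```python
def can_schedule_prefill(
    free_blocks: int,
    total_blocks: int,
    prompt_blocks: int,
    watermark: float = 0.25,
) -> bool:
    """Hypothetical admission check for a prefill request.

    Assumption: the watermark is the fraction of the total KV cache that
    must still be free after the prompt's blocks are allocated, so the
    decode stage has room to grow.
    """
    if total_blocks <= 0:
        return False
    # Blocks that must stay free once the prompt has been admitted.
    reserve = watermark * total_blocks
    return free_blocks - prompt_blocks >= reserve


# Plenty of headroom: 100 free, prompt needs 10, 90 remain (>= 25).
print(can_schedule_prefill(free_blocks=100, total_blocks=100, prompt_blocks=10))
# Too little headroom: 30 free, prompt needs 10, 20 remain (< 25).
print(can_schedule_prefill(free_blocks=30, total_blocks=100, prompt_blocks=10))
```

With a check of this shape, a prefill that would otherwise be serviced and then discarded is simply left in the queue until enough cache frees up.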

Checklist

New/Existing tests provide coverage for changes


kmabeeTT

Awesome, looks good. This should let me revert the max_concurrency=32 -> 8 drop I had to make as a workaround in tt-inference-server, which was needed to get the Qwen3-8B r1_gpqa_diamond eval to finish in 80 minutes instead of timing out (see tenstorrent/tt-inference-server#4131 (comment)).

mmanzoorTT · 2026-06-30T16:40:42Z

+        # See TTConfig.prefill_kv_watermark.
+        if additional_config.get("prefill_kv_watermark") is None:
+            additional_config["prefill_kv_watermark"] = TTConfig.prefill_kv_watermark
+        env_wm = os.environ.get("TT_XLA_PREFILL_KV_WATERMARK_PERCENT")


We mostly name our env variables with the TTXLA_*** prefix.

Suggested change

-        env_wm = os.environ.get("TT_XLA_PREFILL_KV_WATERMARK_PERCENT")
+        env_wm = os.environ.get("TTXLA_PREFILL_KV_WATERMARK_PERCENT")
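To make the config/env interaction in this hunk concrete, here is a hedged sketch of how the watermark could be resolved. Only the lookup of `prefill_kv_watermark` and the `os.environ.get` call come from the diff; the helper name, the `DEFAULT_WATERMARK` constant standing in for `TTConfig.prefill_kv_watermark`, the assumption that the env var takes precedence, and the percent-to-fraction conversion are illustrative guesses. The variable name is a parameter so the sketch does not depend on which spelling lands.

```python
import os
from typing import Mapping

# Assumed stand-in for TTConfig.prefill_kv_watermark (the PR states a 25% default).
DEFAULT_WATERMARK = 0.25


def resolve_prefill_kv_watermark(
    additional_config: dict,
    environ: Mapping[str, str] = os.environ,
    env_name: str = "TTXLA_PREFILL_KV_WATERMARK_PERCENT",
) -> float:
    """Hypothetical helper: return the watermark as a fraction in [0, 1]."""
    value = additional_config.get("prefill_kv_watermark")
    if value is None:
        value = DEFAULT_WATERMARK

    # Assumption: an explicitly set env var overrides the config value.
    env_wm = environ.get(env_name)
    if env_wm is not None:
        percent = float(env_wm)  # raises ValueError on non-numeric input
        if not 0.0 <= percent <= 100.0:
            raise ValueError(f"{env_name} must be in [0, 100], got {env_wm!r}")
        value = percent / 100.0
    return value


# Passing an explicit mapping keeps the example independent of the real environment.
print(resolve_prefill_kv_watermark({}, environ={}))
print(resolve_prefill_kv_watermark({}, environ={"TTXLA_PREFILL_KV_WATERMARK_PERCENT": "40"}))
```

Validating the range up front means a typo such as `250` fails loudly at startup rather than silently blocking every prefill.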

codecov-commenter · 2026-06-30T16:46:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 33.82%. Comparing base (9b0c875) to head (11f9734).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5467      +/-   ##
==========================================
- Coverage   33.84%   33.82%   -0.03%     
==========================================
  Files          37       37              
  Lines        4990     4990              
==========================================
- Hits         1689     1688       -1     
- Misses       3301     3302       +1


AleksKnezevic added 2 commits June 30, 2026 15:15

Don't schedule prefill unless there is enough kv cache left to fulfil…

23f6ce8

…l decode stage

Change env var

7737970

AleksKnezevic requested review from alinakhanTT, kmabeeTT, ljovanovicTT and mmanzoorTT as code owners June 30, 2026 15:33

kmabeeTT approved these changes Jun 30, 2026


Comment thread integrations/vllm_plugin/vllm_tt/platform.py

Address feedback

11f9734

mmanzoorTT approved these changes Jun 30, 2026



AleksKnezevic wants to merge 3 commits into main from aknezevic/prefill_schedule
