WIP - do not merge - Vllm v1 hidden states by kyle-pena-kuzco · Pull Request #1 · context-labs/vllm

kyle-pena-kuzco · 2025-06-05T20:50:08Z

No description provided.

github-actions · 2025-06-05T20:50:21Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

kyle-pena-kuzco · 2025-06-06T16:21:27Z

Here is a diagram for our implementation.

sequenceDiagram
   participant async_llm
   participant EngineCore
   participant GpuModelRunner/TpuModelRunner
   EngineCore->>async_llm:EngineCoreOutput
   note over async_llm: Is Sequence Complete?  If yes...
   note over async_llm: Is server enable_return_hidden_states?  If yes...
   note over async_llm: Are hidden states requested?  If yes...
   async_llm->>EngineCore:send HiddenStatesExtractionRequest
   note over EngineCore:  create prefill-only EngineCoreRequest (prompt_token_ids=prompt+response)
   EngineCore->>GpuModelRunner/TpuModelRunner: EngineCoreRequest
   note over GpuModelRunner/TpuModelRunner: Slice out hidden states for only the last token
   note over GpuModelRunner/TpuModelRunner: Move slice (1, D) to CPU
   note over GpuModelRunner/TpuModelRunner: Include in ModelRunnerOutput as List[float] (~77 KB)
   GpuModelRunner/TpuModelRunner->>EngineCore: ModelRunnerOutput
   EngineCore->>async_llm: EngineCoreOutput
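
The gating logic in the diagram can be sketched in plain Python. This is a minimal illustration under assumptions, not the code in this PR: the class names `FinishedSequence` and `PrefillOnlyRequest`, their fields, and the request-id suffix are all hypothetical. Only the three checks and the `prompt + response` concatenation come from the diagram.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FinishedSequence:
    """Hypothetical stand-in for what async_llm knows about a sequence."""
    request_id: str
    prompt_token_ids: List[int]
    response_token_ids: List[int]
    finished: bool
    return_hidden_states: bool  # assumed per-request opt-in field


@dataclass
class PrefillOnlyRequest:
    """Hypothetical shape of the follow-up prefill-only request."""
    request_id: str
    prompt_token_ids: List[int]


def maybe_build_extraction_request(
    seq: FinishedSequence,
    server_enable_return_hidden_states: bool,
) -> Optional[PrefillOnlyRequest]:
    """Apply the three checks from the diagram, in order."""
    if not seq.finished:
        return None  # sequence is not complete yet
    if not server_enable_return_hidden_states:
        return None  # server-level flag is off
    if not seq.return_hidden_states:
        return None  # this request did not ask for hidden states
    # Prefill over prompt + response, so the last position corresponds
    # to the final generated token.
    return PrefillOnlyRequest(
        request_id=f"{seq.request_id}-hidden-states",  # hypothetical naming
        prompt_token_ids=seq.prompt_token_ids + seq.response_token_ids,
    )
```

A request is only produced when all three conditions hold, so servers that do not enable the flag never pay for the extra prefill pass.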

Here is some analysis of internal serialization costs.

After a sequence completes, if hidden states were requested, a single List[float] is serialized over ZMQ for that sequence in response to a HiddenStatesExtractionRequest. This list holds the hidden states of the selected token.
The length of the List[float] is D, the hidden dimension of the model. For example, for Llama-3.1-8B-Instruct, D is 4,096.
This comes out to about 77 KB in raw float bytes for Llama-3.1-8B-Instruct, and similar sizes for other models (ranging from about 50 KB to 110 KB in raw float bytes).
The payload size for the List[float] is comparable to that of other currently supported features such as top_logprobs. For example, for Llama-3.1-8B-Instruct it is smaller than the payload for returning top-2 logprobs on a typical 500-token response (per v1/logprobs.py).
We minimize the GPU-to-CPU transfer cost by slicing out only the requested token's hidden states from the full hidden states tensor before moving them to the CPU (kilobytes), rather than moving the entire last-layer hidden states tensor (megabytes).
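
The kilobytes-versus-megabytes comparison can be checked with back-of-the-envelope arithmetic. The sketch below uses D = 4,096 from the text; the sequence length of 600 tokens and the 2-byte element size are assumptions chosen for illustration. It also prints the size of one D-length vector under a few encodings, since the serialized size depends on how the list is encoded, which the text does not specify. It does not attempt to reproduce the sizes quoted above.

```python
import json
import random
import struct


def tensor_bytes(num_tokens: int, hidden_dim: int, bytes_per_element: int) -> int:
    """Size in bytes of a dense (num_tokens, hidden_dim) tensor."""
    return num_tokens * hidden_dim * bytes_per_element


def serialized_sizes(values):
    """Byte counts for one vector under three illustrative encodings."""
    n = len(values)
    return {
        "packed float32": len(struct.pack(f"<{n}f", *values)),
        "packed float64": len(struct.pack(f"<{n}d", *values)),
        "JSON text": len(json.dumps(values).encode("utf-8")),
    }


HIDDEN_DIM = 4096        # D, from the text above
ASSUMED_TOKENS = 600     # assumption: prompt + response length
ASSUMED_ELEM_BYTES = 2   # assumption: 16-bit activations on the GPU

full = tensor_bytes(ASSUMED_TOKENS, HIDDEN_DIM, ASSUMED_ELEM_BYTES)
one_row = tensor_bytes(1, HIDDEN_DIM, ASSUMED_ELEM_BYTES)
print(f"full last-layer tensor: {full:,} bytes")
print(f"single-token slice:     {one_row:,} bytes")

# Random placeholder values; real hidden states would differ.
rng = random.Random(0)
vector = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]
for name, size in serialized_sizes(vector).items():
    print(f"{name}: {size:,} bytes")
```

Under these assumptions the slice is smaller than the full tensor by a factor equal to the number of tokens, which is why slicing before the device-to-host copy matters more as sequences get longer.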

kyle-pena-kuzco added 6 commits June 3, 2025 23:06

checkpointing before implementing rest of hidden states

722b739

checkpoint

9b257ae

core engine hidden states implementation possibly complete

37e424f

another checkpoint - partial API integration

dd65e97

checkpointing on hidden states extraction

5c2e114

implemented true test of hidden states core engine functionality

0b138e3

implemented basic API support. streaming to follow.

dd34eff

kyle-pena-kuzco added 16 commits June 6, 2025 19:45

cleaned up several unneeded files. fixed some other bugs.

c5d164f

removal of more unneeded stuff

afcae9f

removed more stuff

ca4a83a

continuing cleanup and centralization of tests

b55a6ed

more cleanup, expanded test coverage

8b513e1

more cleanup of unneeded test files

0c09f1e

fixed chat completion streaming test (although the hidden states dictionary on res is still keyed incorrectly)

c43b5eb

fixed streaming api tests

8daca13

changed property name to be more understandable

e9f7c65

some progress on hidden states being keyed by req id instead of token position

fdcc2ff

fixed streaming for chat completions and completions

88d1f44

removed print statements

a67c221

implemented server level flag for enabling/disabling return hidden states

8a16a81

pushed the engine flag validation further down the stack so that raw engine requests still validate the enable_return_hidden_states flag

461261f

more changes to test coverage

4c58a97

fixes for the gathering of hidden states and their consumption in the API layer

fe78b9e

kyle-pena-kuzco wants to merge 23 commits into main from vllm-v1-hidden-states
