Optimize routing for multimodal inputs with image/video URLs in the request #2666
rahulgurnani wants to merge 3 commits into kubernetes-sigs:main
type ContentBlock struct {
	Type     string     `json:"type"`
	Text     string     `json:"text,omitempty"`
	ImageURL ImageBlock `json:"image_url,omitempty"`
I'm wondering: did you just prepend the image URL to the text? If so, this will probably cause issues when multiple image URLs share the same prefix, such as gs://image-repo/xxxx.
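The concern can be illustrated with a minimal sketch. It assumes the scorer hashes its input in fixed-size chained blocks; the `blockHashes` and `sharedBlocks` helpers, the 8-byte block size, and the FNV hash are all hypothetical choices made for illustration, not the scorer's actual implementation. Under those assumptions, two different images stored under the same bucket still match on their leading blocks:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// blockHashes (hypothetical) splits s into full blocks of blockSize bytes and
// returns one hash per block, each chained with the hash of the previous block.
func blockHashes(s string, blockSize int) []uint64 {
	var out []uint64
	var prev uint64
	for i := 0; i+blockSize <= len(s); i += blockSize {
		h := fnv.New64a()
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], prev)
		h.Write(buf[:])                     // chain in the previous block's hash
		h.Write([]byte(s[i : i+blockSize])) // then the current block's bytes
		prev = h.Sum64()
		out = append(out, prev)
	}
	return out
}

// sharedBlocks counts how many leading block hashes two inputs have in common.
func sharedBlocks(a, b []uint64) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	const blockSize = 8 // assumed block size, in bytes
	cat := "gs://image-repo/cat.png" + "Describe this image."
	dog := "gs://image-repo/dog.png" + "Describe this image."
	// The 16-byte bucket prefix fills two whole blocks, so both match
	// even though the two requests reference different images.
	fmt.Println(sharedBlocks(blockHashes(cat, blockSize), blockHashes(dog, blockSize)))
}
```

In this sketch the partial match comes entirely from the shared bucket path, so a scorer built this way would report some prefix overlap between requests whose images have nothing in common.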
What type of PR is this?
Prefixing the text with the image/video URL in the prefix cache scorer leads to better performance, i.e. higher throughput and lower response latency.
This is an interim change until the design in #2665 is finalised.
What this PR does / why we need it:
The prefix cache scorer now prefixes its input with the image/video URL so that it also indexes on the encoder cache. This is especially valuable for multimodal inputs that reference media by URL.
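As a rough picture of the idea, the sketch below flattens a request's content blocks into a single string with the media URLs placed ahead of the text. It is an assumption-laden illustration, not the code in this PR: the `ImageBlock` definition (a single `url` field) and the `prefixKey` helper are hypothetical, and only the `ContentBlock` fields come from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ImageBlock is assumed to carry only a URL; the PR's real definition is not shown.
type ImageBlock struct {
	URL string `json:"url,omitempty"`
}

// ContentBlock mirrors the fields visible in the diff under review.
type ContentBlock struct {
	Type     string     `json:"type"`
	Text     string     `json:"text,omitempty"`
	ImageURL ImageBlock `json:"image_url,omitempty"`
}

// prefixKey (hypothetical) builds the string handed to the scorer:
// every media URL first, followed by the text parts in their original order.
func prefixKey(blocks []ContentBlock) string {
	var urls, texts []string
	for _, b := range blocks {
		if b.ImageURL.URL != "" {
			urls = append(urls, b.ImageURL.URL)
		}
		if b.Text != "" {
			texts = append(texts, b.Text)
		}
	}
	return strings.Join(append(urls, texts...), " ")
}

func main() {
	raw := `[{"type":"text","text":"Describe this image."},
	         {"type":"image_url","image_url":{"url":"gs://image-repo/cat.png"}}]`
	var blocks []ContentBlock
	if err := json.Unmarshal([]byte(raw), &blocks); err != nil {
		panic(err)
	}
	fmt.Println(prefixKey(blocks))
}
```

With a key shaped like this, two requests that ask different questions about the same URL begin with the same characters, which is the case a prefix-based scorer can exploit.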
Benchmarking results
Adding the URL as a prefix appears to improve throughput and TTFT by roughly 40%. These numbers compare the changes in this PR against EPP on main: same_url_multi_text.py takes about 14.41 seconds on main and about 9.06 seconds with this PR, a reduction of about 37% in run time.
Which issue(s) this PR fixes:
This PR is related to #2665; the two changes complement each other.
Does this PR introduce a user-facing change?: