feat(moderncolbert): batched POST /pooling support in IO processor #15
Open
ddickmann wants to merge 2 commits into
Lets a single ``POST /pooling`` call carry N texts instead of one,
fusing them into a single ``factory_pre_process`` step so vLLM's
continuous batcher can co-schedule them in one engine pass. For
high-fan-out callers (retrieval, late-interaction reranking,
groundedness scoring) that drops HTTP-side overhead from O(N) to
O(1) and lets the GPU stay saturated without client-side thread-pool
fan-out.
Wire format
-----------
The single-text shape is unchanged and byte-identical to the previous
release — same request body, same single-string base64 response — so
existing callers are not affected.
The new batched shape is opt-in:
    {"data": {"text": ["q1", "q2", ...],
              "is_query": [true, true, ...]},
     "model": "...", "task": "plugin"}
``is_query`` may also be a single bool that broadcasts across all
texts; mixed-type lists raise an explicit error so callers don't
silently mis-pack a batch. The response for batched requests is
``list[str]`` of base64-encoded multi-vector embeddings, in input
order. Empty batches raise.
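As a rough client-side illustration of the two shapes described above, the sketch below builds both request bodies with the standard library only. The helper names ``build_pooling_payload`` and ``post_pooling`` and the ``base_url`` parameter are hypothetical, and the model name is simply the one used in the PR description's example. The envelope around the returned base64 strings is not described here, so the sketch returns the decoded JSON body untouched.

```python
import json
import urllib.request
from typing import Any, Union


def build_pooling_payload(
    text: Union[str, list],
    is_query: Union[bool, list],
    model: str,
) -> dict:
    """Build a /pooling request body in either the single or the batched shape."""
    return {
        "data": {"text": text, "is_query": is_query},
        "model": model,
        "task": "plugin",
    }


def post_pooling(base_url: str, payload: dict, timeout: float = 30.0) -> Any:
    """POST the payload and return the decoded JSON body unchanged.

    ``base_url`` is a placeholder such as "http://localhost:8000" (assumption).
    The response envelope is not specified in this PR, so nothing is unpacked.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/pooling",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Legacy single-text shape, and the batched shape with a broadcast is_query flag.
single = build_pooling_payload("q1", True, "lightonai/LateOn")
batched = build_pooling_payload(["q1", "q2"], True, "lightonai/LateOn")
```

Only the type of ``text`` differs between the two bodies, which is what lets the server tell the shapes apart without a new endpoint or flag.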
Implementation
--------------
* ``ModernColBERTInput`` now carries a list of texts plus a
per-text ``is_query`` list and a ``batched`` flag recording the
request-level shape.
* ``factory_parse`` detects single vs. batched input from the type
of ``data['text']`` and validates ``is_query`` length consistency.
* ``factory_pre_process`` returns ``Sequence[TokensPrompt]`` for
batches and a single ``TokensPrompt`` for the legacy path
(preserving the upstream prompt-shape contract).
* ``factory_post_process`` returns ``str`` for legacy single-text
requests and ``list[str]`` when the request was batched (or when
multiple outputs come back), keyed off the stashed ``batched``
flag rather than relying on ``len(model_output)`` alone — that
way batches of size 1 still round-trip as ``list[str]``.
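The parsing rules in the bullets above can be sketched roughly as follows. This is not the plugin's code: ``ParsedInput`` is a hypothetical stand-in for ``ModernColBERTInput``, ``parse_data`` stands in for the relevant part of ``factory_parse``, and the exact checks and error messages are assumptions. Requiring the ``is_query`` key is also an assumption, since no default is stated.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ParsedInput:
    """Hypothetical stand-in for ModernColBERTInput."""
    texts: List[str]
    is_query: List[bool]
    batched: bool  # records the request-level shape, not the number of texts


def parse_data(data: Dict[str, Any]) -> ParsedInput:
    text = data["text"]
    # The request shape is decided by the type of data["text"] alone.
    batched = isinstance(text, list)
    texts = text if batched else [text]

    if not texts:
        raise ValueError("empty batch")
    if not all(isinstance(t, str) for t in texts):
        raise ValueError("every entry in 'text' must be a string")

    flags = data["is_query"]  # assumption: the key is required
    if isinstance(flags, bool):
        # A single bool broadcasts across all texts.
        flags = [flags] * len(texts)
    elif isinstance(flags, list):
        if not all(isinstance(f, bool) for f in flags):
            raise ValueError("every entry in 'is_query' must be a bool")
        if len(flags) != len(texts):
            raise ValueError("'is_query' length must match 'text' length")
    else:
        raise ValueError("'is_query' must be a bool or a list of bools")

    return ParsedInput(texts=list(texts), is_query=list(flags), batched=batched)
```

Keeping ``batched`` separate from ``len(texts)`` is what allows a one-element list to be told apart from a bare string later on.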
The ``[Q]`` / ``[D]`` prefix logic, max-length budgets (256 query
/ 8192 doc), pooling task name, and tokenizer settings are all
unchanged.
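The response-shape rule from the ``factory_post_process`` bullet can be illustrated with a minimal sketch. ``shape_response`` is a hypothetical helper, and it assumes the per-text embeddings have already been base64-encoded into strings.

```python
from typing import List, Union


def shape_response(encoded: List[str], batched: bool) -> Union[str, List[str]]:
    """Choose the response shape from the stashed flag, not from the length alone."""
    if not encoded:
        raise ValueError("no outputs to return")
    # A batched request always gets a list back, even when it held one text;
    # multiple outputs also force a list.
    if batched or len(encoded) > 1:
        return list(encoded)
    # Legacy single-text request: a bare string, as in the previous release.
    return encoded[0]
```

Keying only on ``len(encoded)`` would turn a batch of size 1 into a bare string, which is the case the ``batched`` flag exists to prevent.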
Why
---
Surfaced while wiring ``moderncolbert_io`` into latence-trace's
groundedness scorer: a typical request produces ~30 chunks that all
need to be encoded by the same model, and N concurrent
HTTP-per-chunk calls were burning more time on syscalls and
serialization than on GPU work. Single-call batching collapses
that overhead and gives the engine batcher the freedom to fuse
prompts within a request as well as across them.
Pure formatting fixup so CI's ``ruff format --check forge/ plugins/ vllm_factory/`` job passes — no behavioural change.
Summary
-------
Adds opt-in batched-input support to the ``moderncolbert_io`` IOProcessor. A single ``POST /pooling`` call can now carry N texts, and the processor decomposes them into N prompts inside one ``factory_pre_process`` step, letting vLLM's continuous batcher co-schedule them in one engine pass.
Why
---
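Surfaced while wiring ``moderncolbert_io`` into a high-fan-out groundedness scorer where one request typically encodes ~30 chunks with the same model. With the existing one-text-per-call contract, the choice was either to issue the calls one at a time, leaving the GPU underused, or to fan them out concurrently from the client, spending more time on syscalls and serialization than on GPU work.
Single-call batching collapses both problems and gives the engine batcher the freedom to fuse prompts within a request as well as across them.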
Wire format
-----------
The single-text shape is unchanged and byte-identical to the previous release: same request body, same single-string base64 response. Existing callers are not affected.
The new batched shape is opt-in:
    {
      "data": {
        "text": ["q1", "q2", "..."],
        "is_query": [true, true, true]
      },
      "model": "lightonai/LateOn",
      "task": "plugin"
    }
``is_query`` may also be a single bool that broadcasts across all texts. The response for batched requests is ``list[str]`` of base64-encoded multi-vector embeddings, in input order.
Implementation notes
--------------------
* ``ModernColBERTInput`` now carries a list of texts plus a per-text ``is_query`` list and a ``batched`` flag recording the request-level shape.
* ``factory_parse`` detects single vs. batched input from the type of ``data['text']`` and validates ``is_query`` length consistency.
* ``factory_pre_process`` returns ``Sequence[TokensPrompt]`` for batches and a single ``TokensPrompt`` for the legacy path (preserving the upstream prompt-shape contract).
* ``factory_post_process`` returns ``str`` for legacy single-text requests and ``list[str]`` when the request was batched (or when multiple outputs come back), keyed off the stashed ``batched`` flag rather than relying on ``len(model_output)`` alone, so batches of size 1 still round-trip as ``list[str]``.
The ``[Q]`` / ``[D]`` prefix logic, max-length budgets (256 query / 8192 doc), pooling task name, and tokenizer settings are all unchanged.
Backward compatibility
----------------------
* ``{"text": "abc", "is_query": true}`` returns ``str(b64)``, as before ✅
* ``{"text": "abc", "is_query": [true]}`` returns ``str(b64)``
* ``{"text": ["abc"], "is_query": true}`` returns ``["..."]``
* ``{"text": ["a","b"], "is_query": [true,true]}`` returns ``["...","..."]``
Single-text path is fully preserved; lists are a strict superset.
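For callers that want to handle both response shapes uniformly, a small normalising helper along these lines may be useful. ``decode_pooling_result`` is a hypothetical name. The byte layout of the embeddings (dtype, shape) is not stated here, so the sketch stops at raw bytes.

```python
import base64
from typing import List, Union


def decode_pooling_result(result: Union[str, List[str]]) -> List[bytes]:
    """Normalise a single-string or list response into a list of raw byte payloads."""
    # A bare string is the legacy single-text response; wrap it so both
    # shapes are handled by the same loop.
    items = [result] if isinstance(result, str) else list(result)
    return [base64.b64decode(item) for item in items]


# Dummy payloads standing in for real embeddings.
one = base64.b64encode(b"\x00\x01").decode("ascii")
two = base64.b64encode(b"\x02\x03").decode("ascii")
legacy_payloads = decode_pooling_result(one)
batched_payloads = decode_pooling_result([one, two])
```

Because the batched response preserves input order, the i-th payload corresponds to the i-th text in the request.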
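Test plan
---------
* Single-text request returns a ``str`` response (regression check vs. existing client code)
* Batched request returns ``list[str]`` of length N in input order
* Mixed-type entries in ``text`` raise ``ValueError``
* Length-mismatched ``is_query`` list raises ``ValueError``
* Empty ``text`` batch raises ``ValueError``
* Scalar ``is_query`` broadcasts across a list of texts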