feat(moderncolbert): batched POST /pooling support in IO processor by ddickmann · Pull Request #15 · latenceainew/vllm-factory

ddickmann · 2026-04-23T11:07:39Z

Summary

Adds opt-in batched-input support to the moderncolbert_io IOProcessor. A single POST /pooling call can now carry N texts and the processor decomposes them into N prompts inside one factory_pre_process step, letting vLLM's continuous batcher co-schedule them in one engine pass.

Why

Surfaced while wiring moderncolbert_io into a high-fan-out groundedness scorer where one request typically encodes ~30 chunks with the same model. With the existing one-text-per-call contract, the choice was either:

N concurrent HTTP calls per request (CPU-bound on syscalls + serialization, blocks the engine batcher from co-scheduling them), or
N serial HTTP calls, which puts a floor of N round trips on request latency.

Single-call batching collapses both problems and gives the engine batcher the freedom to fuse prompts within a request as well as across them.

Wire format

The single-text shape is unchanged and byte-identical to the previous release — same request body, same single-string base64 response — so existing callers are not affected.

The new batched shape is opt-in:

{
  "data": {
    "text": ["q1", "q2", "..."],
    "is_query": [true, true, true]
  },
  "model": "lightonai/LateOn",
  "task": "plugin"
}

is_query may also be a single bool that broadcasts across all texts.
Mixed-type lists raise a ValueError so callers don't silently mis-pack a batch.
Empty batches raise a ValueError as well.
The response for batched requests is list[str] of base64-encoded multi-vector embeddings, in input order.
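A client-side sketch of the two request shapes described above. `build_pooling_payload` is a hypothetical helper, not part of the plugin; it only assembles the JSON body and does not send it. It also assumes the legacy single-text request uses the same `model` / `task` envelope as the batched example.

```python
import json
from typing import Union


def build_pooling_payload(
    text: Union[str, list[str]],
    is_query: Union[bool, list[bool]],
    model: str = "lightonai/LateOn",
) -> dict:
    """Assemble a POST /pooling body (hypothetical helper, for illustration)."""
    return {
        "data": {"text": text, "is_query": is_query},
        "model": model,
        "task": "plugin",
    }


# Legacy single-text shape: the response carries one base64 string.
single = build_pooling_payload("what is late interaction?", True)

# Batched shape: a list of texts; a lone bool broadcasts across all of them.
batched = build_pooling_payload(["q1", "q2", "q3"], True)

print(json.dumps(batched, indent=2))
```

Whether a caller gets `str` or `list[str]` back depends only on whether `text` was sent as a string or a list, so a client can pick its decoding path from the request it built.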

Implementation notes

ModernColBERTInput now carries a list of texts plus a per-text is_query list and a batched flag recording the request-level shape.
factory_parse detects single vs. batched input from the type of data['text'] and validates is_query length consistency.
factory_pre_process returns Sequence[TokensPrompt] for batches and a single TokensPrompt for the legacy path (preserving the upstream prompt-shape contract).
factory_post_process returns str for legacy single-text requests and list[str] when the request was batched (or when multiple outputs come back), keyed off the stashed batched flag rather than relying on len(model_output) alone — that way batches of size 1 still round-trip as list[str].

The [Q] / [D] prefix logic, max-length budgets (256 query / 8192 doc), pooling task name, and tokenizer settings are all unchanged.
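The parsing rules above can be sketched roughly as follows. This is not the plugin's code: `ParsedInput` is a hypothetical stand-in for `ModernColBERTInput`, `parse_data` stands in for the validation inside `factory_parse`, and the error messages are invented. The PR does not say what happens when `is_query` is omitted, so the sketch simply requires it.

```python
from dataclasses import dataclass


@dataclass
class ParsedInput:
    """Hypothetical stand-in for ModernColBERTInput."""

    texts: list[str]
    is_query: list[bool]
    batched: bool  # records the request-level shape, not len(texts)


def parse_data(data: dict) -> ParsedInput:
    """Normalise single-text and batched payloads into one internal shape."""
    text = data["text"]
    flags = data["is_query"]  # assumption: required in this sketch

    if isinstance(text, str):
        texts, batched = [text], False
    elif isinstance(text, list):
        if not text:
            raise ValueError("empty text batch")
        if not all(isinstance(t, str) for t in text):
            raise ValueError("text list must contain only strings")
        texts, batched = list(text), True
    else:
        raise ValueError("text must be a string or a list of strings")

    if isinstance(flags, bool):
        # A single bool broadcasts across every text in the request.
        flags = [flags] * len(texts)
    elif isinstance(flags, list):
        if not all(isinstance(f, bool) for f in flags):
            raise ValueError("is_query list must contain only bools")
        if len(flags) != len(texts):
            raise ValueError("is_query length does not match text length")
    else:
        raise ValueError("is_query must be a bool or a list of bools")

    return ParsedInput(texts=texts, is_query=list(flags), batched=batched)
```

Storing `batched` separately from the list length is what lets a one-element list stay distinguishable from a bare string later in the pipeline.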

Backward compatibility

| Caller shape | Before | After |
| --- | --- | --- |
| `{"text": "abc", "is_query": true}` | `str` (b64) | `str` (b64) ✅ |
| `{"text": "abc", "is_query": [true]}` | error | `str` (b64) |
| `{"text": ["abc"], "is_query": true}` | error | `["..."]` |
| `{"text": ["a","b"], "is_query": [true,true]}` | error | `["...","..."]` |

Single-text path is fully preserved; lists are a strict superset.
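The response-shape rule behind the table can be sketched as below. `shape_response` is a hypothetical name for the decision that `factory_post_process` is described as making; the strings are placeholders rather than real embeddings, and the sketch assumes at least one output comes back.

```python
from typing import Union


def shape_response(encoded: list[str], batched: bool) -> Union[str, list[str]]:
    """Pick the response shape from the request shape, not the output count."""
    if batched or len(encoded) > 1:
        # Batched requests always answer with a list, even for one text.
        return list(encoded)
    # Legacy single-text request: return the bare base64 string.
    return encoded[0]


legacy = shape_response(["AAAA"], batched=False)
batch_of_one = shape_response(["AAAA"], batched=True)
batch_of_two = shape_response(["AAAA", "BBBB"], batched=True)
```

Here `legacy` is a plain string while `batch_of_one` is a one-element list, which is the distinction that checking `len(model_output)` alone could not make.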

Test plan

CI passes
Single-text request still returns a str response (regression check vs. existing client code)
Batched request with N texts returns list[str] of length N in input order
Mixed-type lists in text raise ValueError
Length-mismatched is_query list raises ValueError
Empty text batch raises ValueError
Bool is_query broadcasts across a list of texts

Lets a single ``POST /pooling`` call carry N texts instead of one, fusing them into a single ``factory_pre_process`` step so vLLM's continuous batcher can co-schedule them in one engine pass. For high-fan-out callers (retrieval, late-interaction reranking, groundedness scoring) that drops HTTP-side overhead from O(N) to O(1) and lets the GPU stay saturated without client-side thread-pool fan-out.

Wire format
-----------

The single-text shape is unchanged and byte-identical to the previous release (same request body, same single-string base64 response), so existing callers are not affected.

The new batched shape is opt-in:

    {"data": {"text": ["q1", "q2", ...], "is_query": [true, true, ...]}, "model": "...", "task": "plugin"}

``is_query`` may also be a single bool that broadcasts across all texts; mixed-type lists raise an explicit error so callers don't silently mis-pack a batch. The response for batched requests is ``list[str]`` of base64-encoded multi-vector embeddings, in input order. Empty batches raise.

Implementation
--------------

* ``ModernColBERTInput`` now carries a list of texts plus a per-text ``is_query`` list and a ``batched`` flag recording the request-level shape.
* ``factory_parse`` detects single vs. batched input from the type of ``data['text']`` and validates ``is_query`` length consistency.
* ``factory_pre_process`` returns ``Sequence[TokensPrompt]`` for batches and a single ``TokensPrompt`` for the legacy path (preserving the upstream prompt-shape contract).
* ``factory_post_process`` returns ``str`` for legacy single-text requests and ``list[str]`` when the request was batched (or when multiple outputs come back), keyed off the stashed ``batched`` flag rather than relying on ``len(model_output)`` alone, so batches of size 1 still round-trip as ``list[str]``.

The ``[Q]`` / ``[D]`` prefix logic, max-length budgets (256 query / 8192 doc), pooling task name, and tokenizer settings are all unchanged.

Why
---

Surfaced while wiring ``moderncolbert_io`` into latence-trace's groundedness scorer: a typical request produces ~30 chunks that all need to be encoded by the same model, and N concurrent HTTP-per-chunk calls were burning more time on syscalls and serialization than on GPU work. Single-call batching collapses that overhead and gives the engine batcher the freedom to fuse prompts within a request as well as across them.


ddickmann added 2 commits April 23, 2026 11:07

chore(format): ruff-format moderncolbert io_processor.py

e558fd9

Pure formatting fixup so CI's ``ruff format --check forge/ plugins/ vllm_factory/`` job passes — no behavioural change.

ddickmann wants to merge 2 commits into main from feat/moderncolbert-batched-io
