feat(moderncolbert): batched POST /pooling support in IO processor #15

Open
ddickmann wants to merge 2 commits into main from feat/moderncolbert-batched-io

@ddickmann

Collaborator

Summary

Adds opt-in batched-input support to the moderncolbert_io IOProcessor. A single POST /pooling call can now carry N texts and the processor decomposes them into N prompts inside one factory_pre_process step, letting vLLM's continuous batcher co-schedule them in one engine pass.

Why

Surfaced while wiring moderncolbert_io into a high-fan-out groundedness scorer where one request typically encodes ~30 chunks with the same model. With the existing one-text-per-call contract, the choice was either:

  • N concurrent HTTP calls per request (CPU-bound on syscalls and serialization, and blocks the engine batcher from co-scheduling them), or
  • N sequential HTTP calls, which impose a serial latency floor.

Single-call batching collapses both problems and gives the engine batcher the freedom to fuse prompts within a request as well as across them.

Wire format

The single-text shape is unchanged and byte-identical to the previous release — same request body, same single-string base64 response — so existing callers are not affected.

The new batched shape is opt-in:

{
  "data": {
    "text": ["q1", "q2", "..."],
    "is_query": [true, true, true]
  },
  "model": "lightonai/LateOn",
  "task": "plugin"
}

  • is_query may also be a single bool, which is broadcast across all texts.
  • Mixed-type lists raise an explicit error so callers don't silently mis-pack a batch.
  • Empty batches raise an error.
  • The response for a batched request is a list[str] of base64-encoded multi-vector embeddings, in input order.
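
For callers, a minimal sketch of building the batched request body. `build_pooling_payload` is a hypothetical client-side helper, not part of this PR; it only mirrors the shape and rules listed above and does no network I/O.

```python
import json
from typing import Sequence, Union


def build_pooling_payload(
    texts: Sequence[str],
    is_query: Union[bool, Sequence[bool]],
    model: str = "lightonai/LateOn",
) -> dict:
    """Build the batched request body (hypothetical helper, client side only)."""
    if isinstance(texts, str):
        # A bare string is the legacy single-text shape, not a batch.
        raise TypeError("pass a list of texts for the batched shape")
    texts = list(texts)
    if not texts:
        raise ValueError("empty batches are rejected")
    if not isinstance(is_query, bool):
        is_query = list(is_query)
        if len(is_query) != len(texts):
            raise ValueError("is_query needs one entry per text")
    return {
        "data": {"text": texts, "is_query": is_query},
        "model": model,
        "task": "plugin",
    }


payload = build_pooling_payload(["q1", "q2", "q3"], is_query=True)
body = json.dumps(payload)  # JSON body for POST /pooling
```

Checking the length client-side is optional, since the server validates it as well; it just fails earlier.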

Implementation notes

  • ModernColBERTInput now carries a list of texts plus a per-text is_query list and a batched flag recording the request-level shape.
  • factory_parse detects single vs. batched input from the type of data['text'] and validates is_query length consistency.
  • factory_pre_process returns Sequence[TokensPrompt] for batches and a single TokensPrompt for the legacy path (preserving the upstream prompt-shape contract).
  • factory_post_process returns str for legacy single-text requests and list[str] when the request was batched (or when multiple outputs come back), keyed off the stashed batched flag rather than relying on len(model_output) alone — that way batches of size 1 still round-trip as list[str].
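
The parse and post-process rules above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `ParsedInput`, `parse` and `shape_response` are hypothetical stand-ins for `ModernColBERTInput`, `factory_parse` and the return-shape decision in `factory_post_process`.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class ParsedInput:
    """Hypothetical stand-in for ModernColBERTInput."""
    texts: list[str]
    is_query: list[bool]
    batched: bool  # request-level shape, independent of len(texts)


def parse(data: dict[str, Any]) -> ParsedInput:
    text = data["text"]
    batched = isinstance(text, list)  # shape is decided by the type of data['text']
    texts = text if batched else [text]
    if not texts:
        raise ValueError("empty text batch")
    if not all(isinstance(t, str) for t in texts):
        raise ValueError("text must be a string or a list of strings")
    flags = data["is_query"]
    if isinstance(flags, bool):
        flags = [flags] * len(texts)  # a single bool broadcasts
    elif not (isinstance(flags, list) and all(isinstance(f, bool) for f in flags)):
        raise ValueError("is_query must be a bool or a list of bools")
    if len(flags) != len(texts):
        raise ValueError("is_query length must match the number of texts")
    return ParsedInput(list(texts), list(flags), batched)


def shape_response(encoded: list[str], batched: bool) -> Union[str, list[str]]:
    # Keyed off the stashed flag so a batch of size 1 still returns a list.
    if batched or len(encoded) > 1:
        return encoded
    return encoded[0]
```

With this sketch, `{"text": ["abc"], "is_query": true}` yields a one-element list, while `{"text": "abc", "is_query": [true]}` yields a bare string.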

The [Q] / [D] prefix logic, max-length budgets (256 query / 8192 doc), pooling task name, and tokenizer settings are all unchanged.
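
Since `is_query` is now per text, each entry in a batch presumably resolves its own marker and length budget. A minimal sketch of that selection, using only the values quoted above; `PromptSettings`, `settings_for` and `plan_batch` are hypothetical names, and how the marker is actually attached to the prompt (literal text or token id) is not covered here.

```python
from typing import NamedTuple

QUERY_MAX_LEN = 256   # query budget quoted above
DOC_MAX_LEN = 8192    # document budget quoted above


class PromptSettings(NamedTuple):
    marker: str
    max_length: int


def settings_for(is_query: bool) -> PromptSettings:
    """Pick the marker and length budget for one text (hypothetical helper)."""
    if is_query:
        return PromptSettings("[Q]", QUERY_MAX_LEN)
    return PromptSettings("[D]", DOC_MAX_LEN)


def plan_batch(is_query: list[bool]) -> list[PromptSettings]:
    """Resolve settings per text, preserving input order."""
    return [settings_for(flag) for flag in is_query]
```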

Backward compatibility

| Caller shape | Before | After |
|---|---|---|
| {"text": "abc", "is_query": true} | str (b64) | str (b64) ✅ |
| {"text": "abc", "is_query": [true]} | error | str (b64) |
| {"text": ["abc"], "is_query": true} | error | ["..."] |
| {"text": ["a","b"], "is_query": [true,true]} | error | ["...","..."] |

The single-text path is fully preserved; the accepted request shapes are now a strict superset of the old ones.
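
Callers that may send either shape can normalize the result on their side. A minimal sketch, assuming the caller has already extracted the `str` or `list[str]` value from the HTTP response (where it sits in the response JSON is not shown here). `decode_embeddings` is a hypothetical helper, and it stops at raw bytes because the dtype and shape of the embeddings are not stated in this description.

```python
import base64
from typing import Union


def decode_embeddings(data: Union[str, list[str]]) -> list[bytes]:
    """Return one raw byte payload per input text, for either response shape."""
    items = [data] if isinstance(data, str) else list(data)
    return [base64.b64decode(item) for item in items]
```

Always returning a list keeps downstream code independent of whether the request was single-text or batched.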

Test plan

  • CI passes
  • Single-text request still returns a str response (regression check vs. existing client code)
  • Batched request with N texts returns list[str] of length N in input order
  • Mixed-type lists in text raise ValueError
  • Length-mismatched is_query list raises ValueError
  • Empty text batch raises ValueError
  • Bool is_query broadcasts across a list of texts

Lets a single ``POST /pooling`` call carry N texts instead of one,
fusing them into a single ``factory_pre_process`` step so vLLM's
continuous batcher can co-schedule them in one engine pass. For
high-fan-out callers (retrieval, late-interaction reranking,
groundedness scoring) that drops HTTP-side overhead from O(N) to
O(1) and lets the GPU stay saturated without client-side thread-pool
fan-out.

Wire format
-----------
The single-text shape is unchanged and byte-identical to the previous
release — same request body, same single-string base64 response — so
existing callers are not affected.

The new batched shape is opt-in:

    {"data": {"text": ["q1", "q2", ...],
              "is_query": [true, true, ...]},
     "model": "...", "task": "plugin"}

``is_query`` may also be a single bool that broadcasts across all
texts; mixed-type lists raise an explicit error so callers don't
silently mis-pack a batch. The response for batched requests is
``list[str]`` of base64-encoded multi-vector embeddings, in input
order. Empty batches raise.

Implementation
--------------
* ``ModernColBERTInput`` now carries a list of texts plus a
  per-text ``is_query`` list and a ``batched`` flag recording the
  request-level shape.
* ``factory_parse`` detects single vs. batched input from the type
  of ``data['text']`` and validates ``is_query`` length consistency.
* ``factory_pre_process`` returns ``Sequence[TokensPrompt]`` for
  batches and a single ``TokensPrompt`` for the legacy path
  (preserving the upstream prompt-shape contract).
* ``factory_post_process`` returns ``str`` for legacy single-text
  requests and ``list[str]`` when the request was batched (or when
  multiple outputs come back), keyed off the stashed ``batched``
  flag rather than relying on ``len(model_output)`` alone — that
  way batches of size 1 still round-trip as ``list[str]``.

The ``[Q]`` / ``[D]`` prefix logic, max-length budgets (256 query
/ 8192 doc), pooling task name, and tokenizer settings are all
unchanged.

Why
---
Surfaced while wiring ``moderncolbert_io`` into latence-trace's
groundedness scorer: a typical request produces ~30 chunks that all
need to be encoded by the same model, and N concurrent
HTTP-per-chunk calls were burning more time on syscalls and
serialization than on GPU work. Single-call batching collapses
that overhead and gives the engine batcher the freedom to fuse
prompts within a request as well as across them.

Pure formatting fixup so CI's
``ruff format --check forge/ plugins/ vllm_factory/`` job passes;
no behavioural change.