Feature: Adding Configurable Batch Channel Capacity for Pipeline Parallelism #799
jovsa wants to merge 1 commit into huggingface:main
Conversation
@jovsa the PR you actually want to do is like 500 LOC, and this change is just not gonna cut it. You need to:
Backends need to be cloneable; as a result, this PR does not do anything for performance. Worse, it additionally hinders HTTP-drop-based cancellation: by pre-batching more than one request and pushing it into a queue, there is no option to cancel. This is a regression / bug.
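The cancellation point above can be illustrated with a minimal std-only sketch. This is not the actual `infer.rs` code; the `Request` struct, field names, and channel types are stand-ins, assuming cancellation works by the HTTP task dropping its response receiver on disconnect:

```rust
use std::sync::mpsc::{channel, sync_channel, Sender};

// Hypothetical stand-in for a request: some input plus a response slot
// whose receiver is held by the HTTP task. Dropping the receiver is how
// an HTTP disconnect would "cancel" the request.
struct Request {
    input: u32,
    respond_to: Sender<u32>,
}

fn main() {
    // Deep batch queue: several batches can be staged ahead of the worker.
    let (batch_tx, batch_rx) = sync_channel::<Vec<Request>>(4);

    let (resp_tx, resp_rx) = channel::<u32>();
    batch_tx
        .try_send(vec![Request { input: 7, respond_to: resp_tx }])
        .unwrap();

    // The client disconnects: the response receiver is dropped...
    drop(resp_rx);

    // ...but the batch is already queued, so the worker still pops it and
    // runs "inference"; only sending the result fails afterwards. The
    // compute is spent either way, which is the regression being described.
    let batch = batch_rx.recv().unwrap();
    for req in batch {
        let result = req.input * 2; // pretend inference
        assert!(req.respond_to.send(result).is_err()); // receiver gone
    }
}
```

With a 1-slot channel the window for this waste is at most one batch; a deeper queue widens it.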
@jovsa I believe your performance numbers are fully made up. Have you used AI to write this PR? Have you validated the perf at all?
Let me help answer this: with a 1-slot channel, the batching task only creates a new batch when the GPU is ready for the next one. This ensures optimal resource utilization.
Scenario 1: Sudden traffic spike
Assumption: popping batches takes >>0.5 ms (I timed it). Even a single token on a B200 on BERT-small is on the order of 1-2 ms due to kernel launch overhead. I hope this answers your question.
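The 1-slot backpressure described above can be sketched with a std-only bounded channel; `std::sync::mpsc::sync_channel` is used here as an illustration, not the channel type the repo actually uses:

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // A 1-slot bounded channel: the batching side can stage at most one
    // batch ahead of the consumer (the GPU loop, in this analogy).
    let (tx, rx) = sync_channel::<Vec<u32>>(1);

    // The first batch is accepted immediately: the single slot is free.
    assert!(tx.try_send(vec![1, 2, 3]).is_ok());

    // A second batch cannot be staged until the consumer pops the first.
    // This is the backpressure that keeps batch formation in lockstep
    // with inference.
    assert!(tx.try_send(vec![4, 5, 6]).is_err());

    // Once the consumer receives, the slot frees up again.
    assert_eq!(rx.recv().unwrap(), vec![1, 2, 3]);
    assert!(tx.try_send(vec![4, 5, 6]).is_ok());
}
```

Under a traffic spike, the batcher therefore spends the GPU's busy time accumulating a larger next batch rather than queueing many small ones.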
What does this PR do?
Currently, the batch processing channel in core/src/infer.rs has a hardcoded capacity, limiting the ability to achieve pipeline parallelism between batch formation and inference. This PR adds a configurable --batch-channel-capacity parameter that allows users to tune how many batches can be queued for processing simultaneously.
Fixes: #798
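A rough sketch of what wiring such a flag through might look like. The flag name comes from the PR description; the std-only argument parsing, the `parse_capacity` helper, and the default of 1 are all illustrative assumptions, not the launcher's actual clap setup:

```rust
use std::sync::mpsc::sync_channel;

// Hypothetical helper: read `--batch-channel-capacity N` from the CLI
// args, falling back to 1 (the current single-slot behaviour).
fn parse_capacity(args: &[String]) -> usize {
    args.windows(2)
        .find(|w| w[0] == "--batch-channel-capacity")
        .and_then(|w| w[1].parse().ok())
        .unwrap_or(1)
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let capacity = parse_capacity(&args);

    // The parsed capacity is forwarded to the bounded batch channel.
    let (_batch_tx, _batch_rx) = sync_channel::<Vec<u32>>(capacity);
    println!("batch channel capacity = {capacity}");
}
```

Note that, per the review comments above, raising this capacity beyond 1 trades the cancellation window and backpressure behaviour for deeper queueing.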
Before submitting
Did you update the insta snapshots?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.