
Feature: Adding Configurable Batch Channel Capacity for Pipeline Parallelism#799

Draft

jovsa wants to merge 1 commit into huggingface:main from jovsa:config-batch-channel-capacity

Conversation

@jovsa commented Jan 8, 2026

What does this PR do?

Currently, the batch processing channel in core/src/infer.rs has a hardcoded capacity, limiting the ability to achieve pipeline parallelism between batch formation and inference.

This PR adds a configurable --batch-channel-capacity parameter that lets users tune how many batches can be queued for processing simultaneously.

Fixes: #798

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@michaelfeil (Contributor) commented

@jovsa the PR you actually want to do is like 500 LOC, and this change is just not gonna cut it.

You need to:

  • move all mover threads that are spawned and are synchronous
  • for that you need to clone(), but the Candle backend runs in pinned memory, so you can't clone it
  • multiple backends running at the same time will interfere with each other (two concurrent matmuls run at less than 50% of the speed); the reason for that is L2 eviction

Backends would need to be cloneable first; as a result, this PR does not do anything for performance.

However, it additionally hinders HTTP-drop-based cancellation. By pre-batching more than 1 request and pushing it into a queue, there is no option to cancel. This is a regression / bug.

@michaelfeil (Contributor) commented

@jovsa I believe your performance numbers are fully made up. Have you used AI to write this PR? Have you validated the perf at all?

@jovsa jovsa marked this pull request as draft January 14, 2026 09:01
@michaelfeil (Contributor) commented

Let me help answer this:

With a 1-slot channel: the batching task only creates a new batch when the GPU is ready for the next one. This ensures optimal resource utilization.
With larger capacity: we might create multiple batches while the GPU is still processing the first one, leading to:

  • Memory waste: multiple large batches allocated simultaneously
  • Stale data: batches might contain requests that have already timed out
  • Poor responsiveness: the system can't quickly adapt to changing load patterns

Scenario 1: Sudden traffic spike

  • Current: the backend processes the current batch, the batching task blocks, new requests queue efficiently
  • Larger channel: the batching task creates 2+ batches, and the GPU takes longer to adapt because it keeps running at a low batch size

Scenario 2: GPU memory pressure

  • Current: the backend slows down, and the batching task naturally throttles via channel backpressure
  • Larger channel: the batching task keeps creating batches, exacerbating the memory pressure

The "Prefetch" Design Intent 📝

The comment says "Bound channel to 1 to be able to prefetch one batch" - this is about allowing exactly one batch to be prepared while the backend is processing the previous one. It's not about having multiple batches ready simultaneously.

Assumption: popping batches takes >>0.5 ms (I timed it). Even a single token on a B200 with a small BERT model is on the order of 1-2 ms due to kernel launch overhead.

I hope this answers your question.
