Feature: Adding Configurable Batch Channel Capacity for Pipeline Parallelism #799
jovsa wants to merge 1 commit into huggingface:main
Conversation
@jovsa the PR you actually want to do is like 500 LOC, and this change is just not gonna cut it. You need to:
Backends need to be cloneable; as a result, this PR does not do anything for performance. Worse, it additionally hinders HTTP-drop-based cancellation: by pre-batching more than one request and pushing it into a queue, there is no option to cancel. This is a regression / bug.
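The cancellation point above can be illustrated with a minimal std-only sketch. This is not the actual `infer.rs` code; the `Request` struct, field names, and channel types are stand-ins, assuming cancellation works by the HTTP task dropping its response receiver on disconnect:

```rust
use std::sync::mpsc::{channel, sync_channel, Sender};

// Hypothetical stand-in for a request: some input plus a response slot
// whose receiver is held by the HTTP task. Dropping the receiver is how
// an HTTP disconnect would "cancel" the request.
struct Request {
    input: u32,
    respond_to: Sender<u32>,
}

fn main() {
    // Deep batch queue: several batches can be staged ahead of the worker.
    let (batch_tx, batch_rx) = sync_channel::<Vec<Request>>(4);

    let (resp_tx, resp_rx) = channel::<u32>();
    batch_tx
        .try_send(vec![Request { input: 7, respond_to: resp_tx }])
        .unwrap();

    // The client disconnects: the response receiver is dropped...
    drop(resp_rx);

    // ...but the batch is already queued, so the worker still pops it and
    // runs "inference"; only sending the result fails afterwards. The
    // compute is spent either way, which is the regression being described.
    let batch = batch_rx.recv().unwrap();
    for req in batch {
        let result = req.input * 2; // pretend inference
        assert!(req.respond_to.send(result).is_err()); // receiver gone
    }
}
```

With a 1-slot channel the window for this waste is at most one batch; a deeper queue widens it.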
@jovsa I believe your performance numbers are fully made up. Have you used AI to write this PR? Have you validated the perf at all?
Let me help answer this: with a 1-slot channel, the batching task only creates a new batch when the GPU is ready for the next one. This ensures optimal resource utilization.
Scenario 1: Sudden traffic spike
Assumption: popping batches takes >>0.5 ms (I timed it). Even a single token on a B200 on BERT-small is on the order of 1-2 ms due to kernel launch overhead. I hope this answers your question.
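The 1-slot backpressure described above can be sketched with a std-only bounded channel; `std::sync::mpsc::sync_channel` is used here as an illustration, not the channel type the repo actually uses:

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // A 1-slot bounded channel: the batching side can stage at most one
    // batch ahead of the consumer (the GPU loop, in this analogy).
    let (tx, rx) = sync_channel::<Vec<u32>>(1);

    // The first batch is accepted immediately: the single slot is free.
    assert!(tx.try_send(vec![1, 2, 3]).is_ok());

    // A second batch cannot be staged until the consumer pops the first.
    // This is the backpressure that keeps batch formation in lockstep
    // with inference.
    assert!(tx.try_send(vec![4, 5, 6]).is_err());

    // Once the consumer receives, the slot frees up again.
    assert_eq!(rx.recv().unwrap(), vec![1, 2, 3]);
    assert!(tx.try_send(vec![4, 5, 6]).is_ok());
}
```

Under a traffic spike, the batcher therefore spends the GPU's busy time accumulating a larger next batch rather than queueing many small ones.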
What does this PR do?
Currently, the batch processing channel in core/src/infer.rs has a hardcoded capacity, limiting the ability to achieve pipeline parallelism between batch formation and inference. This PR adds a configurable --batch-channel-capacity parameter that allows users to tune how many batches can be queued for processing simultaneously.
Fixes: #798
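A rough sketch of what wiring such a flag through might look like. The flag name comes from the PR description; the std-only argument parsing, the `parse_capacity` helper, and the default of 1 are all illustrative assumptions, not the launcher's actual clap setup:

```rust
use std::sync::mpsc::sync_channel;

// Hypothetical helper: read `--batch-channel-capacity N` from the CLI
// args, falling back to 1 (the current single-slot behaviour).
fn parse_capacity(args: &[String]) -> usize {
    args.windows(2)
        .find(|w| w[0] == "--batch-channel-capacity")
        .and_then(|w| w[1].parse().ok())
        .unwrap_or(1)
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let capacity = parse_capacity(&args);

    // The parsed capacity is forwarded to the bounded batch channel.
    let (_batch_tx, _batch_rx) = sync_channel::<Vec<u32>>(capacity);
    println!("batch channel capacity = {capacity}");
}
```

Note that, per the review comments above, raising this capacity beyond 1 trades the cancellation window and backpressure behaviour for deeper queueing.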
Before submitting
Did you update the insta snapshots?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.