Fix pad_input_tensors to pad batch to a multiple of num_processes by iamsharduld · Pull Request #4093 · huggingface/accelerate

iamsharduld · 2026-06-29T12:44:51Z

What

pad_input_tensors (_pad_input_tensors) computes the padding as:

to_pad = num_processes - (batch_size // num_processes)

which is unrelated to the padding needed to make the batch splittable across processes. It:

pads batches that are already divisible (e.g. batch_size=8, num_processes=4 → 10), and
produces sizes that are frequently not a multiple of num_processes (e.g. 6 → 9, 12 → 13,
10,np=5 → 13), so the padded batch still can't be chunked evenly.

inference.py::pippy_forward relies on this to split a batch into num_chunks for ScheduleGPipe,
so a non-divisible result breaks even micro-batching. The existing test_slice_and_concatenate cases
pass only because each (batch_size, num_processes) pair happens to coincide with the correct value.

from accelerate.utils.operations import pad_input_tensors
import torch
pad_input_tensors(torch.rand(8,4), 8, 4).shape   # [10,4]  -- should be [8,4]
pad_input_tensors(torch.rand(6,4), 6, 4).shape   # [9,4]   -- should be [8,4]

Fix

Pad dim-0 up to the next multiple of num_processes (0 when already divisible):

to_pad = (num_processes - batch_size % num_processes) % num_processes

This satisfies the docstring example ([3,4,4], np=4 → [4,4,4]), all 8 existing assertions, and
always yields a size >= batch_size divisible by num_processes.

Tests

tests/test_utils.py::test_pad_input_tensors_divisibility — checks 8→8, 6→8, and the invariant
result % num_processes == 0 and result >= batch_size across num_processes∈[1,8], batch_size∈[1,29]
(129/232 cases were non-divisible before). Existing test_slice_and_concatenate still passes. CPU-only.

_pad_input_tensors used to_pad = num_processes - (batch_size // num_processes), which is unrelated to the padding needed for even splitting: it padded batches that are already divisible (8 -> 10 for num_processes=4) and produced sizes that are frequently NOT a multiple of num_processes (e.g. 6 -> 9, 12 -> 13), so the padded batch still can't be chunked evenly (pippy_forward / ScheduleGPipe rely on this). The existing tests passed only because each (batch_size, num_processes) pair happened to coincide with the correct value. Use the standard pad-to-next-multiple formula: to_pad = (num_processes - batch_size % num_processes) % num_processes which is 0 when already divisible and always yields a multiple of num_processes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix pad_input_tensors to pad batch to a multiple of num_processes#4093

Fix pad_input_tensors to pad batch to a multiple of num_processes#4093
iamsharduld wants to merge 1 commit into
huggingface:mainfrom
iamsharduld:pad-input-tensors-divisibility

iamsharduld commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

iamsharduld commented Jun 29, 2026

What

Fix

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant