Skip to content

Fix pad_input_tensors to pad batch to a multiple of num_processes#4093

Open
iamsharduld wants to merge 1 commit into
huggingface:mainfrom
iamsharduld:pad-input-tensors-divisibility
Open

Fix pad_input_tensors to pad batch to a multiple of num_processes#4093
iamsharduld wants to merge 1 commit into
huggingface:mainfrom
iamsharduld:pad-input-tensors-divisibility

Conversation

@iamsharduld

Copy link
Copy Markdown

What

pad_input_tensors (_pad_input_tensors) computes the padding as:

to_pad = num_processes - (batch_size // num_processes)

which is unrelated to the padding needed to make the batch splittable across processes. It:

  • pads batches that are already divisible (e.g. batch_size=8, num_processes=410), and
  • produces sizes that are frequently not a multiple of num_processes (e.g. 6 → 9, 12 → 13,
    10,np=5 → 13), so the padded batch still can't be chunked evenly.

inference.py::pippy_forward relies on this to split a batch into num_chunks for ScheduleGPipe,
so a non-divisible result breaks even micro-batching. The existing test_slice_and_concatenate cases
pass only because each (batch_size, num_processes) pair happens to coincide with the correct value.

from accelerate.utils.operations import pad_input_tensors
import torch
pad_input_tensors(torch.rand(8,4), 8, 4).shape   # [10,4]  -- should be [8,4]
pad_input_tensors(torch.rand(6,4), 6, 4).shape   # [9,4]   -- should be [8,4]

Fix

Pad dim-0 up to the next multiple of num_processes (0 when already divisible):

to_pad = (num_processes - batch_size % num_processes) % num_processes

This satisfies the docstring example ([3,4,4], np=4 → [4,4,4]), all 8 existing assertions, and
always yields a size >= batch_size divisible by num_processes.

Tests

tests/test_utils.py::test_pad_input_tensors_divisibility — checks 8→8, 6→8, and the invariant
result % num_processes == 0 and result >= batch_size across num_processes∈[1,8], batch_size∈[1,29]
(129/232 cases were non-divisible before). Existing test_slice_and_concatenate still passes. CPU-only.

_pad_input_tensors used to_pad = num_processes - (batch_size // num_processes),
which is unrelated to the padding needed for even splitting: it padded batches
that are already divisible (8 -> 10 for num_processes=4) and produced sizes that
are frequently NOT a multiple of num_processes (e.g. 6 -> 9, 12 -> 13), so the
padded batch still can't be chunked evenly (pippy_forward / ScheduleGPipe rely
on this). The existing tests passed only because each (batch_size, num_processes)
pair happened to coincide with the correct value.

Use the standard pad-to-next-multiple formula:
  to_pad = (num_processes - batch_size % num_processes) % num_processes
which is 0 when already divisible and always yields a multiple of num_processes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant