Fix warmup LR being zero at step 0#1379

Open
Robby955 wants to merge 1 commit into EleutherAI:main from Robby955:fix/warmup-lr-step-zero
Conversation

@Robby955 Robby955 commented Apr 8, 2026

Summary

Fixes #1373

The learning rate warmup formula in megatron/learning_rates.py produces LR=0 at step 0, causing the first training step to be a complete no-op. The step-1 checkpoint is identical to the step-0 checkpoint.

The Bug

The warmup formula on line 70 is:

```python
return float(self.start_lr) * num_iters_ / self.warmup_iter
```

At step 0, num_iters_ is 0, so:

LR = start_lr * 0 / warmup_iter = 0

This means the gradient update at step 0 is multiplied by zero -- the model parameters don't change at all.
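The no-op can be seen directly with a standalone sketch of the formula (the function name is illustrative; in the repo this is a method of the LR scheduler class in megatron/learning_rates.py):

```python
# Sketch of the buggy linear warmup; `warmup_lr_buggy` is a
# hypothetical standalone version of the scheduler method.
def warmup_lr_buggy(start_lr: float, num_iters: int, warmup_iter: int) -> float:
    # Linear warmup: LR scales with the current iteration count.
    return float(start_lr) * num_iters / warmup_iter

# At step 0 the multiplier is 0, so the LR -- and therefore the
# parameter update -- is exactly zero.
print(warmup_lr_buggy(1e-3, 0, 1000))  # 0.0
print(warmup_lr_buggy(1e-3, 1, 1000))  # small nonzero value (~1e-6)
```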

The Fix

```python
# Before (step 0 → LR = 0):
return float(self.start_lr) * num_iters_ / self.warmup_iter

# After (step 0 → LR = start_lr / warmup_iter, same as step 1):
return float(self.start_lr) * max(1, num_iters_) / self.warmup_iter
```

Using max(1, num_iters_) ensures step 0 gets the same small nonzero LR that step 1 would have received (start_lr / warmup_iter), rather than zero. This is the minimal fix -- it avoids introducing new parameters or changing the config API.
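As a standalone sketch (hypothetical function name; the real code is a scheduler method in megatron/learning_rates.py), the clamp makes step 0 and step 1 agree:

```python
# Sketch of the patched warmup formula.
def warmup_lr(start_lr: float, num_iters: int, warmup_iter: int) -> float:
    # max(1, num_iters) clamps the step-0 multiplier to 1, so the
    # first update uses start_lr / warmup_iter instead of 0.
    return float(start_lr) * max(1, num_iters) / warmup_iter

# Step 0 now gets the same small nonzero LR as step 1.
assert warmup_lr(1e-3, 0, 1000) == warmup_lr(1e-3, 1, 1000)
```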

Before/After behavior (warmup_iter=1000, start_lr=1e-3):

| Step | Before (LR) | After (LR) |
|------|-------------|------------|
| 0    | 0.0         | 1e-6       |
| 1    | 1e-6        | 1e-6       |
| 2    | 2e-6        | 2e-6       |
| ...  | ...         | ...        |
| 1000 | 1e-3        | 1e-3       |

Steps 1+ are completely unchanged. Only step 0 is affected.
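The "only step 0 changes" claim can be checked exhaustively over the whole warmup window (illustrative standalone functions, not the repo code):

```python
# Compare the old and new formulas at every step of the warmup
# window (warmup_iter=1000, start_lr=1e-3) and collect the steps
# where they disagree.
def lr_before(step, start_lr=1e-3, warmup_iter=1000):
    return float(start_lr) * step / warmup_iter

def lr_after(step, start_lr=1e-3, warmup_iter=1000):
    return float(start_lr) * max(1, step) / warmup_iter

diff_steps = [s for s in range(1001) if lr_before(s) != lr_after(s)]
print(diff_steps)  # [0]
```

For every step >= 1, max(1, step) == step, so the two expressions are bit-identical; only step 0 differs.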

Impact

This bug affects all models trained with gpt-neox using LR warmup, including all Pythia models on the HuggingFace Hub (as noted by @StellaAthena in the issue). In every case, the step-1 checkpoint is identical to the step-0 checkpoint because the first training step does nothing.


Successfully merging this pull request may close these issues.

step1 models are always the same as step0 models