Custom setup for training Llama. Can't afford to use the Hugging Face format, since I need raw PyTorch flexibility for future endeavors with these weights.
Trying to make this as flexible and simple as possible for reuse.
Taking advantage of the parallelism, I sync gradients across devices as follows in traininglib/gradient_updates.py:
- Initialize the divisor to 2
- If a device's index is not divisible by the divisor (and it still holds a live partial sum), send its gradients to the device at the index rounded down to the nearest multiple of the divisor, i.e. `(device_num // divisor) * divisor`
- The receiving device adds the two gradients
- Double the divisor and repeat until only one gradient remains (on device 0)
- Average it, and broadcast the result back to every device
Note that this accumulation takes log2(num_gpus) iterations rather than num_gpus - 1, which matters because GPU-to-GPU transfers are expensive.
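A minimal sketch of the scheme above, not the actual traininglib implementation: gradients are stand-in floats rather than tensors, the "send" is an in-place add on a Python list, and the device count is assumed to be a power of two. The condition `idx % divisor == divisor // 2` selects exactly the devices that still hold a live partial sum at each level, so nothing is double-counted.

```python
def tree_reduce_average(grads):
    """Average per-device gradients (stand-in floats here) via a binary
    tree reduction. Returns the per-device averages and the number of
    reduction iterations performed. Assumes len(grads) is a power of two."""
    n = len(grads)
    iterations = 0
    divisor = 2
    while divisor <= n:
        for idx in range(n):
            # Only devices holding a live partial sum at this level send;
            # the destination is idx rounded down to a multiple of divisor.
            if idx % divisor == divisor // 2:
                dest = (idx // divisor) * divisor
                grads[dest] += grads[idx]  # receiver adds the two gradients
        divisor *= 2
        iterations += 1
    avg = grads[0] / n  # device 0 now holds the full sum
    return [avg] * n, iterations  # broadcast the average back to all devices
```

For 8 devices this performs 3 iterations (log2 of 8), matching the complexity note above.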