Double-buffer GPU activations for overlapping H2D copy with backward compute #534

ichbinhandsome wants to merge 10 commits into unslothai:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request optimizes gradient checkpointing in Unsloth by addressing a performance bottleneck where H2D memory transfers for reloaded activations could not overlap with GPU computation. By introducing a second GPU buffer and managing its allocation and reuse, the system can now perform data transfers concurrently with compute operations, leading to a more efficient backward pass and faster fine-tuning.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b38af64377
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Code Review
The pull request introduces a valuable double-buffering optimization for GPU activations, overlapping H2D memory copies with backward computation for a significant performance enhancement. However, a high-severity issue has been identified: the current implementation contains a race condition. The code fails to synchronize the copy stream with the compute stream when reusing buffers, which can lead to data corruption, since a new copy may overwrite a buffer the GPU is still reading. Proper synchronization is recommended. There are also opportunities to improve error handling by catching more specific exceptions, and to refine logging to prevent excessive output.
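The synchronization the review calls for can be modeled schematically: give each buffer its own "free" event that the compute side signals after its last read, and make the copy side wait on that event before overwriting the buffer. The sketch below models only this ordering constraint in pure Python, with `threading.Event` standing in for CUDA events (in real code these would be `torch.cuda.Event` objects recorded on the compute stream and waited on by the copy stream); all names are illustrative, not the PR's actual code.

```python
import threading

class DoubleBuffer:
    """Schematic model of per-buffer synchronization between a copy side
    and a compute side. Each buffer slot has a 'free' event: compute sets
    it after its last read, and the copy side must wait on it before
    overwriting that slot with the next activation."""

    def __init__(self):
        self.buffers = [None, None]                      # buffer A and buffer B
        self.free = [threading.Event(), threading.Event()]
        for event in self.free:
            event.set()                                  # both slots start out free

    def copy_in(self, layer_idx, activation):
        slot = layer_idx % 2                             # parity-based slot selection
        self.free[slot].wait()                           # block until compute released it
        self.free[slot].clear()                          # mark slot as in use
        self.buffers[slot] = activation                  # the "H2D copy"
        return slot

    def compute_done(self, slot):
        self.free[slot].set()                            # compute finished reading slot
```

Without the `wait()`, a copy targeting a slot the compute side is still reading would corrupt it; with it, the copy for layer N+1 can still proceed into the *other* slot while layer N computes, which is the whole point of the second buffer.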
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b359e1856e
/gemini review
Code Review
This pull request introduces a double-buffering optimization for GPU activations during the backward pass, which is a solid approach to improve performance by overlapping memory copies with computation. The implementation correctly uses CUDA events for fine-grained synchronization. My review includes a few suggestions to enhance code quality and robustness, such as replacing magic numbers with predefined constants, avoiding bare except clauses, and refactoring some conditional logic for better conciseness.
/gemini review
Code Review
This pull request introduces a double-buffering optimization for GPU activations to overlap H2D memory copies with backward computation, which is a solid performance enhancement. The implementation correctly uses a second GPU buffer and CUDA events for synchronization. The fallback mechanism that disables double buffering on OOM errors is also well handled. My feedback focuses on improving the user-facing print statements to make them clearer and less repetitive, providing better feedback during initialization and in case of runtime errors.
Hi @ichbinhandsome, thanks for the PR! I made a commit related to freeing up some dangling GPU refs. Please take a look when you get a chance.
Thanks @mmathew23 for the review! These fixes look good to me. I also addressed a small global variable scope issue here: 836af30. Besides that, I have a few questions regarding @danielhanchen's commits:
Agreed on removing the print. One clarification on the buffer-resizing behavior, though: I believe the resize happens at most once per training step, not once per layer. Since Unsloth offloads each layer's activations from GPU to CPU, and all layers produce activations of the same size, the buffer only needs to resize from its small initial size on the very first layer; after that, every subsequent layer reuses the same buffer. Across different training steps the buffer size can differ, because Unsloth uses packing (padding-free), so the size of the activations may vary per batch. But since the buffer only grows (it resizes only when the new size exceeds the current capacity), the resize fires at most once per step.

So the print was indeed noisy, but the reason is likely that it fired on the first layer of every step (before the buffer grew to full size in the very first steps), not that it fired on every layer. Is my understanding correct?
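The grow-only behavior described above can be reduced to a tiny policy sketch (illustrative names, not the actual Unsloth code): because all layers within a step produce same-sized activations, at most one resize can fire per step.

```python
def maybe_resize(capacity, new_size):
    """Grow-only buffer policy: resize only when the incoming activation
    is larger than the current capacity (illustrative sketch)."""
    if new_size > capacity:
        return new_size, True    # buffer grew; a resize print would fire here
    return capacity, False       # existing buffer is reused as-is

def count_resizes_per_step(capacity, steps):
    """For each step (a list of per-layer activation sizes),
    count how many times the buffer actually resized."""
    counts = []
    for layer_sizes in steps:
        resizes = 0
        for size in layer_sizes:
            capacity, resized = maybe_resize(capacity, size)
            resizes += resized
        counts.append(resizes)
    return capacity, counts
```

For example, with steps of per-layer sizes `[100, 100, 100]`, `[80, 80, 80]`, `[120, 120, 120]`, only the first layer of the first and third steps triggers a resize, matching the "at most once per step" claim.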
In the current code, when OOM disables double buffering, it sets both …
Yes, thanks for flagging. I guess we can keep it for safety :)
Hi @mmathew23 @danielhanchen, I just got back from vacation and wanted to quickly check on the status of this PR. Have you also observed any performance improvements with this PR?
danielhanchen left a comment
Thank you for the PR! The goal of this PR is to add double-buffering to Unsloth's gradient checkpointing activation reloading, alternating between two GPU buffers (A and B) during the backward pass to overlap H2D copies with gradient computation. As a summary, this PR allocates a second GPU buffer (B), per-buffer CUDA events, and uses CPU_INDEX parity to select which buffer to copy into during backward, while adding OOM fallback, cleanup, and reset logic for the new state.
Benchmark results (Llama-3.2-1B-Instruct, 4-bit LoRA, 61 steps, B200 GPU):
| Method | Tokens/s | Peak Memory | Train Loss |
|---|---|---|---|
| Baseline (main) | 22,410 | 1.528 GB | 1.3790 |
| PR (double-buf ON) | 24,527 | 1.532 GB | 1.3790 |
| PR (double-buf OFF) | 24,436 | 1.530 GB | 1.3790 |
Double-buffer ON vs OFF: +0.4% (within noise). Losses and grad-norms are numerically equivalent across all modes (max delta 0.0035). The double-buffering mechanism is numerically correct and the core idea is sound. The OOM fallback and cleanup paths are safe.
| Reviewers | Severity | Finding |
|---|---|---|
| 6/8 | High | Buffer slot selection uses global CPU_INDEX parity, not per-device state -- breaks multi-GPU/device_map |
| 6/8 | High | torch.cuda.Event() hard-coded; breaks/silently disables on XPU devices |
| 4/8 | Medium | Broad except Exception on buffer-B resize swallows non-OOM errors |
| 2/8 | Low | OOM permanently disables double buffering with no recovery path |
Concrete suggestions for each finding below.
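For the first finding, one possible fix (sketched here in plain Python with hypothetical names) is to key the alternation counter, and in real code the buffers and events too, by device index instead of using one global CPU_INDEX, so each device alternates its own A/B slots independently:

```python
class PerDeviceBufferState:
    """Hypothetical per-device state: each device gets its own layer counter
    (and, in the real code, its own pair of GPU buffers and events), so
    buffer-slot parity is never shared across devices under multi-GPU /
    device_map setups."""

    def __init__(self):
        self._counters = {}          # device index -> per-device layer counter

    def next_slot(self, device_index):
        count = self._counters.get(device_index, 0)
        self._counters[device_index] = count + 1
        return count % 2             # alternate between buffer A (0) and B (1)
```

With one global counter, interleaved offloads from two devices would both advance the same parity, so a device could be handed the slot its own compute stream is still reading; per-device counters restore strict A/B alternation on each device.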
Thanks for your feedback! @danielhanchen
The results are a bit surprising, as there doesn't seem to be any performance gain from double buffering. Could you try benchmarking with a larger model and larger activations (e.g., higher batch size or sequence length)? These cases involve more H2D copy time with Unsloth checkpointing in the backward pass and may better highlight any potential benefits. If we still don't observe a performance improvement, do you think it's worth continuing with this PR?
Benchmark results for this PR on an RTX 5080 GPU:
Increasing either the batch size or sequence length leads to better performance, as larger workloads incur more activation-reload overhead (H2D memory transfers) during the backward pass. I've also addressed all the issues you raised; please let me know if you have any other concerns. @danielhanchen
From my analysis, the memory overhead seems to scale with sequence length, so we should perhaps disable this for low-end GPUs with, say, <16GB of memory. I do see that there are OOM checks for creating the buffers, but they do not guard against OOMs that happen later, right? Also, I did notice roughly 5-10% performance gains for an 8B-sized model.
Thanks for your review! @Datta0

Good catch on the OOM guards. I've addressed this in the latest commit; OOM during buffer resize is now handled carefully:
The current approach enables double buffering optimistically and always prioritizes the original single-buffer strategy. Please let me know if you have any other concerns :)
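The optimistic-enable pattern described above can be sketched as follows (illustrative names only: `allocate` stands in for the real GPU allocation, e.g. an empty-tensor allocation on the device, and `MemoryError` stands in for a CUDA OOM):

```python
def try_enable_double_buffering(allocate, nbytes):
    """Try to allocate the second buffer (B); on OOM, fall back to the
    original single-buffer strategy and keep training (illustrative sketch,
    not the PR's actual code)."""
    try:
        buffer_b = allocate(nbytes)      # would be a device allocation in real code
        return buffer_b, True            # double buffering enabled
    except MemoryError:
        return None, False               # single buffer remains the default path
```

The key property is that an allocation failure for buffer B never aborts training; it only forfeits the overlap optimization.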
I'm just thinking of an extreme-case workload: say, on a 16GB machine, our old setup takes ~15GB. We do proactively allocate the second buffer during the forward pass, but during the backward pass (which I presume consumes more memory than forward), the process OOMs. One other thing I couldn't benchmark: how does adding a second buffer affect forward/backward performance in cases where memory pressure is already high?
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18db07a857
Makes sense. I've added a few safety layers to handle this:

Double buffering is now only enabled when CUDA free memory is above 512MB after the first training step. Under high memory pressure it stays disabled, so there's zero performance impact. Single buffer is always the default; double buffering is only activated as a bonus when there's enough VRAM headroom. Feel free to share any suggestions or concerns!
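The headroom check described above reduces to a simple gate. In real code the free-byte count would come from a query such as `torch.cuda.mem_get_info()`; the threshold is the 512MB figure from the comment above, and all other names are illustrative:

```python
HEADROOM_BYTES = 512 * 1024 * 1024   # 512MB threshold from the comment above

def should_enable_double_buffering(free_bytes):
    """Enable the second buffer only when free VRAM exceeds the headroom
    threshold after the first training step (illustrative sketch)."""
    return free_bytes > HEADROOM_BYTES
```

Under memory pressure the gate simply never opens, so the second buffer is never allocated and the single-buffer path runs unchanged.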
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51b878de77
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 135d9dcdc5
This PR implements a double-buffering optimization for activation reloading during the backward pass of gradient checkpointing with `use_gradient_checkpointing="unsloth"`. This means any model fine-tuned with Unsloth's gradient checkpointing will benefit from this PR.

Problem: With the current single-buffer implementation, the H2D memory copy for reloading offloaded activations during backward cannot overlap with GPU computation. The copy stream must wait for the compute stream to finish using the shared GPU buffer before writing the next activation into it, creating GPU idle bubbles.

Solution: Allocate a second GPU buffer (`GPU_BUFFERS_B`) and alternate between the two buffers based on layer index (`CPU_INDEX % 2`), with CUDA events to prevent race conditions. While the compute stream operates on buffer A, the copy stream can simultaneously load the next activation into buffer B, and vice versa, fully overlapping H2D transfers with backward computation.
Here are the nsys profiling traces:
Performance:

This method doesn't cost much memory, as it only adds one extra buffer's worth of VRAM; e.g., in gpt-oss-20B fine-tuning, one buffer takes around 47.18MB with batch size 8.

This PR also implements a smart enabling strategy for double buffering: if there is enough VRAM headroom, double buffering is enabled; otherwise it stays disabled.

In my experiments, this achieves approximately a 10% speedup per backward pass during gpt-oss-20B fine-tuning, though results may vary across models, hyperparameters (e.g., seq_length, batch size), and hardware configurations.
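The alternation described in the Solution section can be sketched as a schedule: while gradients for layer i are computed from buffer `i % 2`, the H2D prefetch for layer i-1 targets the other buffer. A pure-Python sketch of that schedule (illustrative, not the actual implementation):

```python
def backward_schedule(num_layers):
    """For a backward pass over layers in reverse order, return
    (layer, compute_slot, prefetch_slot) tuples: compute reads one
    buffer while the prefetch for the next-needed layer fills the
    other (illustrative sketch of the overlap pattern)."""
    schedule = []
    for layer in range(num_layers - 1, -1, -1):
        compute_slot = layer % 2                              # buffer being read
        prefetch_slot = (layer - 1) % 2 if layer > 0 else None  # buffer being filled
        schedule.append((layer, compute_slot, prefetch_slot))
    return schedule
```

By construction, the compute slot and prefetch slot always differ at every step, which is exactly what allows the H2D copy to proceed concurrently with the gradient computation instead of serializing behind it.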