
Double-buffer GPU activations for overlapping H2D copy with backward compute #534

Open
ichbinhandsome wants to merge 10 commits into unslothai:main from ichbinhandsome:double_buffers_backward

Conversation

@ichbinhandsome
Contributor

@ichbinhandsome ichbinhandsome commented Mar 6, 2026

This PR implements a double-buffering optimization for activation reloading during the backward pass of gradient checkpointing with use_gradient_checkpointing="unsloth". This means any model fine-tuned with Unsloth's gradient checkpointing will benefit from this PR.

Problem: With the current single-buffer implementation, the H2D memory copy for reloading offloaded activations during backward cannot overlap with GPU computation. The copy stream must wait for the compute stream to finish using the shared GPU buffer before writing the next activation into it, creating GPU idle bubbles.

Solution: Allocate a second GPU buffer (GPU_BUFFERS_B) and alternate between the two buffers based on layer index (CPU_INDEX % 2), using CUDA events to prevent race conditions. While the compute stream operates on buffer A, the copy stream can simultaneously load the next activation into buffer B, and vice versa — fully overlapping H2D transfers with backward computation.

e.g.

Layer 23 backward:
  1. H2D copy into buffer A (EXTRA_STREAM)
  2. MAIN_STREAM.wait_stream(EXTRA_STREAM)  // wait for copy
  3. Compute on buffer A                   
Layer 22 backward:
  1. H2D copy into buffer B (EXTRA_STREAM)  // overlaps with step 3 above
  2. MAIN_STREAM.wait_stream(EXTRA_STREAM)
  3. Compute on buffer B                    
Layer 21 backward:
  1. H2D copy into buffer A (EXTRA_STREAM)  // Layer 23's compute already done
  2. MAIN_STREAM.wait_stream(EXTRA_STREAM)
  3. Compute on buffer A
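The alternation above can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the PR's actual code: the names `reload_activations`, `copy_stream`, `buffers`, and `done` are hypothetical stand-ins for EXTRA_STREAM, GPU_BUFFERS/GPU_BUFFERS_B, and the per-buffer CUDA events, and it assumes same-shaped activations (the real code resizes) and ideally pinned host memory for a truly asynchronous copy:

```python
import torch

def reload_activations(cpu_tensors):
    """Yield a GPU copy of each offloaded activation, alternating two
    buffers so the H2D copy for one layer can overlap compute on another."""
    if not torch.cuda.is_available():
        # No streams to overlap without a GPU; just pass tensors through.
        yield from cpu_tensors
        return
    copy_stream = torch.cuda.Stream()          # plays the role of EXTRA_STREAM
    main = torch.cuda.current_stream()         # plays the role of MAIN_STREAM
    buffers = [None, None]                     # buffer A / buffer B
    done = [torch.cuda.Event(), torch.cuda.Event()]  # "compute finished" per slot
    for i, cpu_t in enumerate(cpu_tensors):    # backward order in the real code
        slot = i % 2                           # CPU_INDEX % 2 parity
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(done[slot])  # don't overwrite a live buffer
            if buffers[slot] is None:
                buffers[slot] = torch.empty_like(cpu_t, device="cuda")
            buffers[slot].copy_(cpu_t, non_blocking=True)
        main.wait_stream(copy_stream)          # compute waits for this copy only
        yield buffers[slot]                    # caller runs backward on main stream
        done[slot].record(main)                # slot reusable once compute is done
```

Because `main` only waits on the copy for the buffer it is about to read, the copy into the other slot is free to run concurrently with compute, which is the overlap shown in the layer-23/22/21 trace above.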

Here are the nsys profiling traces:

  • Without this PR: (nsys trace screenshot)
  • With this PR: (nsys trace screenshot)

Performance:

This method adds little memory overhead, as it only requires one extra buffer in VRAM. For example, in gpt-oss-20B fine-tuning with batch size 8, one buffer takes around 47.18 MB.
This PR also implements a smart enabling strategy: double-buffering is enabled only if there is enough free VRAM, and disabled otherwise.
In my experiments, this achieves approximately a 10% speedup per backward pass during GPT-OSS-20B fine-tuning, though results may vary across models, hyperparameters (e.g. sequence length, batch size), and hardware configurations.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the gradient checkpointing process in Unsloth by addressing a performance bottleneck where H2D memory transfers for reloaded activations could not overlap with GPU computation. By introducing a second GPU buffer and intelligently managing its allocation and usage, the system can now perform data transfers concurrently with compute operations, leading to a more efficient backward pass and faster fine-tuning times.

Highlights

  • Double-Buffering Implementation: Introduced a double-buffering mechanism for GPU activations during the backward pass of gradient checkpointing, utilizing a second set of GPU buffers (GPU_BUFFERS_B).
  • H2D Copy Overlap: Enabled overlapping of Host-to-Device (H2D) memory copies with GPU computation by alternating between two GPU buffers, significantly reducing GPU idle time.
  • Dynamic Double-Buffering: Implemented a smart strategy to dynamically enable double-buffering if sufficient VRAM is available, and gracefully disable it if memory allocation fails.
  • Performance Improvement: Achieved an approximate 10% speedup per backward pass during GPT-OSS-20B fine-tuning, demonstrating a notable performance gain.
Changelog
  • unsloth_zoo/gradient_checkpointing.py
    • Introduced global variables GPU_BUFFERS_B and USE_DOUBLE_BUFFER to manage the second GPU buffer and its activation status.
    • Implemented conditional allocation of GPU_BUFFERS_B during initialization, enabling double-buffering if memory permits, otherwise falling back to single-buffering.
    • Modified the forward pass to dynamically resize GPU_BUFFERS_B and disable double-buffering if an out-of-memory error occurs during resizing.
    • Updated the backward pass to alternate between GPU_BUFFERS and GPU_BUFFERS_B for H2D copies, based on the layer index, when double-buffering is active.
    • Removed the explicit stream synchronization (EXTRA_STREAM.wait_stream(MAIN_STREAM)) in the backward pass when double-buffering is enabled, allowing H2D copies to overlap with compute.
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b38af64377

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a valuable double-buffering optimization for GPU activations, effectively overlapping H2D memory copies with backward computation for a significant performance enhancement. However, a high-severity finding has been identified: the current implementation contains a race condition. The code fails to synchronize the copy stream with the compute stream when reusing buffers, which can lead to data corruption, as a new copy may overwrite a buffer still being read by the GPU. Proper synchronization is recommended. Additionally, there are opportunities to improve error handling by catching more specific exceptions and to refine logging to prevent excessive output.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b359e1856e


@ichbinhandsome
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a double-buffering optimization for GPU activations during the backward pass, which is a solid approach to improve performance by overlapping memory copies with computation. The implementation correctly uses CUDA events for fine-grained synchronization. My review includes a few suggestions to enhance code quality and robustness, such as replacing magic numbers with predefined constants, avoiding bare except clauses, and refactoring some conditional logic for better conciseness.

@ichbinhandsome
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a double-buffering optimization for GPU activations to overlap H2D memory copies with backward computation, which is a solid performance enhancement. The implementation correctly uses a second GPU buffer and CUDA events for synchronization. The fallback mechanism to disable double buffering on OOM errors is also well-handled. My feedback focuses on improving the user-facing print statements to make them clearer and less repetitive, providing better feedback during initialization and in case of runtime errors. The provided rules did not require any modifications or dropping of the original comments.

@mmathew23
Collaborator

Hi @ichbinhandsome, thanks for the PR! I made a commit related to freeing up some dangling GPU refs. Please take a look when you get a chance.

@ichbinhandsome
Contributor Author

Hi @ichbinhandsome, thanks for the PR! I made a commit related to freeing up some dangling GPU refs. Please take a look when you get a chance.

Thanks @mmathew23 for the review! These fixes look good to me. I also addressed a small global variable scope issue here: 836af30

Besides that, I have a few questions regarding @danielhanchen 's commits:

Remove print("Unsloth: Double buffering enabled...") that fires on every buffer resize during warmup (once per layer), not once per run.

Agreed on removing the print. One clarification on the buffer resizing behavior though: I believe the resize only happens at most once per training step, not once per layer. Since Unsloth offloads each layer's activations from GPU to CPU, and all layers produce activations of the same size, the buffer only needs to resize from its small initial size on the very first layer. After that, every subsequent layer reuses the same buffer size.

Across different training steps, the buffer size can differ because Unsloth uses packing (padding-free), so the size of activations may vary per batch. But since the buffer only grows (it resizes when new_size > buffer.numel()), it will monotonically increase to the largest activation seen so far and stay there. If the buffer required for the current batch is smaller than the allocated buffer, the resize is not triggered.

So the print was indeed noisy, but the reason is likely that it fired on the first layer of every step (before the buffer grew to full size on the very first step), not that it fired on every layer. Is my understanding correct?
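The grow-only behavior described here can be shown with a toy sketch (a hedged illustration, not the Unsloth code; `buf`, `resize_count`, and `reload_into_buffer` are hypothetical names):

```python
import torch

buf = torch.empty(0)          # starts tiny, like the warmup buffer
resize_count = 0

def reload_into_buffer(activation):
    """Resize only when the incoming activation exceeds the buffer."""
    global resize_count
    if activation.numel() > buf.numel():
        buf.resize_(activation.numel())
        resize_count += 1
    buf[:activation.numel()].copy_(activation.reshape(-1))
    return buf[:activation.numel()]

# "Step 1": 24 layers share one activation size -> single resize on layer 0.
for _ in range(24):
    reload_into_buffer(torch.randn(1024))
# "Step 2": smaller batch -> no resize. "Step 3": larger batch -> one more.
reload_into_buffer(torch.randn(512))
reload_into_buffer(torch.randn(2048))
print(resize_count)  # prints 2: one resize per new maximum, never per layer
```

So over a whole run the number of resizes is bounded by the number of new activation-size maxima, consistent with the "at most once per step, early in training" behavior described above.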

Reset USE_DOUBLE_BUFFER in reset_unsloth_gradient_checkpointing_buffers()
so that if OOM disabled double buffering mid-run, subsequent trainer.train()
calls re-enable it when buffer B still exists.

In the current code, when OOM disables double buffering, it sets both USE_DOUBLE_BUFFER = False and GPU_BUFFERS_B = None, so by the time we reach this check, GPU_BUFFERS_B is None and the re-enable never triggers. Please correct me if I'm wrong (cc @mmathew23 regarding the buffer freeing).
I am happy to keep it for safety, just wanted to flag it :)

@mmathew23
Collaborator

yes thanks for flagging. I guess we can keep it for safety :)

@ichbinhandsome
Contributor Author

Hi @mmathew23 @danielhanchen I just got back from vacation and wanted to quickly check on the status of this PR. Have you also observed any performance improvements through this PR?

Contributor

@danielhanchen danielhanchen left a comment


Thank you for the PR! The goal of this PR is to add double-buffering to Unsloth's gradient checkpointing activation reloading, alternating between two GPU buffers (A and B) during the backward pass to overlap H2D copies with gradient computation. As a summary, this PR allocates a second GPU buffer (B), per-buffer CUDA events, and uses CPU_INDEX parity to select which buffer to copy into during backward, while adding OOM fallback, cleanup, and reset logic for the new state.

Benchmark results (Llama-3.2-1B-Instruct, 4-bit LoRA, 61 steps, B200 GPU):

| Method | Tokens/s | Peak Memory | Train Loss |
| --- | --- | --- | --- |
| Baseline (main) | 22,410 | 1.528 GB | 1.3790 |
| PR (double-buf ON) | 24,527 | 1.532 GB | 1.3790 |
| PR (double-buf OFF) | 24,436 | 1.530 GB | 1.3790 |

Double-buffer ON vs OFF: +0.4% (within noise). Losses and grad-norms are numerically equivalent across all modes (max delta 0.0035). The double-buffering mechanism is numerically correct and the core idea is sound. The OOM fallback and cleanup paths are safe.

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 6/8 | High | Buffer slot selection uses global CPU_INDEX parity, not per-device state -- breaks multi-GPU/device_map |
| 6/8 | High | torch.cuda.Event() hard-coded; breaks/silently disables on XPU devices |
| 4/8 | Medium | Broad except Exception on buffer-B resize swallows non-OOM errors |
| 2/8 | Low | OOM permanently disables double buffering with no recovery path |

Concrete suggestions for each finding below.
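For the hard-coded torch.cuda.Event() finding, one device-agnostic approach is to pick the event class from the device type rather than assuming CUDA. This is an illustrative sketch only, not necessarily the fix adopted in the PR (`make_event` is a hypothetical helper):

```python
import torch

def make_event(device: torch.device):
    """Return a stream event appropriate for the device, or None when the
    backend has no stream semantics (the caller then skips event syncing)."""
    if device.type == "cuda":
        return torch.cuda.Event()
    if device.type == "xpu" and hasattr(torch, "xpu"):
        return torch.xpu.Event()  # available in recent PyTorch XPU builds
    return None
```

Returning None instead of raising lets the single-buffer path keep working on backends without events, which matches the "silently disables" concern in the table above while avoiding a hard break.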

@ichbinhandsome
Contributor Author

ichbinhandsome commented Mar 31, 2026

Thanks for your feedback! @danielhanchen

Benchmark results (Llama-3.2-1B-Instruct, 4-bit LoRA, 61 steps, B200 GPU):
Double-buffer ON vs OFF: +0.4% (within noise). Losses and grad-norms are numerically equivalent across all modes (max delta 0.0035). The double-buffering mechanism is numerically correct and the core idea is sound. The OOM fallback and cleanup paths are safe.

The results are a bit surprising, as there doesn’t seem to be any performance gain from double buffering.

Could you try benchmarking with a larger model and larger activations (e.g., higher batch size or sequence length)? These cases involve more H2D copy time with Unsloth checkpointing in the backward pass and may better highlight any potential benefits.
PS: If you're testing on GB200 (with NVLink between CPU and GPU), the gains may be less noticeable. On an RTX GPU over PCIe we might see larger gains.

If we still don’t observe a performance improvement, do you think it’s worth continuing with this PR?

@ichbinhandsome ichbinhandsome force-pushed the double_buffers_backward branch from 836af30 to fbf8fb5 on April 7, 2026, 19:12
@ichbinhandsome
Contributor Author

Benchmark results for this PR on an RTX 5080 GPU:

  • Model: unsloth/Qwen3-1.7B-unsloth-bnb-4bit (QLoRA configuration)
  • Max sequence length: 2048
  • Batch size: 24
  • Training steps: 80
| Metric | Baseline (Single Buffer) | Double Buffer | Performance Improvement |
| --- | --- | --- | --- |
| Train Runtime | 266.39 s | 256.25 s | 3.81% |
| Throughput (samples/s) | 7.207 | 7.493 | 3.97% |
| Steps/s | 0.300 | 0.312 | 4.00% |
| Time per Step | 3.33 s | 3.20 s | 3.90% |

Increasing either the batch size or sequence length leads to better performance, as larger workloads incur more activation reload overhead during the backward pass for H2D memory transfers.

I’ve also addressed all the issues you raised, please let me know if you have any other concerns. @danielhanchen

@Datta0
Collaborator

Datta0 commented Apr 9, 2026

From my analysis, the memory overhead seems to scale with sequence length. So perhaps we should disable this for low-end GPUs with, say, <16 GB of memory? I do see that there are OOM checks for creating the buffers, but they don't guard against OOMs that happen later, right?
Or maybe we can say that if, after enabling LoRA, we still have ~4 GB of free memory, we enable the buffers?

Also, yeah, I did notice 5-10%-ish performance gains for an 8B-sized model.

@ichbinhandsome
Contributor Author

ichbinhandsome commented Apr 9, 2026

Thanks for your review! @Datta0

from my analysis, the memory overhead seems to scale with sequence length. So we should perhaps disable this for low end GPUs that have memory say <16GB? I do see that there are OOM checks for creating buffers but they do not guard against OOMs that happen later right?
Or maybe we can say if after enabling LoRA we still have 4GB ish free memory, we can enable buffers?

Good catch on the OOM guards. I've addressed this in the latest commit, OOM during buffer resize is now handled carefully:

  • Buffer A (Single Buffer) resize has OOM protection: if it fails while double buffering is active, it frees buffer B first and retries with single buffer mode
  • Buffer B resize OOM failure frees its memory immediately (resize_(0) + set to None) and falls back to single buffer
  • unpatch and reset now properly clean up buffer B, events, and slot state

The current approach enables double buffering optimistically and always prioritizes the original single-buffer strategy.
Buffer resizing only happens during the forward pass, when no data lives in the GPU buffers yet, so if memory pressure arises at any point, the code safely degrades to single-buffer mode and reclaims the freed memory for single-buffer use. The backward pass simply checks a flag to use whichever mode is active. This adapts to any GPU and workload without requiring manual thresholds.
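A minimal sketch of this fallback, with hypothetical names (`try_grow_buffer_b` is not the PR's actual function; the real code also clears events and slot state):

```python
import torch

def try_grow_buffer_b(buffer_b, new_numel):
    """Attempt to grow buffer B; on OOM, free it and report single-buffer mode.

    Returns (buffer_b_or_None, use_double_buffer)."""
    try:
        buffer_b.resize_(new_numel)
        return buffer_b, True
    except torch.cuda.OutOfMemoryError:
        buffer_b.resize_(0)   # release buffer B's storage immediately
        return None, False    # fall back to single-buffer mode
```

Catching torch.cuda.OutOfMemoryError specifically, rather than a bare Exception, also addresses the earlier review finding about swallowing non-OOM errors.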

Please let me know if you have any other concerns :)

@Datta0
Collaborator

Datta0 commented Apr 9, 2026

I'm just thinking of an extreme-case workload: say on a 16 GB machine, our old setup takes ~15 GB. We proactively allocate the second buffer during the forward pass, but during the backward pass (which I presume consumes more memory than forward), the process OOMs.
I know this is very rare, especially given the small overhead we're adding, but I guess we're much safer if we restrict this to cases where, after the original model load, a certain amount of memory is free (relative or absolute)?

One other thing I couldn't benchmark: how does adding a second buffer affect forward/backward performance in cases where memory pressure is already high?


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18db07a857


@ichbinhandsome
Contributor Author

ichbinhandsome commented Apr 9, 2026

I'm just thinking of an extreme-case workload: say on a 16 GB machine, our old setup takes ~15 GB. We proactively allocate the second buffer during the forward pass, but during the backward pass (which I presume consumes more memory than forward), the process OOMs.
I know this is very rare, especially given the small overhead we're adding, but I guess we're much safer if we restrict this to cases where, after the original model load, a certain amount of memory is free (relative or absolute)?

Makes sense. I've added a few safety layers to handle this:

  • Double buffering starts disabled. After the first training step completes (tracked via a FIRST_PASS flag), we check torch.cuda.mem_get_info() (which reflects post-backward peak headroom on the GPU for the first training step), and only enable double buffering for subsequent steps if CUDA free memory > 512 MB (configurable).

  • OOM guards on buffer resize: If buffer A or B resize OOMs during forward, buffer B is freed and we fall back to single buffer.

  • Environment variable: For extreme cases we can set UNSLOTH_DISABLE_DOUBLE_BUFFER=1 to skip buffer B allocation entirely, which is more reliable than any heuristic when memory is that tight.
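The gating described in these bullets can be sketched roughly as follows. The 512 MB threshold and the UNSLOTH_DISABLE_DOUBLE_BUFFER variable come from the comment above; the function name `should_enable_double_buffer` is illustrative, not the PR's actual identifier:

```python
import os
import torch

def should_enable_double_buffer(min_free_bytes=512 * 1024**2):
    """Decide, after the first full training step, whether to allocate buffer B."""
    if os.environ.get("UNSLOTH_DISABLE_DOUBLE_BUFFER", "0") == "1":
        return False                    # explicit opt-out always wins
    if not torch.cuda.is_available():
        return False
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # Called after step 1, so free_bytes already reflects post-backward headroom.
    return free_bytes > min_free_bytes
```

Checking after the first step (rather than at load time) is what lets the heuristic see the true backward-pass peak, which is the scenario @Datta0 raised.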

One other thing I couldn't benchmark: how does adding a second buffer affect forward/backward performance in cases where memory pressure is already high?

Yeah, double buffering is now only enabled when CUDA free memory exceeds 512 MB after the first training step. Under high memory pressure it stays disabled, so there's zero performance impact. Single buffer is always the default; double buffering is only activated as a bonus when there's enough VRAM headroom.

Feel free to share any suggestions or concerns!


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51b878de77



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 135d9dcdc5


Collaborator

@Datta0 Datta0 left a comment


LGTM!
