
Double-buffer GPU activations for overlapping H2D copy with backward compute #534

Open
ichbinhandsome wants to merge 10 commits into unslothai:main from ichbinhandsome:double_buffers_backward

Conversation

@ichbinhandsome
Contributor

@ichbinhandsome ichbinhandsome commented Mar 6, 2026

This PR implements a double-buffering optimization for activation reloading during the backward pass of gradient checkpointing with use_gradient_checkpointing="unsloth". This means any model fine-tuned with Unsloth's gradient checkpointing will benefit from this PR.

Problem: With the current single-buffer implementation, the H2D memory copy for reloading offloaded activations during backward cannot overlap with GPU computation. The copy stream must wait for the compute stream to finish using the shared GPU buffer before writing the next activation into it, creating GPU idle bubbles.

Solution: Allocate a second GPU buffer (GPU_BUFFERS_B) and alternate between the two buffers based on layer index (CPU_INDEX % 2), using CUDA events to prevent race conditions. While the compute stream operates on buffer A, the copy stream can simultaneously load the next activation into buffer B, and vice versa — fully overlapping H2D transfers with backward computation.

e.g.

Layer 23 backward:
  1. H2D copy into buffer A (EXTRA_STREAM)
  2. MAIN_STREAM.wait_stream(EXTRA_STREAM)  // wait for copy
  3. Compute on buffer A                   
Layer 22 backward:
  1. H2D copy into buffer B (EXTRA_STREAM)  // overlaps with step 3 above
  2. MAIN_STREAM.wait_stream(EXTRA_STREAM)
  3. Compute on buffer B                    
Layer 21 backward:
  1. H2D copy into buffer A (EXTRA_STREAM)  // Layer 23's compute already done
  2. MAIN_STREAM.wait_stream(EXTRA_STREAM)
  3. Compute on buffer A
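The alternation above can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the PR's actual code: the names `reload_activations`, `copy_stream`, `buffers`, and `done` are hypothetical stand-ins for EXTRA_STREAM, GPU_BUFFERS/GPU_BUFFERS_B, and the per-buffer CUDA events, and it assumes same-shaped activations (the real code resizes) and ideally pinned host memory for a truly asynchronous copy:

```python
import torch

def reload_activations(cpu_tensors):
    """Yield a GPU copy of each offloaded activation, alternating two
    buffers so the H2D copy for one layer can overlap compute on another."""
    if not torch.cuda.is_available():
        # No streams to overlap without a GPU; just pass tensors through.
        yield from cpu_tensors
        return
    copy_stream = torch.cuda.Stream()          # plays the role of EXTRA_STREAM
    main = torch.cuda.current_stream()         # plays the role of MAIN_STREAM
    buffers = [None, None]                     # buffer A / buffer B
    done = [torch.cuda.Event(), torch.cuda.Event()]  # "compute finished" per slot
    for i, cpu_t in enumerate(cpu_tensors):    # backward order in the real code
        slot = i % 2                           # CPU_INDEX % 2 parity
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(done[slot])  # don't overwrite a live buffer
            if buffers[slot] is None:
                buffers[slot] = torch.empty_like(cpu_t, device="cuda")
            buffers[slot].copy_(cpu_t, non_blocking=True)
        main.wait_stream(copy_stream)          # compute waits for this copy only
        yield buffers[slot]                    # caller runs backward on main stream
        done[slot].record(main)                # slot reusable once compute is done
```

Because `main` only waits on the copy for the buffer it is about to read, the copy into the other slot is free to run concurrently with compute, which is the overlap shown in the layer-23/22/21 trace above.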

Here are the nsys profiling traces:

  • Without this PR: (nsys trace screenshot)
  • With this PR: (nsys trace screenshot)

Performance:

This method adds little memory overhead, as it only requires one extra buffer in VRAM. For example, in gpt-oss-20B fine-tuning with batch size 8, one buffer takes around 47.18 MB.
This PR also implements a smart enabling strategy: double-buffering is enabled only if there is enough free VRAM, and disabled otherwise.
In my experiments, this achieves approximately a 10% speedup per backward pass during GPT-OSS-20B fine-tuning, though results may vary across models, hyperparameters (e.g. sequence length, batch size), and hardware configurations.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the gradient checkpointing process in Unsloth by addressing a performance bottleneck where H2D memory transfers for reloaded activations could not overlap with GPU computation. By introducing a second GPU buffer and intelligently managing its allocation and usage, the system can now perform data transfers concurrently with compute operations, leading to a more efficient backward pass and faster fine-tuning times.

Highlights

  • Double-Buffering Implementation: Introduced a double-buffering mechanism for GPU activations during the backward pass of gradient checkpointing, utilizing a second set of GPU buffers (GPU_BUFFERS_B).
  • H2D Copy Overlap: Enabled overlapping of Host-to-Device (H2D) memory copies with GPU computation by alternating between two GPU buffers, significantly reducing GPU idle time.
  • Dynamic Double-Buffering: Implemented a smart strategy to dynamically enable double-buffering if sufficient VRAM is available, and gracefully disable it if memory allocation fails.
  • Performance Improvement: Achieved an approximate 10% speedup per backward pass during GPT-OSS-20B fine-tuning, demonstrating a notable performance gain.
Changelog
  • unsloth_zoo/gradient_checkpointing.py
    • Introduced global variables GPU_BUFFERS_B and USE_DOUBLE_BUFFER to manage the second GPU buffer and its activation status.
    • Implemented conditional allocation of GPU_BUFFERS_B during initialization, enabling double-buffering if memory permits, otherwise falling back to single-buffering.
    • Modified the forward pass to dynamically resize GPU_BUFFERS_B and disable double-buffering if an out-of-memory error occurs during resizing.
    • Updated the backward pass to alternate between GPU_BUFFERS and GPU_BUFFERS_B for H2D copies, based on the layer index, when double-buffering is active.
    • Removed the explicit stream synchronization (EXTRA_STREAM.wait_stream(MAIN_STREAM)) in the backward pass when double-buffering is enabled, allowing H2D copies to overlap with compute.
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b38af64377

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a valuable double-buffering optimization for GPU activations, effectively overlapping H2D memory copies with backward computation for a significant performance enhancement. However, a high-severity finding has been identified: the current implementation contains a race condition. The code fails to synchronize the copy stream with the compute stream when reusing buffers, which can lead to data corruption, as a new copy may overwrite a buffer still being read by the GPU. Proper synchronization is recommended. Additionally, there are opportunities to improve error handling by catching more specific exceptions and to refine logging to prevent excessive output.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b359e1856e


@ichbinhandsome
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a double-buffering optimization for GPU activations during the backward pass, which is a solid approach to improve performance by overlapping memory copies with computation. The implementation correctly uses CUDA events for fine-grained synchronization. My review includes a few suggestions to enhance code quality and robustness, such as replacing magic numbers with predefined constants, avoiding bare except clauses, and refactoring some conditional logic for better conciseness.

@ichbinhandsome
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a double-buffering optimization for GPU activations to overlap H2D memory copies with backward computation, which is a solid performance enhancement. The implementation correctly uses a second GPU buffer and CUDA events for synchronization. The fallback mechanism to disable double buffering on OOM errors is also well-handled. My feedback focuses on improving the user-facing print statements to make them clearer and less repetitive, providing better feedback during initialization and in case of runtime errors. The provided rules did not require any modifications or dropping of the original comments.

@mmathew23
Collaborator

Hi @ichbinhandsome, thanks for the PR! I made a commit related to freeing up some dangling GPU refs. Please take a look when you get a chance.

@ichbinhandsome
Contributor Author

Hi @ichbinhandsome, thanks for the PR! I made a commit related to freeing up some dangling GPU refs. Please take a look when you get a chance.

Thanks @mmathew23 for the review! These fixes look good to me. I also addressed a small global variable scope issue here: 836af30

Besides that, I have a few questions regarding @danielhanchen 's commits:

Remove print("Unsloth: Double buffering enabled...") that fires on every buffer resize during warmup (once per layer), not once per run.

Agreed on removing the print. One clarification on the buffer resizing behavior though: I believe the resize only happens at most once per training step, not once per layer. Since Unsloth offloads each layer's activations from GPU to CPU, and all layers produce activations of the same size, the buffer only needs to resize from its small initial size on the very first layer. After that, every subsequent layer reuses the same buffer size.

Across different training steps, the buffer size can differ because Unsloth uses packing (padding-free), so the size of activations may vary per batch. But since the buffer only grows (it resizes when new_size > buffer.numel()), it will monotonically increase to the largest activation seen so far and stay there. If the buffer required for the current batch is smaller than the allocated buffer, the resize is not triggered.

So the print was indeed noisy, but the reason is likely that it fired on the first layer of every step (before the buffer grew to full size on the very first step), not that it fired on every layer. Is my understanding correct?
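The grow-only behavior described here can be shown with a toy sketch (a hedged illustration, not the Unsloth code; `buf`, `resize_count`, and `reload_into_buffer` are hypothetical names):

```python
import torch

buf = torch.empty(0)          # starts tiny, like the warmup buffer
resize_count = 0

def reload_into_buffer(activation):
    """Resize only when the incoming activation exceeds the buffer."""
    global resize_count
    if activation.numel() > buf.numel():
        buf.resize_(activation.numel())
        resize_count += 1
    buf[:activation.numel()].copy_(activation.reshape(-1))
    return buf[:activation.numel()]

# "Step 1": 24 layers share one activation size -> single resize on layer 0.
for _ in range(24):
    reload_into_buffer(torch.randn(1024))
# "Step 2": smaller batch -> no resize. "Step 3": larger batch -> one more.
reload_into_buffer(torch.randn(512))
reload_into_buffer(torch.randn(2048))
print(resize_count)  # prints 2: one resize per new maximum, never per layer
```

So over a whole run the number of resizes is bounded by the number of new activation-size maxima, consistent with the "at most once per step, early in training" behavior described above.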

Reset USE_DOUBLE_BUFFER in reset_unsloth_gradient_checkpointing_buffers()
so that if OOM disabled double buffering mid-run, subsequent trainer.train()
calls re-enable it when buffer B still exists.

In the current code, when OOM disables double buffering, it sets both USE_DOUBLE_BUFFER = False and GPU_BUFFERS_B = None, so by the time we reach this check, GPU_BUFFERS_B is None and the re-enable never triggers. Please correct me if I'm wrong (cc @mmathew23 regarding the buffer freeing).
I am happy to keep it for safety, just wanted to flag it :)

@mmathew23
Collaborator

yes thanks for flagging. I guess we can keep it for safety :)

@ichbinhandsome
Contributor Author

Hi @mmathew23 @danielhanchen I just got back from vacation and wanted to quickly check on the status of this PR. Have you also observed any performance improvements through this PR?

Contributor

@danielhanchen danielhanchen left a comment


Thank you for the PR! The goal of this PR is to add double-buffering to Unsloth's gradient checkpointing activation reloading, alternating between two GPU buffers (A and B) during the backward pass to overlap H2D copies with gradient computation. As a summary, this PR allocates a second GPU buffer (B), per-buffer CUDA events, and uses CPU_INDEX parity to select which buffer to copy into during backward, while adding OOM fallback, cleanup, and reset logic for the new state.

Benchmark results (Llama-3.2-1B-Instruct, 4-bit LoRA, 61 steps, B200 GPU):

| Method | Tokens/s | Peak Memory | Train Loss |
| --- | --- | --- | --- |
| Baseline (main) | 22,410 | 1.528 GB | 1.3790 |
| PR (double-buf ON) | 24,527 | 1.532 GB | 1.3790 |
| PR (double-buf OFF) | 24,436 | 1.530 GB | 1.3790 |

Double-buffer ON vs OFF: +0.4% (within noise). Losses and grad-norms are numerically equivalent across all modes (max delta 0.0035). The double-buffering mechanism is numerically correct and the core idea is sound. The OOM fallback and cleanup paths are safe.

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 6/8 | High | Buffer slot selection uses global CPU_INDEX parity, not per-device state -- breaks multi-GPU/device_map |
| 6/8 | High | torch.cuda.Event() hard-coded; breaks/silently disables on XPU devices |
| 4/8 | Medium | Broad except Exception on buffer-B resize swallows non-OOM errors |
| 2/8 | Low | OOM permanently disables double buffering with no recovery path |

Concrete suggestions for each finding below.
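For the hard-coded torch.cuda.Event() finding, one device-agnostic approach is to pick the event class from the device type rather than assuming CUDA. This is an illustrative sketch only, not necessarily the fix adopted in the PR (`make_event` is a hypothetical helper):

```python
import torch

def make_event(device: torch.device):
    """Return a stream event appropriate for the device, or None when the
    backend has no stream semantics (the caller then skips event syncing)."""
    if device.type == "cuda":
        return torch.cuda.Event()
    if device.type == "xpu" and hasattr(torch, "xpu"):
        return torch.xpu.Event()  # available in recent PyTorch XPU builds
    return None
```

Returning None instead of raising lets the single-buffer path keep working on backends without events, which matches the "silently disables" concern in the table above while avoiding a hard break.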

@ichbinhandsome
Contributor Author

ichbinhandsome commented Mar 31, 2026

Thanks for your feedback! @danielhanchen

Benchmark results (Llama-3.2-1B-Instruct, 4-bit LoRA, 61 steps, B200 GPU):
Double-buffer ON vs OFF: +0.4% (within noise). Losses and grad-norms are numerically equivalent across all modes (max delta 0.0035). The double-buffering mechanism is numerically correct and the core idea is sound. The OOM fallback and cleanup paths are safe.

The results are a bit surprising, as there doesn’t seem to be any performance gain from double buffering.

Could you try benchmarking with a larger model and larger activations (e.g., higher batch size or sequence length)? These cases involve more H2D copy time with Unsloth checkpointing in the backward pass and may better highlight any potential benefits.
PS: If you're testing on GB200 (with NVLink between CPU and GPU), the gains may be less noticeable. On an RTX GPU over PCIe we might see larger gains.

If we still don’t observe a performance improvement, do you think it’s worth continuing with this PR?

@ichbinhandsome ichbinhandsome force-pushed the double_buffers_backward branch from 836af30 to fbf8fb5 on April 7, 2026, 19:12
@ichbinhandsome
Contributor Author

Benchmark results for this PR on an RTX 5080 GPU:

  • Model: unsloth/Qwen3-1.7B-unsloth-bnb-4bit (QLoRA configuration)
  • Max sequence length: 2048
  • Batch size: 24
  • Training steps: 80
| Metric | Baseline (Single Buffer) | Double Buffer | Performance Improvement |
| --- | --- | --- | --- |
| Train Runtime | 266.39 s | 256.25 s | 3.81% |
| Throughput (samples/s) | 7.207 | 7.493 | 3.97% |
| Steps/s | 0.300 | 0.312 | 4.00% |
| Time per Step | 3.33 s | 3.20 s | 3.90% |

Increasing either the batch size or sequence length leads to better performance, as larger workloads incur more activation reload overhead during the backward pass for H2D memory transfers.

I’ve also addressed all the issues you raised, please let me know if you have any other concerns. @danielhanchen

@Datta0
Collaborator

Datta0 commented Apr 9, 2026

From my analysis, the memory overhead seems to scale with sequence length. So perhaps we should disable this for low-end GPUs with, say, <16 GB of memory? I do see that there are OOM checks for creating the buffers, but they don't guard against OOMs that happen later, right?
Or maybe we can say that if, after enabling LoRA, we still have ~4 GB of free memory, we enable the buffers?

Also, yeah, I did notice 5-10%-ish performance gains for an 8B-sized model.

@ichbinhandsome
Contributor Author

ichbinhandsome commented Apr 9, 2026

Thanks for your review! @Datta0

from my analysis, the memory overhead seems to scale with sequence length. So we should perhaps disable this for low end GPUs that have memory say <16GB? I do see that there are OOM checks for creating buffers but they do not guard against OOMs that happen later right?
Or maybe we can say if after enabling LoRA we still have 4GB ish free memory, we can enable buffers?

Good catch on the OOM guards. I've addressed this in the latest commit, OOM during buffer resize is now handled carefully:

  • Buffer A (Single Buffer) resize has OOM protection: if it fails while double buffering is active, it frees buffer B first and retries with single buffer mode
  • Buffer B resize OOM failure frees its memory immediately (resize_(0) + set to None) and falls back to single buffer
  • unpatch and reset now properly clean up buffer B, events, and slot state

The current approach enables double buffering optimistically and always prioritizes the original single-buffer strategy.
Buffer resizing only happens during the forward pass, when no data lives in the GPU buffers yet, so if memory pressure arises at any point, the code safely degrades to single-buffer mode and reclaims the freed memory for single-buffer use. The backward pass simply checks a flag to use whichever mode is active. This adapts to any GPU and workload without requiring manual thresholds.
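A minimal sketch of this fallback, with hypothetical names (`try_grow_buffer_b` is not the PR's actual function; the real code also clears events and slot state):

```python
import torch

def try_grow_buffer_b(buffer_b, new_numel):
    """Attempt to grow buffer B; on OOM, free it and report single-buffer mode.

    Returns (buffer_b_or_None, use_double_buffer)."""
    try:
        buffer_b.resize_(new_numel)
        return buffer_b, True
    except torch.cuda.OutOfMemoryError:
        buffer_b.resize_(0)   # release buffer B's storage immediately
        return None, False    # fall back to single-buffer mode
```

Catching torch.cuda.OutOfMemoryError specifically, rather than a bare Exception, also addresses the earlier review finding about swallowing non-OOM errors.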

Please let me know if you have any other concerns :)

@Datta0
Collaborator

Datta0 commented Apr 9, 2026

I'm just thinking of an extreme-case workload: say on a 16 GB machine, our old setup takes ~15 GB. We proactively allocate the second buffer during the forward pass, but during the backward pass (which I presume consumes more memory than forward), the process OOMs.
I know this is very rare, especially given the small overhead we're adding, but I guess we're much safer if we restrict this to cases where, after the original model load, a certain amount of memory is free (relative or absolute)?

One other thing I couldn't benchmark: how does adding a second buffer affect forward/backward performance in cases where memory pressure is already high?


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18db07a857


@ichbinhandsome
Contributor Author

ichbinhandsome commented Apr 9, 2026

I'm just thinking of an extreme-case workload: say on a 16 GB machine, our old setup takes ~15 GB. We proactively allocate the second buffer during the forward pass, but during the backward pass (which I presume consumes more memory than forward), the process OOMs.
I know this is very rare, especially given the small overhead we're adding, but I guess we're much safer if we restrict this to cases where, after the original model load, a certain amount of memory is free (relative or absolute)?

Makes sense. I've added a few safety layers to handle this:

  • Double buffering starts disabled. After the first training step completes (tracked via a FIRST_PASS flag), we check torch.cuda.mem_get_info() (which reflects post-backward peak headroom on the GPU for the first training step), and only enable double buffering for subsequent steps if CUDA free memory > 512 MB (configurable).

  • OOM guards on buffer resize: If buffer A or B resize OOMs during forward, buffer B is freed and we fall back to single buffer.

  • Environment variable: For extreme cases we can set UNSLOTH_DISABLE_DOUBLE_BUFFER=1 to skip buffer B allocation entirely, which is more reliable than any heuristic when memory is that tight.
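The gating described in these bullets can be sketched roughly as follows. The 512 MB threshold and the UNSLOTH_DISABLE_DOUBLE_BUFFER variable come from the comment above; the function name `should_enable_double_buffer` is illustrative, not the PR's actual identifier:

```python
import os
import torch

def should_enable_double_buffer(min_free_bytes=512 * 1024**2):
    """Decide, after the first full training step, whether to allocate buffer B."""
    if os.environ.get("UNSLOTH_DISABLE_DOUBLE_BUFFER", "0") == "1":
        return False                    # explicit opt-out always wins
    if not torch.cuda.is_available():
        return False
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # Called after step 1, so free_bytes already reflects post-backward headroom.
    return free_bytes > min_free_bytes
```

Checking after the first step (rather than at load time) is what lets the heuristic see the true backward-pass peak, which is the scenario @Datta0 raised.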

One other thing I couldn't benchmark: how does adding a second buffer affect forward/backward performance in cases where memory pressure is already high?

Yeah, double buffering is now only enabled when CUDA free memory exceeds 512 MB after the first training step. Under high memory pressure it stays disabled, so there's zero performance impact. Single buffer is always the default; double buffering is only activated as a bonus when there's enough VRAM headroom.

Feel free to share any suggestions or concerns!


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51b878de77



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 135d9dcdc5


Collaborator

@Datta0 Datta0 left a comment


LGTM!
