Summary
peft.prepare_model_for_kbit_training allocates approximately 1024 MB of CUDA reserved memory in 500 ms during the fp32 upcast of layer norms + gradient-flow setup. On 8 GB unified-memory edge accelerators (Jetson Orin Nano, Apple Silicon, AMD APU), this is a non-trivial fraction of total available memory and is not documented in the PEFT docs or function docstring.
I hit this cliff repeatedly during Mistral-7B-v0.3 QLoRA experiments on NVIDIA Jetson Orin Nano Super 8GB; the +1 GB consumed by prepare_model_for_kbit_training is the load-bearing cause of OOM failures in a recipe that would otherwise fit.
Reproducer
Run on any CUDA host; the absolute number will vary slightly with model + driver + torch version, but the ~1 GB jump at the prepare step is consistent.
```python
import json
import time

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

T0 = time.time()


def snap(phase: str) -> None:
    """Print one JSON line with elapsed time and CUDA allocator counters."""
    print(json.dumps({
        "phase": phase,
        "elapsed_s": round(time.time() - T0, 2),
        # Memory occupied by live tensors, in MiB.
        "cuda_allocated_mb": torch.cuda.memory_allocated() // (1024 * 1024),
        # Memory held by the caching allocator (live + cached), in MiB.
        "cuda_reserved_mb": torch.cuda.memory_reserved() // (1024 * 1024),
    }), flush=True)


# 4-bit NF4 quantization with double quantization and bf16 compute.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

snap("00_start")

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_cfg,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},  # place the whole model on CUDA device 0
    low_cpu_mem_usage=True,
)
snap("01_model_loaded")

# The step under test; gradient checkpointing is off to isolate the prep cost.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)
snap("02_after_prepare_for_kbit_training")
```
Observed output (Jetson Orin Nano Super 8GB, L4T 36.4.4, torch 2.5.1, bnb 0.46.1, peft 0.18.0, transformers 4.56.0)
```
{"phase": "00_start", "elapsed_s": 0.0, "cuda_allocated_mb": 0, "cuda_reserved_mb": 0}
{"phase": "01_model_loaded", "elapsed_s": 195.3, "cuda_allocated_mb": 3956, "cuda_reserved_mb": 4192}
{"phase": "02_after_prepare_for_kbit_training", "elapsed_s": 195.8, "cuda_allocated_mb": 4468, "cuda_reserved_mb": 5216}
```
Key delta: cuda_reserved_mb jumps 4192 → 5216 MB (+1024 MB) in 500 ms during prepare_model_for_kbit_training. cuda_allocated_mb jumps 3956 → 4468 MB (+512 MB) in the same window. The gap between allocated and reserved widens from 236 MB to 748 MB, which suggests the CUDA caching allocator over-reserves for the transient upcast operations and doesn't release the extra reservation.
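For anyone comparing their own run against these numbers, here is a minimal sketch that diffs two snapshot lines. It only uses the two log lines from the observed output; the helper name phase_deltas is made up for illustration.

```python
import json

# The last two log lines from the observed output, copied verbatim.
LOG = """
{"phase": "01_model_loaded", "elapsed_s": 195.3, "cuda_allocated_mb": 3956, "cuda_reserved_mb": 4192}
{"phase": "02_after_prepare_for_kbit_training", "elapsed_s": 195.8, "cuda_allocated_mb": 4468, "cuda_reserved_mb": 5216}
"""


def phase_deltas(log_text: str) -> dict:
    """Diff the first and last snapshot in a block of snap() JSON lines."""
    rows = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    before, after = rows[0], rows[-1]
    return {
        "elapsed_s": round(after["elapsed_s"] - before["elapsed_s"], 2),
        "allocated_delta_mb": after["cuda_allocated_mb"] - before["cuda_allocated_mb"],
        "reserved_delta_mb": after["cuda_reserved_mb"] - before["cuda_reserved_mb"],
        # Slack = memory the caching allocator holds that no live tensor uses.
        "slack_before_mb": before["cuda_reserved_mb"] - before["cuda_allocated_mb"],
        "slack_after_mb": after["cuda_reserved_mb"] - after["cuda_allocated_mb"],
    }


deltas = phase_deltas(LOG)
print(deltas)
```

The slack figures are the part worth watching: if slack grows across the prepare step, the extra reservation is cached memory rather than live tensors.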
For a 7B-class model on a discrete-GPU system with 24+ GB, this is irrelevant. For a unified-memory edge accelerator (Jetson SoC) with 8 GB shared between CPU and GPU, +1 GB CUDA reserved = −1 GB MemAvailable to the host, and on our Jetson tests this difference was the boundary between "training survives FSDP wrap" and "OOM-killer fires".
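To see the host-side effect on a unified-memory board, the reproducer's snapshots could also record MemAvailable from /proc/meminfo. The sketch below is one way to do that on Linux; both function names are hypothetical, and the sample text is made up for illustration, not a measurement.

```python
import re
from pathlib import Path


def parse_mem_available_mb(meminfo_text: str) -> int:
    """Return MemAvailable in MiB from /proc/meminfo-style text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) // 1024  # the kernel reports kB (KiB)


def host_mem_available_mb(path: str = "/proc/meminfo") -> int:
    """Read the live value from the kernel; Linux only."""
    return parse_mem_available_mb(Path(path).read_text())


# Made-up sample text for illustration, not taken from a real device.
SAMPLE = (
    "MemTotal:        7802880 kB\n"
    "MemFree:          512000 kB\n"
    "MemAvailable:    2097152 kB\n"
)
sample_mb = parse_mem_available_mb(SAMPLE)
print(sample_mb)
```

Calling host_mem_available_mb() inside snap() would put the host and CUDA numbers on the same log line, which makes the one-for-one tradeoff easy to confirm or refute.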
Root cause (best guess from reading the function)
prepare_model_for_kbit_training (in peft/utils/other.py) does several things:
- Sets requires_grad=False on all non-trainable params (cheap)
- Upcasts LayerNorm / RMSNorm / embed_norm params to fp32 ← suspect
- Registers a forward hook on the LM head to cast outputs back ← cheap
- Optionally enables gradient checkpointing
The fp32 upcast of norm layers is the prime suspect for the +1 GB CUDA reserved. On Mistral-7B-v0.3:
- 32 transformer layers × 2 RMSNorms/layer × 4096 hidden = ~262K bf16 elements upcast to fp32 in total (8,192 per layer)
- Total norm params upcast: ~0.5 MB in bf16, ~1 MB in fp32, so the norm weights themselves are negligible
- BUT the upcast triggers the CUDA allocator to reserve fresh fp32 buffers + temporary working memory, and the resulting reservation doesn't get freed
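A quick arithmetic check of the sizes involved may help narrow the suspect list. The sketch below assumes shapes that this issue does not state (vocabulary of 32768, untied input embedding and LM head, one final norm), so treat the numbers as an estimate. It also compares the norms against the embedding matrices, on the assumption that the function upcasts every non-quantized half-precision parameter rather than norms alone; that assumption should be verified against the installed peft/utils/other.py.

```python
MIB = 1024 * 1024

# Assumed shapes (not stated in the issue): 32 layers, hidden size 4096,
# vocabulary 32768, untied input embedding and LM head.
LAYERS, HIDDEN, VOCAB = 32, 4096, 32768


def mib(elements: int, bytes_per_element: int) -> float:
    """Size in MiB of a tensor with the given element count and width."""
    return elements * bytes_per_element / MIB


norm_elems = LAYERS * 2 * HIDDEN + HIDDEN  # two norms per layer plus a final norm
embed_elems = 2 * VOCAB * HIDDEN           # input embedding + LM head

for name, n in [("norms", norm_elems), ("embed + lm_head", embed_elems)]:
    bf16, fp32 = mib(n, 2), mib(n, 4)
    print(f"{name}: {n} elements, bf16 {bf16:.2f} MiB, "
          f"fp32 {fp32:.2f} MiB, delta {fp32 - bf16:.2f} MiB")
```

Under these assumptions the norms account for about 0.5 MiB of extra allocation, while upcasting the two embedding-sized matrices would add 512 MiB and need 1024 MiB of fresh fp32 buffers. That would line up with the observed +512 MB allocated and +1024 MB reserved, but it remains a hypothesis until checked against the actual function and model config.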
Expected behavior
At minimum: document this memory cost in the prepare_model_for_kbit_training docstring and the QLoRA / 4-bit fine-tuning guide.
Suggested fixes (in order of effort)
Fix 1 (docs only, easy): Add a note to the docstring + memory guide documenting the ~1 GB CUDA reserved overhead for 7B-class models.
Fix 2 (new kwarg, medium): Add upcast_norms: bool = True parameter so users on memory-constrained devices can opt into a lighter prep with upcast_norms=False, then upcast only the specific norms LoRA needs.
Fix 3 (memory-thrifty variant, harder): Introduce a lean variant that only upcasts norms in modules that have a LoRA adapter attached (after get_peft_model), frees intermediate buffers from the upcast operation, and documents the tradeoff.
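To make Fix 2 concrete, here is a rough sketch of what a lighter preparation step could look like. prepare_lean and its upcast_norms argument are hypothetical and not part of PEFT; matching norm parameters by the substring "norm" in their names is also an assumption that will not hold for every architecture. The sketch does not replicate the other things the real function does (gradient checkpointing, input-gradient setup).

```python
import torch
from torch import nn


def prepare_lean(model: nn.Module, upcast_norms: bool = True) -> nn.Module:
    """Hypothetical helper, not part of PEFT: freeze every parameter and
    optionally upcast only half-precision params whose name contains 'norm'."""
    for name, param in model.named_parameters():
        param.requires_grad = False
        is_half = param.dtype in (torch.float16, torch.bfloat16)
        if upcast_norms and is_half and "norm" in name.lower():
            # Same in-place pattern as a full upcast, restricted to norms.
            param.data = param.data.to(torch.float32)
    return model


class Toy(nn.Module):
    """Tiny CPU stand-in for a real model, just to show the effect."""

    def __init__(self) -> None:
        super().__init__()
        self.embed = nn.Embedding(16, 8)
        self.input_layernorm = nn.LayerNorm(8)


toy = prepare_lean(Toy().to(torch.bfloat16))
dtypes = {name: param.dtype for name, param in toy.named_parameters()}
print(dtypes)
```

On the toy module the embedding stays in bf16 while the norm weight and bias move to fp32. Whether training stays numerically stable with the remaining parameters left in bf16 is exactly the tradeoff that would need documenting.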
Impact
- Edge fine-tuning on Jetson (8 GB Orin Nano series, 16 GB Orin AGX): the +1 GB is the difference between fits-recipe and OOM-recipe for 7B-class FSDP1/FSDP2 + LoRA
- Consumer GPUs (RTX 4060 8GB, etc.): identical pressure surface
- Mac fine-tuning on Apple Silicon (unified memory): same applies
- Cloud spot/preemptible smaller-GPU tiers (T4 16GB, etc.): less affected, but still meaningful at 13B+ scale
Reproducibility
Our probe + cluster_tests harness reproduce this deterministically on Jetson Orin Nano Super 8GB across N≥5 runs. Happy to provide additional traces (CUDA memory snapshots, allocator-level forensics) if useful.
Related issues
- bitsandbytes issues #1633 (FSDP2 + AdamW8bit) and #1945 (FSDP2 + Linear4bit forward correctness): independent issues but in the same memory-budgeting regime
- prepare_model_for_kbit_training memory cost: please close as duplicate if one exists