prepare_model_for_kbit_training adds ~1 GB CUDA reserved memory in 500 ms: undocumented cost that breaks memory-constrained training on 8 GB unified-memory devices #3265

### Summary

`peft.prepare_model_for_kbit_training` adds approximately **1024 MB of CUDA reserved memory in 500 ms**, apparently during its fp32 upcast and gradient-flow setup. On 8 GB unified-memory edge accelerators such as the Jetson Orin Nano, this is about one eighth of total memory, and the cost is not documented in the PEFT docs or the function docstring. Other unified-memory platforms (Apple Silicon, AMD APUs) may see similar pressure, but the measurements here are from Jetson only.

I hit this repeatedly during Mistral-7B-v0.3 QLoRA experiments on an NVIDIA Jetson Orin Nano Super 8GB; the extra ~1 GB reserved by `prepare_model_for_kbit_training` is what pushes a recipe that would otherwise fit into OOM.

### Reproducer

Run on any CUDA host; the absolute number will vary slightly with model + driver + torch version, but the **~1 GB jump at the prepare step** is consistent.

```python
import json
import time
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

T0 = time.time()

def snap(phase: str) -> None:
    """Print one JSON line with elapsed time and CUDA allocator counters (MiB)."""
    print(json.dumps({
        "phase": phase,
        "elapsed_s": round(time.time() - T0, 2),
        "cuda_allocated_mb": torch.cuda.memory_allocated() // (1024 * 1024),
        "cuda_reserved_mb":  torch.cuda.memory_reserved()  // (1024 * 1024),
    }), flush=True)

# 4-bit NF4 quantization with double quantization and bf16 compute
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

snap("00_start")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_cfg,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    low_cpu_mem_usage=True,
)
snap("01_model_loaded")

# Gradient checkpointing is disabled so only the prepare step itself is measured
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)
snap("02_after_prepare_for_kbit_training")
```

### Observed output (Jetson Orin Nano Super 8GB, L4T 36.4.4, torch 2.5.1, bnb 0.46.1, peft 0.18.0, transformers 4.56.0)

```
{"phase": "00_start", "elapsed_s": 0.0, "cuda_allocated_mb": 0, "cuda_reserved_mb": 0}
{"phase": "01_model_loaded", "elapsed_s": 195.3, "cuda_allocated_mb": 3956, "cuda_reserved_mb": 4192}
{"phase": "02_after_prepare_for_kbit_training", "elapsed_s": 195.8, "cuda_allocated_mb": 4468, "cuda_reserved_mb": 5216}
```

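**Key delta**: `cuda_reserved_mb` jumps **4192 → 5216 MB (+1024 MB) in 500 ms** during `prepare_model_for_kbit_training`. `cuda_allocated_mb` jumps **3956 → 4468 MB (+512 MB)** in the same window. The reserved-minus-allocated gap grows from 236 MB to 748 MB, which suggests the CUDA caching allocator reserves extra blocks for the transient upcast operations and keeps them cached afterwards instead of returning them to the system. Calling `torch.cuda.empty_cache()` right after the prepare step should show how much of that gap is reclaimable.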
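
For anyone comparing their own runs, here is a minimal sketch that turns the JSON lines printed by `snap()` into per-phase deltas. The helper name `phase_deltas` is made up for this issue; the input is the observed output above, copied verbatim.

```python
import json

def phase_deltas(lines):
    """Return one dict per consecutive pair of snap() records with the deltas between them."""
    records = [json.loads(line) for line in lines if line.strip()]
    deltas = []
    for prev, cur in zip(records, records[1:]):
        deltas.append({
            "from": prev["phase"],
            "to": cur["phase"],
            "elapsed_s": round(cur["elapsed_s"] - prev["elapsed_s"], 2),
            "allocated_mb": cur["cuda_allocated_mb"] - prev["cuda_allocated_mb"],
            "reserved_mb": cur["cuda_reserved_mb"] - prev["cuda_reserved_mb"],
            # Memory held by the caching allocator but not backing a live tensor
            "reserved_minus_allocated_mb": cur["cuda_reserved_mb"] - cur["cuda_allocated_mb"],
        })
    return deltas

observed = [
    '{"phase": "00_start", "elapsed_s": 0.0, "cuda_allocated_mb": 0, "cuda_reserved_mb": 0}',
    '{"phase": "01_model_loaded", "elapsed_s": 195.3, "cuda_allocated_mb": 3956, "cuda_reserved_mb": 4192}',
    '{"phase": "02_after_prepare_for_kbit_training", "elapsed_s": 195.8, "cuda_allocated_mb": 4468, "cuda_reserved_mb": 5216}',
]

deltas = phase_deltas(observed)
for d in deltas:
    print(d)
```

The second record is the interesting one: +512 MB allocated, +1024 MB reserved, and 748 MB sitting in the allocator cache after the prepare step.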

For a 7B-class model on a discrete-GPU system with 24+ GB, this is irrelevant. For a unified-memory edge accelerator (Jetson SoC) with 8 GB shared between CPU and GPU, **+1 GB CUDA reserved = −1 GB MemAvailable to the host**, and on our Jetson tests this difference was the boundary between "training survives FSDP wrap" and "OOM-killer fires".
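
To check the host-side effect on a unified-memory board, `MemAvailable` can be logged next to the CUDA counters. This is a rough sketch assuming a Linux host that exposes `/proc/meminfo` in the usual `MemAvailable: <n> kB` format; the function names and the sample text are made up for illustration and are not measurements from the Jetson.

```python
import re

def mem_available_mb(meminfo_text: str) -> int:
    """Extract MemAvailable (in MiB) from text formatted like /proc/meminfo."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1)) // 1024

def read_mem_available_mb(path: str = "/proc/meminfo") -> int:
    """Read the live value from the host (Linux only)."""
    with open(path) as fh:
        return mem_available_mb(fh.read())

# Hypothetical sample text, only to show the parsing
sample = "MemTotal:        7990000 kB\nMemAvailable:    2097152 kB\nBuffers:           12345 kB\n"
print(mem_available_mb(sample))
```

Adding `read_mem_available_mb()` as an extra field in `snap()` would let the reserved-memory jump and the host-side drop be read off the same log line.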

### Root cause (best guess from reading the function)

`prepare_model_for_kbit_training` (in `peft/utils/other.py`) does several things:

1. Sets `requires_grad=False` on all non-trainable params (cheap)
2. **Upcasts `LayerNorm` / `RMSNorm` / `embed_norm` params to fp32** ← suspect
3. Registers a forward hook on the LM head to cast outputs back ← cheap
4. Optionally enables gradient checkpointing

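The fp32 upcast is the prime suspect for the +1 GB CUDA reserved, but the norm weights alone look too small to explain it. On Mistral-7B-v0.3:
- 32 transformer layers × 2 RMSNorms/layer × 4096 hidden ≈ 262K bf16 elements in total (plus one final norm of 4096 elements)
- Total norm params: ~0.5 MB in bf16, ~1 MB after the fp32 upcast
- That leaves nearly all of the +512 MB allocated unexplained by norms, so the upcast presumably touches other bf16 tensors as well. On top of that, the CUDA caching allocator reserves fresh fp32 blocks and appears to keep the freed bf16 blocks cached, which would explain why reserved grows by more than allocated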
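
A back-of-the-envelope check of the upcast cost, as a sketch under stated assumptions. It assumes the Mistral-7B-v0.3 config values (`hidden_size=4096`, 32 layers, `vocab_size=32768`, untied input and output embeddings) and, not verified in this issue, that the function upcasts every bf16 parameter that is not a 4-bit quantized weight, which would include `embed_tokens` and `lm_head`.

```python
MIB = 1024 * 1024

# Assumed Mistral-7B-v0.3 config values (taken from the model config, not measured here)
HIDDEN = 4096
LAYERS = 32
VOCAB = 32768

def upcast_delta_mib(num_elements: int) -> float:
    """Extra MiB needed when bf16 elements (2 bytes) are replaced by fp32 (4 bytes)."""
    return num_elements * (4 - 2) / MIB

norm_elements = (LAYERS * 2 + 1) * HIDDEN   # 2 RMSNorms per layer plus the final norm
embed_elements = VOCAB * HIDDEN             # input embedding matrix
lm_head_elements = VOCAB * HIDDEN           # output projection (assumed untied)

norm_delta = upcast_delta_mib(norm_elements)
embed_delta = upcast_delta_mib(embed_elements)
lm_head_delta = upcast_delta_mib(lm_head_elements)
total_delta = norm_delta + embed_delta + lm_head_delta

print(f"norms:        +{norm_delta:.2f} MiB")     # +0.51 MiB
print(f"embed_tokens: +{embed_delta:.2f} MiB")    # +256.00 MiB
print(f"lm_head:      +{lm_head_delta:.2f} MiB")  # +256.00 MiB
print(f"total:        +{total_delta:.2f} MiB")    # +512.51 MiB
```

If those assumptions hold, the embedding and output matrices would account for essentially all of the observed +512 MB allocated, and the norms would be a rounding error. Maintainers can confirm or correct this from the actual code path.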

### Expected behavior

At minimum: **document this memory cost** in the `prepare_model_for_kbit_training` docstring and the QLoRA / 4-bit fine-tuning guide.

### Suggested fixes (in order of effort)

**Fix 1 (docs only, easy):** Add a note to the docstring + memory guide documenting the ~1 GB CUDA reserved overhead for 7B-class models.

**Fix 2 (new kwarg, medium):** Add `upcast_norms: bool = True` parameter so users on memory-constrained devices can opt into a lighter prep with `upcast_norms=False`, then upcast only the specific norms LoRA needs.

**Fix 3 (memory-thrifty variant, harder):** Introduce a lean variant that only upcasts norms in modules that have a LoRA adapter attached (after `get_peft_model`), frees intermediate buffers from the upcast operation, and documents the tradeoff.

### Impact

- **Edge fine-tuning on Jetson** (8 GB Orin Nano series, 16 GB Orin NX): the +1 GB is the difference between a recipe that fits and one that OOMs for 7B-class FSDP1/FSDP2 + LoRA
- **Consumer GPUs** (RTX 4060 8GB, etc.): the same 8 GB budget, so the same pressure applies
- **Mac fine-tuning on Apple Silicon** (unified memory): likely similar, but not measured here
- **Cloud spot/preemptible smaller-GPU tiers** (T4 16GB, etc.): less affected, but still meaningful at 13B+ scale

### Reproducibility

Our probe + cluster_tests harness reproduce this deterministically on Jetson Orin Nano Super 8GB across N≥5 runs. Happy to provide additional traces (CUDA memory snapshots, allocator-level forensics) if useful.

### Related issues

- `bitsandbytes` issues #1633 (FSDP2 + AdamW8bit) and #1945 (FSDP2 + Linear4bit forward correctness) — independent issues but in the same memory-budgeting regime
- No prior PEFT issue we could find about `prepare_model_for_kbit_training` memory cost — please close as duplicate if one exists
