Skip to content

add: Low-VRAM Avatar 1.5 inference for single RTX 4090#115

Open
1TommyCheung wants to merge 3 commits into
meituan-longcat:mainfrom
1TommyCheung:main
Open

add: Low-VRAM Avatar 1.5 inference for single RTX 4090#115
1TommyCheung wants to merge 3 commits into
meituan-longcat:mainfrom
1TommyCheung:main

Conversation

@1TommyCheung

@1TommyCheung 1TommyCheung commented May 25, 2026

Copy link
Copy Markdown

Summary

  • Single-GPU inference script (run_demo_avatar_single_lowmem.py) that runs LongCat-Video-Avatar-1.5 on a single RTX 4090 (24GB VRAM, 32GB RAM) — the original code requires 2× A100-80GB.
  • ~116x speedup over the original INT8 inference (29 min/step → 14-15s/step) through sequential model offloading + torchao optimized quantization + torch.compile kernel fusion.
  • Zero model code changes — all optimizations are in the inference script only.

Key techniques

  1. Sequential model offloading: loads text encoder (22GB) → encodes prompt → frees → loads audio encoder (3GB) → encodes → frees → loads VAE + INT8 DiT → inference. Peak VRAM stays under 24GB.
  2. Streaming INT8 shard loading: loads 4×4GB weight shards one at a time into the model skeleton, avoiding 30GB+ RAM spike from loading the full state dict.
  3. torchao int8_weight_only(): replaces the naive QuantizedLinear dequant→matmul pattern with fused CUTLASS INT8 tensor core kernels. 8.8× speedup.
  4. torch.compile(mode="max-autotune-no-cudagraphs"): kernel fusion + Triton autotuning. Additional 3.6× on cached runs.
  5. Persistent kernel cache (TORCHINDUCTOR_CACHE_DIR): first run compiles ~7 min, subsequent runs skip.
  6. FP8 weight-only option (--use_fp8): uses Ada Lovelace FP8 tensor cores for ~7% additional speedup.

Performance

Metric Original This PR
Min hardware 2× A100-80GB 1× RTX 4090 24GB
Per denoising step 29 min 14s (FP8) / 15s (INT8)
3.7s video (1 segment) 3h 52m ~2 min
12s video (4 segments) 15.5 hours ~70 min

Usage

# Conda environment setup
conda create -n longcat-video python=3.10
conda activate longcat-video
pip install torch==2.7.1+cu126 torchvision==0.22.1+cu126 torchaudio==2.7.1+cu126 --index-url https://download.pytorch.org/whl/cu126
pip install flash-attn --no-build-isolation --no-binary flash-attn
pip install -r requirements.txt && pip install -r requirements_avatar.txt
pip install torchao accelerate

# Run Avatar 1.5 (single GPU, INT8)
torchrun --nproc_per_node=1 run_demo_avatar_single_lowmem.py \
  --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
  --stage_1=ai2v \
  --input_json=assets/avatar/single_example_1.json \
  --use_torchao

# Run with FP8 (RTX 4090 Ada Lovelace)
torchrun --nproc_per_node=1 run_demo_avatar_single_lowmem.py \
  --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
  --stage_1=ai2v \
  --input_json=assets/avatar/single_example_1.json \
  --use_torchao --use_fp8

# Docker
docker compose build
docker compose run --service-ports longcat-video \
  run_demo_avatar_single_lowmem.py \
  --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
  --stage_1=ai2v --use_torchao

Test plan

  • Single segment (ai2v) generates correct 93-frame video with audio sync
  • Multi-segment (4 segments, 12s) video continuation works
  • INT8 and FP8 quantization both produce valid output
  • Kernel cache persists across runs (.cache/torch_inductor/)
  • No model code modifications — all original files unchanged

- run_demo_avatar_single_lowmem.py: sequential model offloading (text encoder → audio → DiT),
  streaming INT8 shard loading, torchao int8_weight_only + torch.compile optimization.
  Achieves ~116x speedup (29min → 15s per denoising step).
- Dockerfile + docker-compose.yml for containerized GPU inference.
- No model code changes — all optimizations in the inference script.

Performance on RTX 4090:
  Segment 1 (3.7s video): ~2 min (cached kernels)
  12s video (4 segments): ~70 min (vs 15.5h original)
FP8 weight-only via torchao float8_weight_only() as alternative to INT8.
Benchmarked on RTX 4090: ~7% faster (14s vs 15s/step), same VRAM, no quality loss.
Use --use_torchao --use_fp8 to enable.
@1TommyCheung 1TommyCheung changed the title add: Low-VRAM Avatar 1.5 inference for single RTX 4090 (24GB) add: Low-VRAM Avatar 1.5 inference for single RTX 4090 May 25, 2026
Previously accumulated all frames in memory and re-encoded everything
after each segment. By segment 20+, this caused 424s/step (vs 26s normal)
from RAM pressure and 40+ min FFmpeg encodes per segment.

Now saves each segment independently (~80 frames, ~2s encode) and
concatenates at the end via ffmpeg -concat (instant, copy-only).
Expected: constant 26s/step and ~87 min total for 25 segments.
@aamoronatti

Copy link
Copy Markdown

I tested this PR on a RunPod H100 SXM 80GB with LongCat-Video-Avatar 1.5, PyTorch 2.6.0+cu124, Python 3.10.

A few notes:

  1. torchao is not pinned. The latest torchao==0.17.0 does not work with PyTorch 2.6.0:
    AttributeError: module 'torch.utils._pytree' has no attribute 'register_constant'.
    I had to use torchao==0.10.0, which exposes int8_weight_only, float8_weight_only, and quantize_.

  2. I could not reproduce the advertised speedup for actual generation time once weights/audio are loaded.

Benchmark: 480p, 1 segment, Avatar 1.5, distill 8 steps, same image/audio/prompt, measuring only the generate_ai2v call:

Mode Generation time
Official script + --use_int8 62.7s
PR lowmem script, no torchao 62.1s
PR + torchao INT8, first run 397.6s
PR + torchao INT8, second warm run 64.6s
PR + torchao FP8, first run 429.1s
PR + torchao FP8, second warm run 65.5s

So in this setup, torchao/torch.compile did not speed up generation. The first run is much slower due to compilation/autotune, and the warm run is slightly slower than the official INT8 path.

The per-segment encoding/concat change may still be useful for long videos because the official script re-encodes accumulated frames after each segment, which can cause avoidable overhead. But that is different from accelerating denoising/generation.

Could you clarify what exactly was benchmarked for the claimed 29min -> 15s per denoising step speedup?

@1TommyCheung

Copy link
Copy Markdown
Author

Thanks for the careful benchmark — this is really useful, and it actually lines up with our results once two things are pinned down. You're right that the headline framing was misleading; let me correct it.

1. Unit mismatch (per-step vs per-generation). Our numbers are per denoising step; yours are total generate_ai2v (= 8 distill steps). Reconciling:

  • Your H100: 62s / 8 ≈ 7.8 s/step
  • Our 4090 + torchao: 14–15 s/step

That ~2× is just H100-vs-4090 hardware — so our "fast" numbers actually agree with yours. (Sanity check on the convention: 29 min/step × 8 = 3h52m, matching the "original" row in the table.)

2. The 29 min was an environment artifact, not a torchao compute win — you're correct. Same no-torchao code: your H100 = 62s, our 4090 = 29 min. That 28× can't be hardware (bandwidth/compute differ ~3×). The cause is memory regime:

  • The repo's QuantizedLinear.forward rematerializes a full bf16 weight from int8 on every forward.
  • We run on a 24GB 4090 under WSL2, where the driver's sysmem-fallback is on by default: once the working set exceeds VRAM it spills to host RAM over PCIe instead of OOMing. The per-forward bf16 materialization tips it over 24GB, so weights page GPU↔host every step → 29 min.
  • torchao's int8_weight_only fuses dequant into the GEMM epilogue and never materializes the bf16 weight, so the working set stays resident → 15 s/step.
  • On your 80GB H100 there's no spill to fix, so torchao is correctly a no-op (and slightly slower warm + a one-time compile cost). Your result isn't contradicting the PR — it's confirming the benefit is memory-pressure relief specific to ≤24GB cards, not a kernel speedup.

So the accurate claim is narrower: this lets Avatar 1.5 run on a single 24GB GPU at ~15 s/step, where the repo's own INT8 path degrades badly under VRAM pressure; on a GPU with headroom, stock INT8 is already fine and torchao adds nothing. I'll reword the PR description and table accordingly and gate the speedup claim to the 24GB case.

3. Pinning — agreed, this is a real gap. torchao isn't pinned and requirements.txt is inconsistent (pins torch==2.6.0 while the usage block installs 2.7.1). I'll pin a working matrix: our validated combo is torch==2.7.1 + torchao==0.7; your torch==2.6.0 + torchao==0.10.0 is a second known-good point. 0.17.0 is broken on 2.6.0 as you found.

4. Per-segment encoding — agreed it's the portable win independent of the GPU, and I'll keep that framed separately from generation time.

Thanks again — this tightens the PR considerably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants