add: Low-VRAM Avatar 1.5 inference for single RTX 4090 #115
TommyCheung wants to merge 3 commits into
- run_demo_avatar_single_lowmem.py: sequential model offloading (text encoder → audio → DiT), streaming INT8 shard loading, torchao int8_weight_only + torch.compile optimization. Achieves ~116x speedup (29min → 15s per denoising step).
- Dockerfile + docker-compose.yml for containerized GPU inference.
- No model code changes: all optimizations are in the inference script.

Performance on RTX 4090:
- Segment 1 (3.7s video): ~2 min (cached kernels)
- 12s video (4 segments): ~70 min (vs 15.5h original)
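The sequential offloading named in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the stage order comes from the commit message, while `run_stages`, the loader callables, and the `resident` bookkeeping are hypothetical names introduced here. The idea is that only one stage is held at a time and its memory is released before the next one loads.

```python
import gc
from typing import Any, Callable, Dict, List, Tuple

try:  # torch is optional so the sketch also runs on a CPU-only machine
    import torch
except ImportError:
    torch = None

# (name, loader, runner): loader builds the model, runner may read earlier outputs.
Stage = Tuple[str, Callable[[], Any], Callable[[Any, Dict[str, Any]], Any]]


def run_stages(stages: List[Stage]) -> Tuple[Dict[str, Any], int]:
    """Run stages one at a time; return outputs and the peak count of resident models."""
    outputs: Dict[str, Any] = {}
    resident: List[str] = []
    peak = 0
    for name, load, run in stages:
        model = load()                   # bring only this stage into memory
        resident.append(name)
        peak = max(peak, len(resident))
        outputs[name] = run(model, outputs)
        del model                        # drop the reference before the next load
        resident.remove(name)
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()     # release cached CUDA blocks
    return outputs, peak


# Hypothetical stand-ins for the text encoder, audio encoder, and DiT.
stages: List[Stage] = [
    ("text_encoder", lambda: "T", lambda m, out: f"{m}:text"),
    ("audio", lambda: "A", lambda m, out: f"{m}:audio"),
    ("dit", lambda: "D", lambda m, out: f"{m}:{out['text_encoder']}+{out['audio']}"),
]
outputs, peak = run_stages(stages)
```

With real models, each stage's outputs (embeddings, latents) would need to be moved off the device or kept small enough to coexist with the next stage.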
FP8 weight-only via torchao float8_weight_only() as alternative to INT8. Benchmarked on RTX 4090: ~7% faster (14s vs 15s/step), same VRAM, no quality loss. Use --use_torchao --use_fp8 to enable.
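A minimal sketch of how a `--use_fp8` switch could choose between the two torchao weight-only configs named in this PR. The helper names `weight_only_config_name` and `quantize_and_compile` are hypothetical, and the sketch assumes a torchao version that still exports `int8_weight_only` and `float8_weight_only` from `torchao.quantization`; the actual script may wire this differently.

```python
def weight_only_config_name(use_fp8: bool) -> str:
    """Name of the torchao weight-only config to apply (both appear in this PR)."""
    return "float8_weight_only" if use_fp8 else "int8_weight_only"


def quantize_and_compile(model, use_fp8: bool = False):
    """Quantize weights in place with torchao, then wrap the model in torch.compile."""
    # Imported lazily so the helper above stays usable without the GPU stack.
    import torch
    import torchao.quantization as tq

    config_factory = getattr(tq, weight_only_config_name(use_fp8))
    tq.quantize_(model, config_factory())  # in-place weight-only quantization
    return torch.compile(model, mode="max-autotune-no-cudagraphs")
```

Keeping the selection in a small pure function makes the flag easy to check without loading a model.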
Previously, the script accumulated all frames in memory and re-encoded everything after each segment. By segment 20+, this caused 424s/step (vs 26s normal) from RAM pressure and 40+ min FFmpeg encodes per segment. It now saves each segment independently (~80 frames, ~2s encode) and concatenates them at the end with ffmpeg concat (instant, stream copy only). Expected: constant 26s/step and ~87 min total for 25 segments.
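One way to do the final copy-only join is ffmpeg's concat demuxer, sketched below. This assumes the demuxer is what the script uses, which the commit message does not confirm; the function names and segment file names are hypothetical. Stream copy only works when every segment shares the same codec parameters.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Sequence


def write_concat_list(segment_paths: Sequence[str], list_path: str) -> str:
    """Write the list file the concat demuxer reads: one `file '<path>'` line per segment."""
    lines = [f"file '{Path(p).as_posix()}'" for p in segment_paths]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return list_path


def concat_command(list_path: str, output_path: str) -> List[str]:
    """Build the ffmpeg call; `-c copy` joins the streams without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]


# Example with two hypothetical segment files.
tmp_dir = tempfile.mkdtemp()
list_file = write_concat_list(["seg_000.mp4", "seg_001.mp4"],
                              os.path.join(tmp_dir, "segments.txt"))
cmd = concat_command(list_file, "final.mp4")
# subprocess.run(cmd, check=True) would perform the actual join.
```

Relative paths in the list file are resolved against the list file's own directory, and paths containing single quotes would need escaping, which this sketch does not handle.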
I tested this PR on a RunPod H100 SXM 80GB with LongCat-Video-Avatar 1.5, PyTorch 2.6.0+cu124, Python 3.10. A few notes:
Benchmark: 480p, 1 segment, Avatar 1.5, distill 8 steps, same image/audio/prompt, measuring only the
So in this setup,

The per-segment encoding/concat change may still be useful for long videos because the official script re-encodes accumulated frames after each segment, which can cause avoidable overhead. But that is different from accelerating denoising/generation. Could you clarify what exactly was benchmarked for the claimed
Thanks for the careful benchmark. This is really useful, and it actually lines up with our results once two things are pinned down. You're right that the headline framing was misleading; let me correct it.

1. Unit mismatch (per-step vs per-generation). Our numbers are per denoising step; yours are total

That ~2× is just H100-vs-4090 hardware, so our "fast" numbers actually agree with yours. (Sanity check on the convention: 29 min/step × 8 = 3h52m, matching the "original" row in the table.)

2. The 29 min was an environment artifact, not a torchao compute win. You're correct. Same no-torchao code: your H100 = 62s, our 4090 = 29 min. That 28× can't be hardware (bandwidth/compute differ ~3×). The cause is memory regime:

So the accurate claim is narrower: this lets Avatar 1.5 run on a single 24GB GPU at ~15 s/step, where the repo's own INT8 path degrades badly under VRAM pressure; on a GPU with headroom, stock INT8 is already fine and torchao adds nothing. I'll reword the PR description and table accordingly and gate the speedup claim to the 24GB case.

3. Pinning: agreed, this is a real gap.

4. Per-segment encoding: agreed, it's the portable win independent of the GPU, and I'll keep that framed separately from generation time.

Thanks again. This tightens the PR considerably.
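The per-step versus per-generation convention can be checked with a few lines of arithmetic. This is only a sanity check on figures quoted in the thread (8 distilled steps, 29 min/step, 15 s/step, 4 segments); the helper names are made up for illustration.

```python
STEPS = 8  # distilled step count from the reviewer's benchmark setup


def per_generation_seconds(per_step_seconds: float, steps: int = STEPS) -> float:
    """Convert a per-denoising-step time into a per-segment generation time."""
    return per_step_seconds * steps


def as_h_m(seconds: float) -> str:
    """Format seconds as XhYYm, rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


slow_segment = per_generation_seconds(29 * 60)   # degraded path, one segment
fast_segment = per_generation_seconds(15)        # torchao path, one segment
speedup = slow_segment / fast_segment            # same ratio per step or per segment
slow_four_segments_hours = 4 * slow_segment / 3600

print(as_h_m(slow_segment), fast_segment, speedup, round(slow_four_segments_hours, 1))
```

The ratio is the same whichever unit is used, so the mismatch only matters when a per-step number is compared against someone else's per-generation number.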
Summary
Adds a low-VRAM inference script (run_demo_avatar_single_lowmem.py) that runs LongCat-Video-Avatar-1.5 on a single RTX 4090 (24GB VRAM, 32GB RAM). The original code requires 2× A100-80GB.

Key techniques
- torchao int8_weight_only(): replaces the naive QuantizedLinear dequant→matmul pattern with fused CUTLASS INT8 tensor core kernels. 8.8× speedup.
- torch.compile(mode="max-autotune-no-cudagraphs"): kernel fusion + Triton autotuning. Additional 3.6× on cached runs.
- Persistent compile cache (TORCHINDUCTOR_CACHE_DIR): first run compiles ~7 min, subsequent runs skip.
- FP8 weight-only (--use_fp8): uses Ada Lovelace FP8 tensor cores for ~7% additional speedup.

Performance
Usage
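As a hedged illustration of the persistent kernel cache, the snippet below points `TORCHINDUCTOR_CACHE_DIR` at a directory that survives between runs. The function name is hypothetical and the default path is an assumption based on the `.cache/torch_inductor/` directory mentioned in the test plan; the script itself may set this elsewhere, for example in the Dockerfile.

```python
import os
from pathlib import Path


def configure_inductor_cache(cache_dir: str = ".cache/torch_inductor") -> str:
    """Point TorchInductor's on-disk cache at a persistent directory and return the path in effect."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    # setdefault keeps any value the user already exported in the shell.
    os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", str(path))
    return os.environ["TORCHINDUCTOR_CACHE_DIR"]
```

Calling `configure_inductor_cache()` before importing torch is the conservative ordering, so the variable is in place before anything is compiled.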
Test plan
.cache/torch_inductor/)