```bibtex
@article{wang2026stac,
  title={STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction},
  author={Wang, Runze and Song, Yuxuan and Cai, Youcheng and Liu, Ligang},
  journal={arXiv preprint arXiv:2603.20284},
  year={2026}
}
```

STAC is a plug-and-play KV-cache compression framework for memory-efficient streaming 3D reconstruction over long videos. It compresses evicted KV-cache tokens into a spatio-temporal voxel memory and retrieves relevant pivots on demand, enabling bounded-memory long-range spatial reasoning.
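Conceptually, each streaming step works against a bounded cache: tokens of the new frame attend over a recent window plus a few pivots retrieved from the voxel memory, and frames that fall out of the window are merged into that memory rather than discarded. A minimal pure-Python sketch of this loop (all names hypothetical, not the repo's API):

```python
from collections import deque

class BoundedStreamCache:
    """Toy model of STAC's bounded-memory loop: sliding window + voxel memory.

    Tokens here are just (frame_id, voxel list) pairs; the real system stores
    key/value tensors and merges them per 3D voxel.
    """

    def __init__(self, window=4, retrieve=2):
        self.window = deque(maxlen=window)  # recent frames (oldest evicted first)
        self.voxel_memory = {}              # voxel coord -> merged token count
        self.retrieve = retrieve

    def step(self, frame_id, voxels):
        # 1) Retrieve a few relevant pivots from the voxel memory.
        pivots = sorted(self.voxel_memory, key=self.voxel_memory.get,
                        reverse=True)[: self.retrieve]
        # 2) Attend over window + pivots (stand-in for sparse attention).
        context = list(self.window) + pivots
        # 3) When the window is full, merge the oldest frame into voxel memory
        #    before the deque drops it on append.
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]
            for v in evicted[1]:
                self.voxel_memory[v] = self.voxel_memory.get(v, 0) + 1
        self.window.append((frame_id, voxels))
        return context

cache = BoundedStreamCache(window=2, retrieve=1)
for t in range(5):
    cache.step(t, voxels=[(t, 0, 0)])
# Window stays bounded; evicted frames live on as merged voxel entries.
print(len(cache.window), len(cache.voxel_memory))
```

The point of the sketch is the invariant: the window never grows past its maximum, while history accumulates only as compact voxel entries.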
Feed-forward 3D models such as STream3R and StreamVGGT scale poorly on long videos due to unbounded KV-cache growth: every new frame appends tokens to the cache, so memory and attention cost grow with video length until the GPU runs out of memory.
| Capability | Causal | Window | STAC (Ours) |
|---|---|---|---|
| Attention | All frames | Sliding window | Window + voxel retrieval |
| Memory scaling | Grows with frame count | Bounded | Bounded |
| Long-video support | ✗ (OOM) | ✗ (no history) | ✓ (with spatial memory) |
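A back-of-the-envelope estimate makes the scaling row concrete. The numbers below are illustrative assumptions (ViT-L with 24 layers and hidden dim 1024, as in the adapter; 518×518 input at patch size 14; fp16 K+V), not measured values:

```python
# Illustrative KV-cache sizing; all constants are assumptions, not measurements.
tokens_per_frame = (518 // 14) ** 2       # 37x37 = 1369 patch tokens
bytes_per_token_layer = 2 * 1024 * 2      # K + V, hidden dim 1024, fp16
per_frame_mb = tokens_per_frame * 24 * bytes_per_token_layer / 2**20

causal_1000_frames_gb = per_frame_mb * 1000 / 1024  # grows linearly with T
window_4_frames_gb = per_frame_mb * 4 / 1024        # bounded by the window

print(f"per frame: {per_frame_mb:.0f} MB")
print(f"causal, 1000 frames: {causal_1000_frames_gb:.0f} GB")
print(f"window of 4 frames: {window_4_frames_gb:.2f} GB")
```

Under these assumptions a causal cache crosses 100 GB well before 1000 frames, while a 4-frame window stays around half a gigabyte, which is why the causal column hits OOM and the bounded columns do not.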
Overview: STAC with Causal-VGGT and runtime-memory scaling.
Supported backbones (switch via `--base_model`): STream3R (`stream3r`) and StreamVGGT (`streamvggt`).
- Plug-and-play: switch backbones via `--base_model` with no code changes.
- Memory-constrained: temporal cache + voxel memory keeps KV growth bounded on long streams.
- Efficient inference: chunk-based `StreamSession` with optional CUDA backends for stable latency and higher throughput.
```
STAC/
├── main.py                  # Minimal inference entry point
├── model_wrapper.py         # load_model() / run_model() API
├── stream_session.py        # Chunk-by-chunk streaming session
├── requirements.txt
├── stac/                    # STAC KV-cache compression (plug-and-play)
│   ├── kv_manager.py        # Sliding window + H2O token selection
│   ├── h2o.py               # Heavy-hitter scoring
│   ├── stac_voxel.py        # Voxel pool: evict -> merge -> retrieve
│   ├── merger.py            # Token merging operations
│   ├── voxel.py             # Voxel grid utilities
│   ├── allocator.py         # Static / slab / segment allocators
│   └── flash_attn_triton.py # Triton attention kernel
├── causalvggt/              # Causal-VGGT adapter
├── attn-cuda/               # Custom CUDA attention kernel
└── merger-cuda/             # Custom CUDA merger kernel
```
```
┌──────────────────────────────────────────────────────────┐
│                Backbone (interchangeable)                │
│  ┌─────────────┐   ┌─────────────┐   ┌────────────────┐  │
│  │  STream3R   │   │ StreamVGGT  │   │    (others)    │  │
│  └──────┬──────┘   └──────┬──────┘   └───────┬────────┘  │
│         └─────────────────┼──────────────────┘           │
│                           ▼                              │
│              CausalVGGT Adapter (vggt.py)                │
│  ┌────────────────────────────────────────────────────┐  │
│  │ CausalAggregator (24-layer ViT-L)                  │  │
│  │   └─ SparseAttention -> kv_manager (registered)    │  │
│  │ CameraHead -> extrinsic + intrinsic                │  │
│  │ DPTHead (x2) -> depth map + point map              │  │
│  └────────────────────────────────────────────────────┘  │
└─────────────────────────────┬────────────────────────────┘
                              │ KV pairs
                              ▼
┌──────────────────────────────────────────────────────────┐
│ STAC KV-Cache (stac/)                  <- plug-and-play  │
│  ┌────────────────────────────────────────────────────┐  │
│  │ KVManager: sliding window (recent + pinned)        │  │
│  │   ├─ H2O heavy-hitter selection                    │  │
│  │   └─ STACVoxel 3D pool: evict -> merge -> retrieve │  │
│  └────────────────────────────────────────────────────┘  │
└─────────────────────────────┬────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│ StreamSession (stream_session.py)                        │
│ Chunk-by-chunk inference + prediction accumulation       │
└──────────────────────────────────────────────────────────┘
```
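The diagram's "sliding window (recent + pinned)" policy can be illustrated with a toy eviction rule: keep the most recent `window` frames plus any pinned frames, and hand everything else to the voxel pool. The function below is an illustrative sketch, not the repo's `KVManager` API:

```python
# Toy eviction order for a sliding-window KV cache with pinned frames.
def evict_candidates(cached_frames, window, pinned):
    """Return the frames to evict so that only the `window` most recent
    frames plus the pinned frames remain in the cache."""
    recent = set(cached_frames[-window:])
    keep = recent | set(pinned)
    return [f for f in cached_frames if f not in keep]

frames = list(range(10))  # frames 0..9 currently cached
print(evict_candidates(frames, window=4, pinned=[0]))  # → [1, 2, 3, 4, 5]
# Frame 0 stays pinned; frames 6..9 survive as the recent window.
```

In the real pipeline the evicted frames are not dropped: their KV tokens flow into the H2O selection and the voxel pool shown in the middle box.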
Tested GPUs: NVIDIA RTX 3090 (24 GB) and A100 (40 GB).
```bash
git clone https://github.com/Rainzor/STAC.git
cd STAC
conda create -n stac python=3.11 cmake=3.14.0 -y
conda activate stac
```

Install PyTorch for your CUDA version (e.g. cu128 or cu118), then the dependencies:

```bash
# Example: CUDA 12.8
pip install torch==2.7.0+cu128 torchvision==0.22.0+cu128 torchaudio==2.7.0+cu128 \
    --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

`merger-cuda` is an optional CUDA extension for faster voxel merging (`--voxel_backend cuda`).
Build from repo root (with CUDA_HOME set):
```bash
pip install -e merger-cuda --no-build-isolation
```

`attn-cuda` is an optional CUDA extension used by STAC attention decoding. It provides:

- FlashAttention forward (`out`, `lse`)
- Optional vector bias (`[B,H,N]` / `[B,H,1,N]` / `[1,H,1,N]`)
- Optional column-sum (`colsum`) for retrieval scoring
- Optional colsum subsampling (`subsample_ratio`)
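The `lse` and `colsum` outputs have standard definitions that a naive reference computation makes explicit: `lse` is the per-query log-sum-exp of scaled scores, and `colsum` is the per-key sum of attention probabilities over queries (the statistic used for retrieval/heavy-hitter scoring). A dependency-free sketch of those definitions (the CUDA kernel computes the same quantities fused):

```python
import math

def attention_reference(q, k, v):
    """Naive single-head attention returning (out, lse, colsum).

    lse[i]    = logsumexp_j(q_i . k_j / sqrt(d))   per query
    colsum[j] = sum_i softmax(q K^T / sqrt(d))_ij  per key
    """
    d = len(q[0])
    out, lse = [], []
    colsum = [0.0] * len(k)
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)                     # stabilised logsumexp
        l = m + math.log(sum(math.exp(s - m) for s in scores))
        lse.append(l)
        probs = [math.exp(s - l) for s in scores]  # softmax row
        for j, p in enumerate(probs):
            colsum[j] += p
        out.append([sum(p * vj[t] for p, vj in zip(probs, v))
                    for t in range(len(v[0]))])
    return out, lse, colsum

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
out, lse, colsum = attention_reference(q, k, v)
# Each softmax row sums to 1, so colsum over all keys sums to the query count.
print(round(sum(colsum), 6))  # → 2.0
```

Keys with large `colsum` attract attention from many queries, which is exactly what makes them worth keeping or retrieving.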
Build from repo root:

```bash
pip install -e attn-cuda --no-build-isolation
```

Prepare checkpoints and datasets in the following layout so the evaluation and inference scripts can find them directly.
Place backbone weights under `ckpt/{stream3r|streamvggt}/` as `model.safetensors`, `model.pt`, or `model.pth` (auto-detected by `model_wrapper.py`).
| Backbone | Hugging Face |
|---|---|
| STream3R | yslan/STream3R (default) |
| StreamVGGT | lch01/StreamVGGT |
```bash
# Download at least one backbone (run from repo root)
mkdir -p ckpt/stream3r && hf download yslan/STream3R --local-dir ckpt/stream3r
# StreamVGGT: mkdir -p ckpt/streamvggt && hf download lch01/StreamVGGT --local-dir ckpt/streamvggt
# Use HF_ENDPOINT=https://hf-mirror.com for mirrors.
```

Put scenes under `data/` with layout `data/<dataset>/<scene>/images/*.png`, for example `data/7scenes/chess/images/`.

- Supported datasets: `7scenes`, `neural_rgbd`, `DTU`, `tum`, `scannet`, `sintel`, `bonn`, `kitti`; preprocessing follows CUT3R
- Ready-to-use evaluation sets are available on 🤗 Hugging Face datasets
Suggested layout:
```
STAC/                        # run all commands from repo root
├── ckpt/
│   ├── stream3r/
│   │   └── model.safetensors    # or model.pt / model.pth
│   └── streamvggt/
│       └── model.safetensors
├── data/
│   └── <dataset>/<scene>/images/*.png
├── eval_recon/              # 3D recon output (created by launch)
├── eval_cam_results/        # pose output
└── eval_depth/              # depth output
```
Run from repo root. Example script:
```python
import torch
from pathlib import Path

from eval.utils.image import load_scene_images
from model_wrapper import load_model, run_model

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

scene_dir = Path("data/neural_rgbd/whiteroom")  # should contain images/*.png or *.jpg
# Use the same resize/crop pipeline as eval launch scripts.
images = load_scene_images(scene_dir, size=518)[::10]

# Pick any supported backbone: "stream3r" or "streamvggt"
model = load_model("causalvggt", base_model="stream3r", device=device)
# Optional: override checkpoint location (file or directory)
# model = load_model("causalvggt", base_model="stream3r", device=device, model_path="/path/to/model.pth")

# mode="stac" auto-enables streaming + recommended STAC params
with torch.no_grad(), torch.amp.autocast(device_type="cuda", dtype=dtype):
    predictions = run_model(
        model=model,
        images=images,
        model_name="causalvggt",
        mode="stac",
        streaming=True,
        dtype=dtype,
        device=device,
        pinned=[0],  # required by streaming STAC path
    )

# predictions keys: extrinsic, intrinsic, depth, depth_conf,
# world_points, world_points_conf, timing, merger, ...
```

Switching backbones is a one-line change:
```python
# Use StreamVGGT backbone instead
model = load_model("causalvggt", base_model="streamvggt", device=device)
```

`main.py` provides a minimal inference example on a scene folder. Scene dirs need an `images/` subfolder with `.png` or `.jpg` files. Eval scripts (see Evaluation) add dataset loading and metrics on top of the same interface.
```bash
# Minimal run (default mode is STAC)
python main.py --scene_dir /path/to/scene

# Full attention baseline (no streaming)
python main.py --scene_dir /path/to/scene --mode full

# Explicit STAC configuration (equivalent to --mode stac)
python main.py --scene_dir /path/to/scene \
    --base_model stream3r --streaming \
    --mode window_chunk_merge \
    -win 4 -ck 4 -hh 2 -ret_sz 2 -ret_buf

# Use StreamVGGT backbone
python main.py --scene_dir /path/to/scene --base_model streamvggt --mode stac
```

**Command Line Arguments for `main.py`**
- `--scene_dir`: Path to the scene directory containing an `images/` subfolder. Required.
- Output directory for saved results. Default: same as `--scene_dir`.
- `--base_model`: Backbone weights to use: `stream3r` or `streamvggt`. Default: `stream3r`.
- Input resolution. Choices: 224, 512, 518. Default: 518.
- Frame sampling interval: sample every k frames for limited-memory inference. Default: 10.
- `--mode`: Attention mode (`stac`, `full`, `causal`, `window_kv`, `window_chunk_merge`, ...). Default: `stac`.
- `--streaming`: Enable frame-by-frame streaming via `StreamSession`. Off by default (auto-enabled by `--mode stac`).
- Autocast dtype: `auto`, `fp16`, or `bf16`. `auto` selects bf16 on Ampere+ GPUs, otherwise fp16. Default: `auto`.
- `-win`: Sliding KV window size in frames. Default: 0.
- `-ck`: Number of frames per forward pass. Default: 1.
- `-hh`: Heavy-hitter frames kept by H2O. Default: 0.
- `-ret_sz`: Voxel pivots retrieved per step. Default: 0.
- `-ret_buf`: Include retrieved pivots in the returned buffer. Off by default.
- Frame indices pinned in the KV cache. Default: `[0]`.
- H2O score temperature. Default: 0.9.
- `--attn_backend`: Sparse decode attention backend: `cuda` or `triton`. Default: `cuda`.
- Colsum subsampling ratio in (0, 1]. Default: 1.0.
- Voxel grid resolution in meters. Default: 0.05.
- Initial voxel pool size. Default: 4096.
- Confidence threshold for voxel merging. Default: 2.0.
- Maximum KV entries per buffer voxel. Default: 8.
- Maximum KV entries per pivot voxel. Default: 4.
- `--voxel_backend`: Voxel backend: `python` or `cuda`. Default: `cuda`.
- `--allocator`: Voxel allocator: `static`, `slab`, or `segment`. Default: `segment`.
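The heavy-hitter selection behind `-hh` keeps the frames whose keys have accumulated the most attention mass, optionally sharpened by a softmax temperature (the 0.9 default above). The function below is an illustrative sketch of that H2O-style selection, not the repo's `h2o.py` API:

```python
import math

def heavy_hitter_frames(frame_scores, keep, temperature=0.9):
    """Pick the `keep` frames with the largest attention-mass scores.

    frame_scores: {frame_id: accumulated colsum over that frame's keys}
    temperature < 1 sharpens the score distribution toward the heaviest
    hitters; ranking is unchanged, but downstream weighting is peakier.
    """
    m = max(frame_scores.values())          # stabilise the exponentials
    weights = {f: math.exp((s - m) / temperature)
               for f, s in frame_scores.items()}
    z = sum(weights.values())
    probs = {f: w / z for f, w in weights.items()}
    top = sorted(probs, key=probs.get, reverse=True)[:keep]
    return sorted(top)

scores = {0: 9.1, 1: 2.3, 2: 7.8, 3: 1.1, 4: 4.0}
print(heavy_hitter_frames(scores, keep=2))  # → [0, 2]
```

With `-hh 2`, the two frames attracting the most attention would be kept in the cache alongside the sliding window, matching the H2O idea of retaining "heavy hitter" tokens.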
Use --base_model to switch backbones (stream3r or streamvggt). Batch scripts run all scenes; single-run scripts let you pick individual scenes. For programmatic use, see model_wrapper.run_model() and the stac / stream_session APIs.
The argument lists below cover only task- and dataset-specific options. All three scripts also accept the same STAC/streaming flags (--mode, --streaming, -win, -ck, -hh, -ret_sz, --voxel_*, --allocator, --attn_backend, etc.); see the main.py arguments or each script's --help for the full set.
Batch run: eval/long_recon/run.sh
```bash
# STAC (recommended)
python eval/long_recon/launch.py \
    --dataset_type NRGBD --scene_name complete_kitchen \
    --model_name causalvggt --base_model stream3r \
    --mode stac --streaming

# Custom STAC config
python eval/long_recon/launch.py \
    --dataset_type NRGBD --scene_name complete_kitchen \
    --model_name causalvggt --base_model stream3r \
    --mode window_chunk_merge --streaming \
    -ck 4 -win 4 -hh 2 -ret_sz 2 -ret_buf
```

**Command Line Arguments for `eval/long_recon/launch.py`**
- `--model_name`: Model variant. Default: `causalvggt`.
- `--base_model`: Backbone to wrap: `stream3r` or `streamvggt`.
- Output directory for evaluation results. Default: `eval_results/recon`.
- `--dataset_type`: Dataset to evaluate. Required. Choices: `7scenes`, `NRGBD`, `DTU`.
- `--scene_name`: Specific scene(s) to evaluate. Default: all scenes.
- Input resolution (long side). Default: 518.
- Keyframe sampling interval. Default: 1.
- Run evaluation on CPU instead of CUDA.
- Evaluate depth map metrics (print only).
- Evaluate camera trajectory metrics (print only).
- Disable reconstruction evaluation.
- Sub-folder tag appended to the scene output directory.
- Tag used in saved metric filenames to distinguish runs.
Batch run: eval/cam_pose/run.sh
```bash
python eval/cam_pose/launch.py \
    --dataset_type tum \
    --model_name causalvggt --base_model stream3r \
    --mode stac --streaming
```

**Command Line Arguments for `eval/cam_pose/launch.py`**
- `--model_name`: Model variant. Default: `causalvggt`.
- `--base_model`: Backbone to wrap: `stream3r` or `streamvggt`.
- Output directory for evaluation results. Default: `eval_results/all`.
- `--dataset_type`: Dataset to evaluate. Required. Choices: `sintel`, `tum`, `scannet`.
- Specific scene(s) to evaluate. Default: all scenes.
- Input resolution (long side). Choices: 224, 512, 518. Default: 518.
- Stride for pose evaluation. Default: 1.
- `--mode`: Attention mode. Default: `stac`. Choices: `stac`, `full`, `causal`, `window_kv`, `window_chunk_merge`.
Batch run: eval/video_depth/run.sh
```bash
python eval/video_depth/launch.py \
    --eval_dataset sintel \
    --model_name causalvggt --base_model stream3r \
    --mode stac --streaming

python eval/video_depth/eval_depth.py --align scale
```

Note: run `eval_depth.py` after `launch.py` to compute depth metrics with scale alignment.

**Command Line Arguments for `eval/video_depth/launch.py`**
- `--model_name`: Model variant. Default: `causalvggt`.
- `--base_model`: Backbone to wrap: `stream3r` or `streamvggt`.
- Output directory for evaluation results. Default: empty (current directory).
- `--eval_dataset`: Dataset to evaluate. Default: `sintel`. Choices depend on `dataset_metadata`.
- List of specific sequences to evaluate. Default: all.
- Input resolution (long side). Default: 518.
- Stride for pose evaluation. Default: 1.
- `--mode`: Attention mode. Default: `stac`. Choices: `stac`, `full`, `causal`, `window_kv`, `window_chunk_merge`.
Environment variables: `VERBOSE=1` prints per-frame KV stats; `MERGER_MEM_PROFILE=1` reports CUDA memory fragmentation during cleanup.
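For example, both can be combined on a single debug run (scene path is a placeholder):

```bash
# Per-frame KV stats plus memory-fragmentation reporting for one STAC run
VERBOSE=1 MERGER_MEM_PROFILE=1 python main.py --scene_dir /path/to/scene --mode stac
```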
STAC builds upon the following excellent open-source projects and we encourage you to check them out:
VGGT | STream3R | StreamVGGT | CUT3R | Spann3R