feat(tube-multiscale-fusion): two-branch multiscale temporal smoke classifier #73
rensortino wants to merge 13 commits
…assifier

New experiment under experiments/temporal-models/tube-multiscale-fusion/.

Architecture: a global DINOv2 sequence transformer (16 frames -> single 384-d
context vector) combined with a local spatio-temporal tube transformer that
decomposes each 224x224 bbox patch into a configurable grid of cells over
overlapping 4-frame windows. The two branches are fused via self-attention over
local tube tokens and cross-attention with the global vector as query.

Default geometry (winner of the included resolution sweep): 2x2 spatial grid of
112x112 cells x 4-frame windows at stride 2 -> 28 tubes/seq.

Val (280 sequences, single-seed):
  default 2x2:      F1=0.978 acc=0.979 prec=0.957 rec=1.000 PR-AUC=0.994
  spatial_4x4:      F1=0.967
  spatial_8x8:      F1=0.975
  temporal_stride1: F1=0.978 PR-AUC=0.995
  temporal_len8:    F1=0.974

Matches the bbox-tube-motion-fusion DINOv2+motion baseline (F1=0.978) on val
with no precomputed motion features.

Includes:
- src/: modular global/local/fusion components + Lightning module
- scripts/: train, evaluate, package_model (writes a portable .zip containing
  the checkpoint + the exact params used to train it)
- DVC pipeline: prepare/truncate/build_tubes/build_model_input/train/
  evaluate/package + a resolution sweep matrix
- tests/: 56 unit tests covering shape, masking, gradient flow, fusion
- README.md: architecture diagram, sweep results, reproduction steps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Really excited about this approach 🥳 — explicitly modelling the spatio-temporal signal at two scales (global low-frequency sequence context + local high-frequency tube motion) is exactly the kind of "how does it move" cue that should separate smoke from clouds/fog/haze, and the cross-attention fusion is a clean way to weight the two. Great direction! Really nice PR overall too 👏🏿 . Findings below are mostly polish — one CI blocker. 🔴 Blocker —
Ablate the temporal module (global DINOv2 sequence branch + cross-attention
fusion): keep only the local tube decomposition, swap the tubelet-transformer
tube encoder for a Kinetics-400 pretrained r3d_18, and mask-mean-pool the
per-tube vectors into an MLP head. Same tube geometry (2x2 / len 4 / stride 2),
schedule, and data as the full default model.

Val (280 tubes), full vs ablation:
  F1         0.978 -> 0.933
  precision  0.957 -> 0.887 (FP 6 -> 17)
  recall     1.000 -> 0.985
  R@FPR=1%   0.874 -> 0.659
  FLOPs/seq  226.5 -> 603.9 GFLOPs

Removing the global context branch hurts precision and the low-false-alarm
regime far more than recall: a local 3D-CNN still finds smoke but can no longer
reliably reject slow look-alikes (cloud/fog) without long-range context, and it
costs 2.7x the FLOPs to do worse. Validates the two-branch design.

Adds local_resnet3d.py, lit_ablation.py, train_ablation.py,
evaluate_ablation.py, params block, DVC train/evaluate stages, and a README
ablation section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…temporal module

The previous ablation conflated two changes (removing the global branch AND
swapping the local encoder for a torchvision r3d_18), which ran r3d over all 28
tubes and made the ablation MORE expensive than the full model (604 vs 226
GFLOPs). That is not a valid ablation.

Replace it with a faithful one: keep the local branch and fusion module exactly
(same modules, same hyperparameters); remove only the global DINOv2 sequence
branch, substituting a learnable query token for its context vector into the
fusion cross-attention. Removing the 16 per-frame DINOv2 passes makes the
ablation strictly smaller and cheaper, as an ablation should be.

Val (280 tubes), full vs ablation (no temporal module):
  F1         0.978 -> 0.872
  precision  0.957 -> 0.801 (FP 6 -> 32)
  recall     1.000 -> 0.956
  R@FPR=1%   0.874 -> 0.193
  params     36.4M -> 11.3M
  GFLOPs     226.5 -> 30.5 (~7x cheaper)

The global temporal branch is what buys the deployment-critical low-FPR
operating point: without it the local branch still finds smoke but cannot
reject slow look-alikes (cloud/fog), so precision and recall@1%FPR collapse.

Removes local_resnet3d.py + the r3d ablation stages; adds
ablation_classifier.py (AblationNoTemporalClassifier) and rewires lit_ablation /
scripts / params / dvc.yaml / README accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two more single-component ablations alongside the existing no-temporal one,
and write up the full study in ABLATIONS.md.
New variants (same data/seed/schedule/geometry; only the named part changes):
- no_spatial: remove the local tube branch (global DINOv2 branch only).
- weighted_mean: replace the cross-attention fusion with a learned weighted
mean over tubes + a learned gate between branches (no attention).
Val (280 tubes) F1 / R@FPR=1%:
full 0.978 / 0.874
weighted_mean 0.978 / 0.837 (ties full; higher precision, fewer FP)
no_spatial 0.971 / 0.770
no_temporal 0.872 / 0.193 (catastrophic — global context is essential)
Findings: the temporal/global module dominates (+0.106 F1); the spatial/local
module adds a modest gain concentrated at the strict low-FPR operating point
(+0.007 F1, R@1%FPR 0.770->0.874); the attention-based fusion is statistically
indistinguishable from a weighted mean here — a simplification candidate.
Adds WeightedMeanFusion, AblationNoSpatial/WeightedMean classifiers,
LitAblationGlobal, train/evaluate_ablation_global scripts, params + DVC stages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
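The R@FPR=1% figures quoted in these commit messages are recall at a fixed false-positive budget. A minimal sketch of one way to compute such a metric follows; the function name, the floor-based FP budget, and the strict `>` threshold are assumptions for illustration, not necessarily what the repository's evaluate script does.

```python
def recall_at_fpr(labels, scores, max_fpr=0.01):
    """Recall at the lowest threshold whose false-positive rate stays <= max_fpr.

    Hypothetical helper: labels are 0/1, scores are higher-is-smokier.
    """
    neg = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative sample")

    # Number of negatives we may misclassify (assumption: round the budget down).
    allowed_fp = int(max_fpr * len(neg))
    if allowed_fp >= len(neg):
        threshold = float("-inf")  # every negative may be a false positive
    else:
        # Predict positive only for scores strictly above this negative's score,
        # so at most `allowed_fp` negatives end up above the threshold.
        threshold = neg[allowed_fp]

    return sum(s > threshold for s in pos) / len(pos)
```

Under this reading, a val split with 145 negatives (TN=139 + FP=6 for the default model) admits at most one false positive at a 1% budget, so the metric can swing noticeably on a single hard negative.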
…OMPARISON.md

Compare 6 aggregators for the global branch's temporal module, isolated in the
global-only (no_spatial) setting: Transformer (baseline) vs LSTM, GRU, MLP,
1D CNN, and Linear+weighted-average.

Adds a pluggable build_aggregator factory (global_branch.py), threads
aggregator_kind through GlobalBranch / AblationNoSpatialClassifier /
LitAblationGlobal / train script, plus params blocks and a DVC matrix
(train_temporal / evaluate_temporal).

Val (280 tubes), F1 / PR-AUC / recall@1%FPR:
  LSTM         0.975 / 0.993 / 0.867 (best F1, 0 FN)
  MLP          0.975 / 0.987 / 0.519 (good F1 but poor low-FPR; most params)
  1D CNN       0.971 / 0.990 / 0.770
  Transformer  0.971 / 0.990 / 0.770 (baseline, mid-pack)
  GRU          0.967 / 0.995 / 0.889 (best ranking; fewest params 2.76M)
  linear_wavg  0.960 / 0.991 / 0.815 (worst: no temporal mixing)

Findings: compute is identical across aggregators (~196 GFLOPs, ~8.8 ms, all
dominated by the 16 DINOv2 passes), so the choice is free; every learned mixer
lands within ~0.8 F1 points; pure weighted averaging is worst (needs real
temporal mixing); GRU/LSTM edge the Transformer at the strict operating point
with fewer params. Recommend GRU as default (re-validate end-to-end before
swapping production). Single-seed/small-val caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PARISON.md

Compare 5 per-tube spatial encoders, isolated in the local-only (no_temporal)
setting: tubelet-transformer (baseline) vs 3D ResNet (Kinetics r3d_18), ViViT
(factorised attention), ConvLSTM, and TSM.

Adds the encoders + build_tube_encoder factory to local_branch.py and threads
encoder_kind through LocalBranch / AblationNoTemporalClassifier /
LitAblationNoTemporal / train script; params blocks + DVC matrix
(train_spatial / evaluate_spatial).

Val (280 tubes), F1 / PR-AUC / params / GFLOPs:
  3D ResNet       0.942 / 0.979 / 40.6M / 604 (best; only pretrained encoder)
  ViViT           0.890 / 0.924 / 14.8M / 55  (best from-scratch)
  tubelet (base)  0.872 / 0.902 / 11.3M / 30
  ConvLSTM        0.858 / 0.868 / 8.7M  / 73
  TSM             0.850 / 0.859 / 7.7M  / 27

Findings: the spatial encoder matters far more than the temporal aggregator
(F1 spread ~9 pts vs ~0.8 pts) because in local-only it is the whole feature
extractor. Pretraining dominates: Kinetics r3d_18 cuts the false positives to
about a third (10 vs 32) but at 20x the FLOPs. Among from-scratch encoders,
ViViT's factorised attention edges the current tubelet-transformer cheaply.
Given the local branch adds only ~0.7 F1 points in the full model (see
ABLATIONS.md), ViViT is the sensible upgrade and r3d a poor compute trade
unless the local branch becomes primary. Single-seed/small-val and local-only
caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OMBINED_VARIATIONS.md

Cartesian product of spatial (tubelet / ViViT / 3D ResNet) x temporal
(transformer / LSTM) in the full two-branch model: 6 cells (tubelet x
transformer = existing dinov2_multiscale, reused).

Threads aggregator_kind + encoder_kind through TubeMultiscaleClassifier /
LitTubeMultiscaleClassifier / scripts/train.py; adds 5 params blocks + a DVC
matrix (train_full_combo / evaluate_full_combo).

Val (280 tubes), F1 / PR-AUC / GFLOPs:
  tubelet x transformer   0.978 / 0.994 / 226 (default: cheapest, top F1, 0 FN)
  resnet3d x lstm         0.978 / 0.995 / 800 (best precision/PR-AUC, 3.5x FLOPs)
  vivit x lstm            0.971 / 0.991 / 251
  resnet3d x transformer  0.968 / 0.995 / 800
  vivit x transformer     0.968 / 0.993 / 251
  tubelet x lstm          0.967 / 0.995 / 226

Finding: in the full model all 6 combos tie within ~1.1 F1 points (vs 9 pts
when the spatial encoder was tested in isolation). Once the global DINOv2
branch is present it carries the prediction, and the local encoder / aggregator
choices are second-order. The cheapest default (tubelet x transformer) is
statistically the best; heavier spatial encoders don't pay off. Keep the
default.

Note: scripts/train.py also carries a pre-existing working-tree edit switching
EarlyStopping from val/f1(max) to val/loss(min); the 5 new combos trained under
it (ModelCheckpoint still selects best-val/f1, so reported metrics are
best-by-f1 checkpoints). Single-seed/small-val caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
New experiment under experiments/temporal-models/tube-multiscale-fusion/: a
two-branch temporal smoke classifier that pairs a global DINOv2 sequence
context branch with a local spatio-temporal tube transformer, fused via
cross-attention.
Motivation
Smoke and its hardest distractors (clouds, fog, haze, dust) look near-identical
in any single frame; what separates them is how they move, and they move
differently at different scales:
- Global (low frequency): DINOv2 embeds each of the 16 frames, and a small
  transformer aggregates them into one context vector capturing the overall
  shape evolution: static/drifting (cloud, fog) vs. growing/rising (plume).
- Local (high frequency): each 224×224 bbox patch is decomposed into a grid of
  spatial cells tracked over short, overlapping 4-frame windows ("tubes"). A
  per-tube video transformer reads the local motion signature
  (turbulent/high-variance for smoke, smooth/coherent for fog/cloud) that a
  single global vector smooths away.
- Fusion: the global context vector acts as the query in cross-attention over
  the tubes, weighting high-frequency local evidence with long-range context.
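The fusion step described above (self-attention over tube tokens, then cross-attention with the global vector as query) can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the module in src/: the class name `CrossAttentionFusion`, the residual + LayerNorm layout, the 6 heads, and the assumption that tube tokens share the 384-d width of the global vector are all hypothetical.

```python
import torch
from torch import nn


class CrossAttentionFusion(nn.Module):
    """Hypothetical fusion block: tube self-attention, then a global-query cross-attention."""

    def __init__(self, dim: int = 384, num_heads: int = 6) -> None:
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, global_vec, tube_tokens, tube_mask):
        # global_vec: (B, D), tube_tokens: (B, T, D), tube_mask: (B, T) with True = real tube.
        pad = ~tube_mask  # MultiheadAttention ignores keys where the padding mask is True

        # Let tubes exchange information before they are pooled.
        mixed, _ = self.self_attn(tube_tokens, tube_tokens, tube_tokens, key_padding_mask=pad)
        tokens = self.norm(tube_tokens + mixed)

        # The single global context vector queries the tube tokens.
        query = global_vec.unsqueeze(1)  # (B, 1, D)
        fused, weights = self.cross_attn(query, tokens, tokens, key_padding_mask=pad)

        logit = self.head(fused.squeeze(1)).squeeze(-1)  # (B,)
        return logit, weights.squeeze(1)  # head-averaged attention over tubes, (B, T)
```

The returned attention weights sum to one over the unmasked tubes, which makes it possible to inspect which tubes the global context attended to. A sequence whose mask is entirely False would produce NaNs in this sketch and would need separate handling.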
Architecture
Default geometry (winner of the included resolution sweep): 2×2 grid of
112×112 cells × 4-frame windows at stride 2 → 28 tubes/seq.
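The 28 tubes/seq figure follows from the stated geometry if windows are taken without padding over the 16-frame sequence. The sketch below reproduces the arithmetic; the helper names are hypothetical and the no-padding windowing is an assumption that happens to match the quoted count.

```python
def window_starts(num_frames: int, window_len: int, stride: int) -> list[int]:
    """Start indices of all full windows (assumption: no padding at the end)."""
    return list(range(0, num_frames - window_len + 1, stride))


def count_tubes(num_frames: int = 16, grid: tuple[int, int] = (2, 2),
                window_len: int = 4, stride: int = 2) -> int:
    """One tube per (grid cell, temporal window) pair."""
    rows, cols = grid
    return rows * cols * len(window_starts(num_frames, window_len, stride))


def cell_size(patch_size: int = 224, grid: tuple[int, int] = (2, 2)) -> tuple[int, int]:
    """Height and width of each spatial cell in pixels."""
    rows, cols = grid
    return patch_size // rows, patch_size // cols


print(window_starts(16, 4, 2))  # [0, 2, 4, 6, 8, 10, 12] -> 7 windows
print(count_tubes())            # 28
print(cell_size())              # (112, 112)
```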
Results (val, 280 tubes, single seed)
Confusion matrix (default 2×2): TN=139, FP=6, FN=0, TP=135.
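As a sanity check, the headline metrics can be recomputed from the confusion matrix above. The helper below is a generic sketch (the name `binary_metrics` is hypothetical), not the repository's evaluation code.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = binary_metrics(tp=135, fp=6, fn=0, tn=139)
print({name: round(value, 3) for name, value in metrics.items()})
# {'precision': 0.957, 'recall': 1.0, 'f1': 0.978, 'accuracy': 0.979}
```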
Benchmarking (GUIDELINES.md metrics, RTX 4090 / 24-thread CPU)
What's included
- src/tube_multiscale_fusion/: modular global_branch, local_branch, fusion,
  classifier, and the LitTubeMultiscaleClassifier Lightning module. Upstream
  tube-building / patch-cropping primitives reused from the parent.
- scripts/: train, evaluate, benchmark, and package_model (bundles checkpoint +
  exact params + manifest into a portable .zip).
- DVC pipeline: prepare → truncate → build_tubes → build_model_input → train →
  evaluate → package → benchmark, plus a resolution-sweep matrix.
- tests/: unit tests (shape, masking, gradient flow, fusion); 2 @slow tests
  behind a marker for the real DINOv2 download.
- README.md: architecture diagram, sweep results, and reproduction steps.
How to reproduce
Notes for reviewers
- data/01_raw/datasets_full/ is gitignored; on a clean checkout it comes from
  the dvc import commands in the README (pinned to v2.2.0), not from the DVC
  remote.
- DVC remote: s3://pyro-vision-rd/dvc/experiments/tube-multiscale-fusion/.
  Artifacts (checkpoints, model package, reports) are pushed via dvc push;
  benchmark.json is a cache: false metric committed directly to git.
- Early stopping on val/f1 saturates quickly; see README for the schedule.