feat(tube-multiscale-fusion): two-branch multiscale temporal smoke classifier by rensortino · Pull Request #73 · pyronear/vision-rd

rensortino · 2026-05-26T22:02:14Z

Summary

New experiment under experiments/temporal-models/tube-multiscale-fusion/ — a
two-branch temporal smoke classifier that pairs a global DINOv2 sequence
context branch with a local spatio-temporal tube transformer, fused via
cross-attention.

Motivation

Smoke and its hardest distractors (clouds, fog, haze, dust) look near-identical
in any single frame; what separates them is how they move, and they move
differently at different scales:

- Global branch (low-frequency, long time): DINOv2 embeds each of 16 frames;
  a small transformer aggregates them into one context vector capturing the
  overall shape evolution: static/drifting (cloud, fog) vs. growing/rising
  (plume).
- Local branch (high-frequency, short time): each 224×224 bbox patch is
  decomposed into a grid of spatial cells tracked over short, overlapping
  4-frame windows ("tubes"). A per-tube video transformer reads the local motion
  signature (turbulent/high-variance for smoke, smooth/coherent for fog/cloud)
  that a single global vector smooths away.
- Fusion: local tubes self-attend, then the global vector acts as the
  query in cross-attention over the tubes, weighting high-frequency local
  evidence with long-range context.

Architecture

patches (B, 16, 3, 224, 224) + mask
   ├─ Global: DINOv2 ViT-S/14 per frame → transformer aggregator → (B, 384)
   └─ Local:  extract_tubes (grid × overlapping windows) → tubelet Conv3d +
              self-attn transformer per tube → (B, N_tubes, 384) + validity mask
   → Fusion (self-attn on tubes + cross-attn global=Q/locals=KV) → (B, 384)
   → MLP head → logit

Default geometry (winner of the included resolution sweep): 2×2 grid of
112×112 cells × 4-frame windows at stride 2 → 28 tubes/seq.
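
The tube counts in the sweep follow directly from the geometry. A minimal sketch, assuming windows are taken over the 16-frame sequence without padding and every grid cell is paired with every window; `window_starts`, `cell_boxes`, and `count_tubes` are illustrative names, not functions from this PR:

```python
def window_starts(num_frames: int, length: int, stride: int) -> list[int]:
    """Start indices of every full-length window (no padding assumed)."""
    return list(range(0, num_frames - length + 1, stride))


def cell_boxes(patch_size: int, grid: int) -> list[tuple[int, int, int, int]]:
    """(top, left, bottom, right) pixel box of each cell in a grid x grid split."""
    cell = patch_size // grid
    return [
        (r * cell, c * cell, (r + 1) * cell, (c + 1) * cell)
        for r in range(grid)
        for c in range(grid)
    ]


def count_tubes(num_frames: int = 16, patch_size: int = 224,
                grid: int = 2, length: int = 4, stride: int = 2) -> int:
    """One tube per (grid cell, temporal window) pair."""
    cells = cell_boxes(patch_size, grid)
    windows = window_starts(num_frames, length, stride)
    return len(cells) * len(windows)


# Geometry of each sweep variant, as listed in the results table.
variants = {
    "default (2x2)": dict(grid=2, length=4, stride=2),
    "spatial_4x4": dict(grid=4, length=4, stride=2),
    "spatial_8x8": dict(grid=8, length=4, stride=2),
    "temporal_stride1": dict(grid=2, length=4, stride=1),
    "temporal_len8": dict(grid=2, length=8, stride=4),
}
for name, geometry in variants.items():
    print(name, count_tubes(**geometry))
```

Under these assumptions the default gives 4 cells × 7 windows = 28 tubes, and the other variants reproduce the tubes/seq column of the sweep.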

Results (val, 280 tubes, single seed)

variant	grid / len / stride	tubes/seq	val F1	val acc	val prec	val rec	val PR-AUC
default (2×2)	2×2 / 4 / 2	28	0.9783	0.9786	0.9574	1.0000	0.9936
spatial_4x4	4×4 / 4 / 2	112	0.9673	0.9679	0.9500	0.9852	0.9874
spatial_8x8	8×8 / 4 / 2	448	0.9745	0.9750	0.9571	0.9926	0.9942
temporal_stride1	2×2 / 4 / 1	52	0.9781	0.9786	0.9640	0.9926	0.9954
temporal_len8	2×2 / 8 / 4	12	0.9744	0.9750	0.9638	0.9852	0.9939

Confusion matrix (default 2×2): TN=139, FP=6, FN=0, TP=135.
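
As a cross-check, the default row of the table can be recomputed from these four counts using the standard definitions. The helper name `binary_metrics` is illustrative and not part of the PR:

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "acc": accuracy, "prec": precision, "rec": recall}


metrics = binary_metrics(tn=139, fp=6, fn=0, tp=135)
print({name: round(value, 4) for name, value in metrics.items()})
# {'f1': 0.9783, 'acc': 0.9786, 'prec': 0.9574, 'rec': 1.0}
```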

Benchmarking (GUIDELINES.md metrics, RTX 4090 / 24-thread CPU)

Metric	Value
Recall @ FPR = 1% / 5% / 10%	0.874 / 1.000 / 1.000
Time-to-detection (median)	2 frames ≈ 30 s after tube start (135/135 eventually fire)
Inference latency (GPU)	10.7 ms/seq → 0.67 ms/frame
Inference latency (CPU)	290 ms/seq → 18.2 ms/frame
Model size	36.4 M params (16.6 M trainable, 19.9 M frozen DINOv2)
FLOPs	226.5 GFLOPs/seq → 14.2 GFLOPs/frame
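
The PR does not show how `scripts/benchmark.py` computes recall at a fixed FPR. The sketch below uses one common convention, stated here as an assumption: allow at most `floor(target_fpr * n_negatives)` false positives, place the threshold at the next-highest negative score, and predict positive only when strictly above it. `recall_at_fpr` is an illustrative name.

```python
import math


def recall_at_fpr(scores: list[float], labels: list[int], target_fpr: float) -> float:
    """Recall at a threshold whose false-positive rate does not exceed target_fpr."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Number of negatives allowed to score above the threshold
    # (small epsilon guards against float rounding just below an integer).
    allowed_fp = int(target_fpr * len(neg) + 1e-9)
    # The threshold is the first negative that must be rejected; predictions are
    # positive only when strictly above it, so ties count as rejections.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else -math.inf
    return sum(s > threshold for s in pos) / len(pos)


# Toy data: 4 positives followed by 10 negatives.
toy_scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
toy_labels = [1, 1, 1, 1] + [0] * 10
print(recall_at_fpr(toy_scores, toy_labels, 0.0))  # 0.5
print(recall_at_fpr(toy_scores, toy_labels, 0.1))  # 0.75
```

With the 145 negatives implied by the confusion matrix above (TN + FP), a 1% budget admits at most one false positive under this convention, which is why that operating point is so sensitive on a validation set of this size.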

What's included

- src/tube_multiscale_fusion/: modular global_branch, local_branch,
  fusion, classifier, and the LitTubeMultiscaleClassifier Lightning module.
  Upstream tube-building / patch-cropping primitives reused from the parent.
- scripts/: train, evaluate, benchmark, and package_model
  (bundles checkpoint + exact params + manifest into a portable .zip).
- DVC pipeline: prepare → truncate → build_tubes → build_model_input → train → evaluate → package → benchmark, plus a resolution-sweep matrix.
- Tests: 56 unit tests (shape, masking, tube extraction, gradient flow,
  fusion); 2 @slow tests behind a marker for the real DINOv2 download.
- README: architecture diagram, motivation, sweep results, benchmarking,
  and reproduction steps.

How to reproduce

cd experiments/temporal-models/tube-multiscale-fusion
uv sync
# import data (pinned pyro-dataset v2.2.0) — see README "Data"
uv run dvc repro

Notes for reviewers

- Data: data/01_raw/datasets_full/ is gitignored; on a clean checkout it
  comes from the dvc import commands in the README (pinned to v2.2.0), not
  from the DVC remote.
- DVC remote: s3://pyro-vision-rd/dvc/experiments/tube-multiscale-fusion/.
  Artifacts (checkpoints, model package, reports) are pushed via dvc push;
  benchmark.json is a cache: false metric committed directly to git.
- Single-seed results; no multi-seed CI run yet. The small val set (280) means
  early-stopping on val/f1 saturates quickly; see README for the schedule.

…assifier

New experiment under experiments/temporal-models/tube-multiscale-fusion/.

Architecture: a global DINOv2 sequence transformer (16 frames -> single 384-d context vector) combined with a local spatio-temporal tube transformer that decomposes each 224x224 bbox patch into a configurable grid of cells over overlapping 4-frame windows. The two branches are fused via self-attention over local tube tokens and cross-attention with the global vector as query.

Default geometry (winner of the included resolution sweep): 2x2 spatial grid of 112x112 cells x 4-frame windows at stride 2 -> 28 tubes/seq.

Val (280 sequences, single-seed):
  default 2x2:       F1=0.978 acc=0.979 prec=0.957 rec=1.000 PR-AUC=0.994
  spatial_4x4:       F1=0.967
  spatial_8x8:       F1=0.975
  temporal_stride1:  F1=0.978 PR-AUC=0.995
  temporal_len8:     F1=0.974

Matches the bbox-tube-motion-fusion DINOv2+motion baseline (F1=0.978) on val with no precomputed motion features.

Includes:
- src/: modular global/local/fusion components + Lightning module
- scripts/: train, evaluate, package_model (writes a portable .zip containing the checkpoint + the exact params used to train it)
- DVC pipeline: prepare/truncate/build_tubes/build_model_input/train/evaluate/package + a resolution sweep matrix
- tests/: 56 unit tests covering shape, masking, gradient flow, fusion
- README.md: architecture diagram, sweep results, reproduction steps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Chouffe · 2026-06-01T14:34:50Z

Really excited about this approach 🥳 — explicitly modelling the spatio-temporal signal at two scales (global low-frequency sequence context + local high-frequency tube motion) is exactly the kind of "how does it move" cue that should separate smoke from clouds/fog/haze, and the cross-attention fusion is a clean way to weight the two. Great direction!

Really nice PR overall too 👏🏿.
Clean modular split (global_branch / local_branch / fusion / classifier / lit_module), genuinely strong test coverage, committed reproducible metrics, and it follows repo conventions.

Findings below are mostly polish — one CI blocker.

🔴 Blocker — `ruff format --check` fails (CI will go red)

With the locked ruff 0.15.7, four files need reformatting:

Would reformat: scripts/benchmark.py
Would reformat: scripts/evaluate.py
Would reformat: tests/test_classifier.py
Would reformat: tests/test_local_branch.py

Fix:

make format

🟠 Should fix

1. Dead prepare stage + unused heavy deps. scripts/prepare.py downloads YOLO weights into data/01_raw/models, but nothing consumes them — build_tubes.py reads label .txt files directly and its docstring says "No YOLO inference is performed." No DVC stage depends on the prepare output. Correspondingly ultralytics, opencv-python-headless, pandas, and pydantic are declared in pyproject.toml but never imported anywhere in src/, scripts/, or tests/ (ultralytics especially is a heavy footprint). Suggest dropping the prepare stage + these deps, or wiring the detector in if it's actually intended.

2. Tube-building algo is now in the lib — this experiment forks it. The canonical tube-building implementation now lives in lib/bbox-tube-temporal/ (package bbox-tube-temporal-core): compute_iou, match_detections, build_tubes, interpolate_gaps, select_longest_tube, tube_from_record — exactly the API re-implemented here in src/tube_multiscale_fusion/tubes.py (and data.py / model_input.py / types.py look similarly duplicated). The lib version is also ahead (e.g. merge_colocated_tubes). Per "shared code goes under lib/ with proper packaging," this experiment should depend on bbox-tube-temporal-core and import from it rather than carry a fork that will drift.

3. No pyrocore integration. pyrocore is a declared dependency (with a [tool.uv.sources] path) but is never imported. The convention is for each experiment to expose a TemporalModel subclass so it plugs into the shared eval interface / temporal-model-explorer — without an adapter this model can't be compared alongside the others. At minimum worth a tracked follow-up; ideally a thin wrapper around LitTubeMultiscaleClassifier.

🟡 Minor / nits

Copy-paste docstrings referencing the wrong architecture:
- tubes.py module docstring mentions "the LSTM's need for temporally-ordered features" — there's no LSTM.
- augment.py TemporalTubeTransform mentions "pack_padded_sequence (used by the GRU head)" — no GRU / pack_padded_sequence. The prefix-compaction is correct and still needed; just the stated rationale is stale.
Dataset version — inconsistent + now outdated: PR body & README say pinned pyro-dataset v2.2.0, but data.py's list_sequences docstring says "nested pyro-dataset v3.0.0 layout." Which is authoritative? Also note pyro-dataset v4.0.0 is now available — worth deciding whether to adopt it (results would need a re-run) or at least documenting why this experiment pins an older version.

💡 Design suggestions (follow-up, not blocking)

The architecture is great but feels heavier than the current benchmark justifies. Three ideas, in priority order, for probing whether the spatial encoding can be much simpler:

1. Ablate the local branch before optimizing it. This is the highest-value experiment, and it's nearly free — you already have the global branch. Run global-branch-only (DINOv2 + temporal aggregator → head, no local branch, no fusion) on the same split. Two signals say the local branch may not be earning its complexity: the metrics are already near-saturated (F1 0.978, recall@FPR=5% = 1.0, 0 false negatives) on a small single-seed val set, and the resolution sweep shows 2×2 / 28 tubes beat 8×8 / 448 tubes — i.e. adding spatial granularity hurt, the opposite of what you'd expect if fine local structure were the discriminative signal. If global-only lands within noise of the full model, the entire local branch + fusion is removable. If it doesn't, inspect which sequences global-only gets wrong (presumably the fog/cloud/haze distractors) and let those concrete failure cases dictate the minimal local encoder — rather than carrying the full tube transformer on the assumption it's needed.

2. If local motion does help, reuse the DINOv2 patch tokens instead of a second encoder. The global branch already runs DINOv2 on every frame but keeps only the CLS token (global_pool="token") and throws away the patch-token grid (16×16 for ViT-S/14 @ 224). That grid is a spatial decomposition — for free, and at far better quality than a small Conv3d+transformer trained from scratch on a tiny dataset. Concretely: switch the backbone to return patch tokens (B, T, 256, 384), then capture local motion with a cheap temporal op per patch location — e.g. temporal std-dev/variance over the T axis (directly encodes "turbulent smoke vs smooth fog"), or a tiny shared temporal MLP/attention per location. This deletes local_branch.py's tube extraction and the tubelet encoder while keeping the multi-scale intuition, and it removes the awkward fact that the model currently encodes the same pixels twice (once via DINOv2, once via the raw-pixel tube encoder). The 2×2-beats-8×8 result also suggests you can pool that patch grid down aggressively without losing anything.

3. Replace cross-attention fusion with pooling + concat. Independent of the above, a single global query cross-attending ~28 tubes (FusionModule, two layers of self-attn + cross-attn + FFN) is a lot of machinery for "summarize the local evidence and combine it with the global vector." Mean- and/or max-pool the local embeddings over the (masked) tube axis and concat([global_vec, pooled_local]) → MLP head is a strong, much smaller baseline — and worth running as its own ablation against the cross-attention version to confirm the attention is actually buying accuracy. This drops fusion.py entirely.
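
The pooling + concat baseline from item 3 can be sketched in a few lines. Plain Python lists stand in for tensors here, and `masked_pool` / `concat_fusion` are hypothetical names, not code from the PR. The zero fallback for a sequence with no valid tubes is also an assumption.

```python
def masked_pool(tubes: list[list[float]], mask: list[bool]) -> tuple[list[float], list[float]]:
    """Mean- and max-pool tube embeddings over the valid (mask=True) tubes."""
    valid = [tube for tube, keep in zip(tubes, mask) if keep]
    dim = len(tubes[0])
    if not valid:
        # Assumed fallback: no valid tube, so feed zeros to the head.
        return [0.0] * dim, [0.0] * dim
    mean = [sum(column) / len(valid) for column in zip(*valid)]
    peak = [max(column) for column in zip(*valid)]
    return mean, peak


def concat_fusion(global_vec: list[float], tubes: list[list[float]],
                  mask: list[bool]) -> list[float]:
    """concat([global_vec, mean_pooled_local, max_pooled_local]) for an MLP head."""
    mean, peak = masked_pool(tubes, mask)
    return list(global_vec) + mean + peak


tubes = [[1.0, 4.0], [3.0, 0.0], [100.0, 100.0]]  # last tube is padding
mask = [True, True, False]
fused = concat_fusion([0.5, -0.5], tubes, mask)
print(fused)  # [0.5, -0.5, 2.0, 2.0, 3.0, 4.0]
```

The padded tube never reaches either pool, which is the property the validity mask has to preserve whatever fusion is used.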

Net "simplest viable" to try alongside the current model: global DINOv2 sequence branch + temporal-variance pooling over its own patch-token grid + concat → MLP — same two-scale story, but no tube extraction, no Conv3d encoder, no cross-attention.
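
A rough sketch of the temporal-variance idea, assuming patch tokens arrive as a nested list indexed `tokens[frame][patch][dim]` (standing in for a `(T, P, D)` tensor). The function names and the choice of population standard deviation are illustrative assumptions.

```python
from statistics import pstdev


def temporal_std(tokens: list[list[list[float]]]) -> list[list[float]]:
    """tokens[t][p][d] -> std[p][d]: population std of each feature over frames."""
    num_patches = len(tokens[0])
    dim = len(tokens[0][0])
    return [
        [pstdev([frame[p][d] for frame in tokens]) for d in range(dim)]
        for p in range(num_patches)
    ]


def motion_energy(tokens: list[list[list[float]]]) -> list[float]:
    """One scalar per patch location: mean temporal std across feature dims."""
    return [sum(row) / len(row) for row in temporal_std(tokens)]


# 4 frames, 2 patch locations, 2 feature dims.
# Patch 0 is static; patch 1 alternates between two states.
frames = [
    [[1.0, 2.0], [0.0, 0.0]],
    [[1.0, 2.0], [2.0, 2.0]],
    [[1.0, 2.0], [0.0, 0.0]],
    [[1.0, 2.0], [2.0, 2.0]],
]
print(motion_energy(frames))  # [0.0, 1.0]
```

A static location scores zero and a fluctuating one scores high, which is the "turbulent vs smooth" cue in its simplest form. Whether DINOv2 patch tokens vary this cleanly on real smoke is exactly what the proposed ablation would test.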

Ablate the temporal module (global DINOv2 sequence branch + cross-attention fusion): keep only the local tube decomposition, swap the tubelet-transformer tube encoder for a Kinetics-400 pretrained r3d_18, and mask-mean-pool the per-tube vectors into an MLP head. Same tube geometry (2x2 / len 4 / stride 2), schedule, and data as the full default model.

Val (280 tubes), full vs ablation:
  F1         0.978 -> 0.933
  precision  0.957 -> 0.887 (FP 6 -> 17)
  recall     1.000 -> 0.985
  R@FPR=1%   0.874 -> 0.659
  FLOPs/seq  226.5 -> 603.9 GFLOPs

Removing the global context branch hurts precision and the low-false-alarm regime far more than recall: a local 3D-CNN still finds smoke but can no longer reliably reject slow look-alikes (cloud/fog) without long-range context, and costs 2.7x the FLOPs to do worse. Validates the two-branch design.

Adds local_resnet3d.py, lit_ablation.py, train_ablation.py, evaluate_ablation.py, params block, DVC train/evaluate stages, and a README ablation section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…temporal module

The previous ablation conflated two changes (removing the global branch AND swapping the local encoder for a torchvision r3d_18), which ran r3d over all 28 tubes and made the ablation MORE expensive than the full model (604 vs 226 GFLOPs). That is not a valid ablation.

Replace it with a faithful one: keep the local branch and fusion module exactly (same modules, same hyperparameters); remove only the global DINOv2 sequence branch, substituting a learnable query token for its context vector into the fusion cross-attention. Removing the 16 per-frame DINOv2 passes makes the ablation strictly smaller and cheaper, as an ablation should be.

Val (280 tubes), full vs ablation (no temporal module):
  F1         0.978 -> 0.872
  precision  0.957 -> 0.801 (FP 6 -> 32)
  recall     1.000 -> 0.956
  R@FPR=1%   0.874 -> 0.193
  params     36.4M -> 11.3M
  GFLOPs     226.5 -> 30.5 (~7x cheaper)

The global temporal branch is what buys the deployment-critical low-FPR operating point: without it the local branch still finds smoke but cannot reject slow look-alikes (cloud/fog), so precision and recall@1%FPR collapse.

Removes local_resnet3d.py + the r3d ablation stages; adds ablation_classifier.py (AblationNoTemporalClassifier) and rewires lit_ablation / scripts / params / dvc.yaml / README accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add two more single-component ablations alongside the existing no-temporal one, and write up the full study in ABLATIONS.md.

New variants (same data/seed/schedule/geometry; only the named part changes):
- no_spatial: remove the local tube branch (global DINOv2 branch only).
- weighted_mean: replace the cross-attention fusion with a learned weighted mean over tubes + a learned gate between branches (no attention).

Val (280 tubes) F1 / R@FPR=1%:
  full           0.978 / 0.874
  weighted_mean  0.978 / 0.837 (ties full; higher precision, fewer FP)
  no_spatial     0.971 / 0.770
  no_temporal    0.872 / 0.193 (catastrophic: global context is essential)

Findings: the temporal/global module dominates (+0.106 F1); the spatial/local module adds a modest gain concentrated at the strict low-FPR operating point (+0.007 F1, R@1%FPR 0.770->0.874); the attention-based fusion is statistically indistinguishable from a weighted mean here, a simplification candidate.

Adds WeightedMeanFusion, AblationNoSpatial/WeightedMean classifiers, LitAblationGlobal, train/evaluate_ablation_global scripts, params + DVC stages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
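
The commit message does not spell out the exact form of the weighted-mean fusion. One plausible reading, offered purely as an assumption, is a masked softmax over per-tube scores followed by a sigmoid gate that blends the pooled local vector with the global one. In the sketch below `tube_scores` and `gate_logit` stand in for quantities a real module would learn; none of these names come from the PR's `WeightedMeanFusion`.

```python
import math


def weighted_mean_fusion(global_vec: list[float], tubes: list[list[float]],
                         tube_scores: list[float], mask: list[bool],
                         gate_logit: float) -> list[float]:
    """Softmax-weighted mean over valid tubes, then a scalar gate between branches."""
    valid = [(tube, score) for tube, score, keep in zip(tubes, tube_scores, mask) if keep]
    if not valid:
        # Assumed fallback: with no valid tube, pass the global vector through.
        return list(global_vec)
    top = max(score for _, score in valid)  # subtract max for numerical stability
    exps = [math.exp(score - top) for _, score in valid]
    total = sum(exps)
    local = [
        sum(weight / total * tube[d] for (tube, _), weight in zip(valid, exps))
        for d in range(len(global_vec))
    ]
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid: 1 -> global only, 0 -> local only
    return [gate * g + (1.0 - gate) * l for g, l in zip(global_vec, local)]


example = weighted_mean_fusion(
    global_vec=[3.0, -1.0],
    tubes=[[2.0, 0.0], [0.0, 2.0], [9.0, 9.0]],
    tube_scores=[0.0, 0.0, 5.0],   # third tube is masked out, so its score is ignored
    mask=[True, True, False],
    gate_logit=0.0,                # sigmoid(0) = 0.5: equal blend of the two branches
)
print(example)  # [2.0, 0.0]
```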

…OMPARISON.md

Compare 6 aggregators for the global branch's temporal module, isolated in the global-only (no_spatial) setting: Transformer (baseline) vs LSTM, GRU, MLP, 1D CNN, and Linear+weighted-average.

Adds a pluggable build_aggregator factory (global_branch.py), threads aggregator_kind through GlobalBranch / AblationNoSpatialClassifier / LitAblationGlobal / train script, plus params blocks and a DVC matrix (train_temporal / evaluate_temporal).

Val (280 tubes), F1 / PR-AUC / recall@1%FPR:
  LSTM         0.975 / 0.993 / 0.867 (best F1, 0 FN)
  MLP          0.975 / 0.987 / 0.519 (good F1 but poor low-FPR; most params)
  1D CNN       0.971 / 0.990 / 0.770
  Transformer  0.971 / 0.990 / 0.770 (baseline, mid-pack)
  GRU          0.967 / 0.995 / 0.889 (best ranking; fewest params 2.76M)
  linear_wavg  0.960 / 0.991 / 0.815 (worst: no temporal mixing)

Findings: compute is identical across aggregators (~196 GFLOPs, ~8.8 ms, all dominated by the 16 DINOv2 passes), so the choice is free; every learned mixer lands within ~0.8 F1 points; pure weighted averaging is worst (needs real temporal mixing); GRU/LSTM edge the Transformer at the strict operating point with fewer params. Recommend GRU as default (re-validate end-to-end before swapping production). Single-seed/small-val caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…PARISON.md

Compare 5 per-tube spatial encoders, isolated in the local-only (no_temporal) setting: tubelet-transformer (baseline) vs 3D ResNet (Kinetics r3d_18), ViViT (factorised attention), ConvLSTM, and TSM.

Adds the encoders + build_tube_encoder factory to local_branch.py and threads encoder_kind through LocalBranch / AblationNoTemporalClassifier / LitAblationNoTemporal / train script; params blocks + DVC matrix (train_spatial / evaluate_spatial).

Val (280 tubes), F1 / PR-AUC / params / GFLOPs:
  3D ResNet       0.942 / 0.979 / 40.6M / 604 (best: only pretrained encoder)
  ViViT           0.890 / 0.924 / 14.8M / 55 (best from-scratch)
  tubelet (base)  0.872 / 0.902 / 11.3M / 30
  ConvLSTM        0.858 / 0.868 / 8.7M / 73
  TSM             0.850 / 0.859 / 7.7M / 27

Findings: the spatial encoder matters far more than the temporal aggregator (F1 spread ~9 pts vs ~0.8) because in local-only it is the whole feature extractor. Pretraining dominates: Kinetics r3d_18 cuts the false positives to about a third (10 vs 32) but at 20x the FLOPs. Among from-scratch encoders, ViViT's factorised attention edges the current tubelet-transformer cheaply. Given the local branch adds only ~0.7 F1 points in the full model (see ABLATIONS.md), ViViT is the sensible upgrade and r3d a poor compute trade unless the local branch becomes primary. Single-seed/small-val and local-only caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…OMBINED_VARIATIONS.md

Cartesian product of spatial (tubelet / ViViT / 3D ResNet) x temporal (transformer / LSTM) in the full two-branch model: 6 cells (tubelet x transformer = existing dinov2_multiscale, reused).

Threads aggregator_kind + encoder_kind through TubeMultiscaleClassifier / LitTubeMultiscaleClassifier / scripts/train.py; adds 5 params blocks + a DVC matrix (train_full_combo / evaluate_full_combo).

Val (280 tubes), F1 / PR-AUC / GFLOPs:
  tubelet x transformer   0.978 / 0.994 / 226 (default: cheapest, top F1, 0 FN)
  resnet3d x lstm         0.978 / 0.995 / 800 (best precision/PR-AUC, 3.5x FLOPs)
  vivit x lstm            0.971 / 0.991 / 251
  resnet3d x transformer  0.968 / 0.995 / 800
  vivit x transformer     0.968 / 0.993 / 251
  tubelet x lstm          0.967 / 0.995 / 226

Finding: in the full model all 6 combos tie within ~1.1 F1 points (vs 9 pts when the spatial encoder was tested in isolation): once the global DINOv2 branch is present it carries the prediction and the local encoder / aggregator choices are second-order. The cheapest default (tubelet x transformer) is statistically the best; heavier spatial encoders don't pay off. Keep the default.

Note: scripts/train.py also carries a pre-existing working-tree edit switching EarlyStopping from val/f1(max) to val/loss(min); the 5 new combos trained under it (ModelCheckpoint still selects best-val/f1, so reported metrics are best-by-f1 checkpoints). Single-seed/small-val caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rensortino and others added 3 commits May 26, 2026 23:35

docs: update Motivation

be8dad4

docs: add benchmarking in README.md

6010bff

rensortino and others added 10 commits June 2, 2026 11:02

style: ruff formatting

3248965

feat: replace scripts dependencies with unified lib package

fb72556

build: update dependencies in dvc stages

e97348b

test(leaderboard): added tube-multiscale-fusion model to the leaderboard

38d678f

rensortino wants to merge 13 commits into pyronear:main from rensortino:renato/tube-multiscale-fusion
