feat(tube-multiscale-fusion): two-branch multiscale temporal smoke classifier#73

Open
rensortino wants to merge 13 commits into pyronear:main from rensortino:renato/tube-multiscale-fusion

rensortino commented May 26, 2026

Summary

New experiment under experiments/temporal-models/tube-multiscale-fusion/ — a
two-branch temporal smoke classifier that pairs a global DINOv2 sequence
context branch with a local spatio-temporal tube transformer, fused via
cross-attention.

Motivation

Smoke and its hardest distractors (clouds, fog, haze, dust) look near-identical
in any single frame; what separates them is how they move, and they move
differently at different scales:

  • Global branch (low-frequency, long time): DINOv2 embeds each of 16 frames;
    a small transformer aggregates them into one context vector capturing the
    overall shape evolution — static/drifting (cloud, fog) vs. growing/rising
    (plume).
  • Local branch (high-frequency, short time): each 224×224 bbox patch is
    decomposed into a grid of spatial cells tracked over short, overlapping
    4-frame windows ("tubes"). A per-tube video transformer reads the local motion
    signature — turbulent/high-variance for smoke, smooth/coherent for fog/cloud —
    that a single global vector smooths away.
  • Fusion: local tubes self-attend, then the global vector acts as the
    query in cross-attention over the tubes, weighting high-frequency local
    evidence with long-range context.
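
The fusion step can be illustrated with a minimal NumPy sketch. It is not the PR's `FusionModule`: it assumes a single attention head and omits the learned Q/K/V projections, the tube self-attention, and the FFN. The names `cross_attend`, `tube_tokens`, and `valid_mask` are hypothetical. It only shows how one global query can weight a masked set of tube embeddings.

```python
import numpy as np

def cross_attend(global_vec, tube_tokens, valid_mask):
    """One-query scaled dot-product attention over tube embeddings.

    global_vec:  (D,)   query (global context vector)
    tube_tokens: (N, D) keys and values (local tube embeddings)
    valid_mask:  (N,)   True for real tubes, False for padding
    Assumes at least one valid tube.
    """
    d = global_vec.shape[-1]
    scores = tube_tokens @ global_vec / np.sqrt(d)
    scores = np.where(valid_mask, scores, -np.inf)  # padded tubes get zero weight
    scores = scores - scores[valid_mask].max()      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ tube_tokens, weights

rng = np.random.default_rng(0)
global_vec = rng.normal(size=384)
tube_tokens = rng.normal(size=(28, 384))
valid_mask = np.ones(28, dtype=bool)
valid_mask[20:] = False  # pretend the last 8 tubes are padding

fused, weights = cross_attend(global_vec, tube_tokens, valid_mask)
print(fused.shape, weights[20:].sum())
```

Because there is a single query, the output is one fused vector per sequence regardless of how many tubes are valid.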

Architecture

patches (B, 16, 3, 224, 224) + mask
   ├─ Global: DINOv2 ViT-S/14 per frame → transformer aggregator → (B, 384)
   └─ Local:  extract_tubes (grid × overlapping windows) → tubelet Conv3d +
              self-attn transformer per tube → (B, N_tubes, 384) + validity mask
   → Fusion (self-attn on tubes + cross-attn global=Q/locals=KV) → (B, 384)
   → MLP head → logit

Default geometry (winner of the included resolution sweep): 2×2 grid of
112×112 cells × 4-frame windows at stride 2 → 28 tubes/seq.
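
The tubes/seq figures follow from simple arithmetic, sketched below. The helper `tubes_per_sequence` is hypothetical, and it assumes windows are only taken where they fit entirely inside the 16-frame sequence (no padding past the last frame); under that assumption it reproduces the counts reported for each sweep variant.

```python
def tubes_per_sequence(grid, window_len, stride, num_frames=16):
    """Number of tubes = spatial cells x temporal windows.

    Assumes windows must fit fully inside the sequence.
    """
    windows = (num_frames - window_len) // stride + 1
    return grid * grid * windows

# (grid, window length, stride) for each variant in the resolution sweep
sweep = {
    "default (2x2)": (2, 4, 2),
    "spatial_4x4": (4, 4, 2),
    "spatial_8x8": (8, 4, 2),
    "temporal_stride1": (2, 4, 1),
    "temporal_len8": (2, 8, 4),
}

counts = {name: tubes_per_sequence(*cfg) for name, cfg in sweep.items()}
for name, n in counts.items():
    print(f"{name}: {n} tubes/seq")
```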

Results (val, 280 tubes, single seed)

| variant | grid / len / stride | tubes/seq | val F1 | val acc | val prec | val rec | val PR-AUC |
|---|---|---|---|---|---|---|---|
| default (2×2) | 2×2 / 4 / 2 | 28 | 0.9783 | 0.9786 | 0.9574 | 1.0000 | 0.9936 |
| spatial_4x4 | 4×4 / 4 / 2 | 112 | 0.9673 | 0.9679 | 0.9500 | 0.9852 | 0.9874 |
| spatial_8x8 | 8×8 / 4 / 2 | 448 | 0.9745 | 0.9750 | 0.9571 | 0.9926 | 0.9942 |
| temporal_stride1 | 2×2 / 4 / 1 | 52 | 0.9781 | 0.9786 | 0.9640 | 0.9926 | 0.9954 |
| temporal_len8 | 2×2 / 8 / 4 | 12 | 0.9744 | 0.9750 | 0.9638 | 0.9852 | 0.9939 |

Confusion matrix (default 2×2): TN=139, FP=6, FN=0, TP=135.
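
As a consistency check, the default row's headline metrics can be recomputed from this confusion matrix with the standard definitions. This is a plain-Python sketch, not the PR's evaluate script.

```python
# Confusion matrix for the default 2x2 variant, as reported above.
tn, fp, fn, tp = 139, 6, 0, 135

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")
```

The rounded values agree with the table (prec 0.9574, rec 1.0000, acc 0.9786, F1 0.9783).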

Benchmarking (GUIDELINES.md metrics, RTX 4090 / 24-thread CPU)

| Metric | Value |
|---|---|
| Recall @ FPR = 1% / 5% / 10% | 0.874 / 1.000 / 1.000 |
| Time-to-detection (median) | 2 frames ≈ 30 s after tube start (135/135 eventually fire) |
| Inference latency (GPU) | 10.7 ms/seq → 0.67 ms/frame |
| Inference latency (CPU) | 290 ms/seq → 18.2 ms/frame |
| Model size | 36.4 M params (16.6 M trainable, 19.9 M frozen DINOv2) |
| FLOPs | 226.5 GFLOPs/seq → 14.2 GFLOPs/frame |
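
For readers unfamiliar with the recall @ FPR metric, here is a minimal sketch on toy data. The exact definition in GUIDELINES.md is not reproduced in this PR, so this assumes "the highest recall among score thresholds whose false-positive rate does not exceed the target". The function name and the toy scores are hypothetical.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Best recall over thresholds whose FPR stays <= max_fpr.

    A sample is predicted positive when its score >= threshold.
    Assumes labels contain at least one positive and one negative.
    """
    num_neg = sum(1 for y in labels if y == 0)
    num_pos = sum(1 for y in labels if y == 1)
    best = 0.0  # a threshold above every score gives FPR 0 and recall 0
    for thr in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        if fp / num_neg <= max_fpr:
            best = max(best, tp / num_pos)
    return best

# Toy scores: one negative (0.8) outranks two of the four positives.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

strict = recall_at_fpr(scores, labels, max_fpr=0.0)
relaxed = recall_at_fpr(scores, labels, max_fpr=0.2)
print(strict, relaxed)
```

A single high-scoring negative is enough to halve recall at the strictest operating point, which is why this metric moves much more than F1 between variants.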

What's included

  • src/tube_multiscale_fusion/ — modular global_branch, local_branch,
    fusion, classifier, and the LitTubeMultiscaleClassifier Lightning module.
    Upstream tube-building / patch-cropping primitives reused from the parent.
  • scripts/train, evaluate, benchmark, and package_model
    (bundles checkpoint + exact params + manifest into a portable .zip).
  • DVC pipeline: prepare → truncate → build_tubes → build_model_input → train → evaluate → package → benchmark, plus a resolution-sweep matrix.
  • Tests — 56 unit tests (shape, masking, tube extraction, gradient flow,
    fusion); 2 @slow tests behind a marker for the real DINOv2 download.
  • README — architecture diagram, motivation, sweep results, benchmarking,
    and reproduction steps.

How to reproduce

cd experiments/temporal-models/tube-multiscale-fusion
uv sync
# import data (pinned pyro-dataset v2.2.0) — see README "Data"
uv run dvc repro

Notes for reviewers

  • Data: data/01_raw/datasets_full/ is gitignored; on a clean checkout it
    comes from the dvc import commands in the README (pinned to v2.2.0), not
    from the DVC remote.
  • DVC remote: s3://pyro-vision-rd/dvc/experiments/tube-multiscale-fusion/.
    Artifacts (checkpoints, model package, reports) are pushed via dvc push;
    benchmark.json is a cache: false metric committed directly to git.
  • Single-seed results; no multi-seed CI run yet. The small val set (280) means
    early-stopping on val/f1 saturates quickly — see README for the schedule.

rensortino and others added 3 commits May 26, 2026 23:35
…assifier

New experiment under experiments/temporal-models/tube-multiscale-fusion/.
Architecture: a global DINOv2 sequence transformer (16 frames -> single
384-d context vector) combined with a local spatio-temporal tube
transformer that decomposes each 224x224 bbox patch into a configurable
grid of cells over overlapping 4-frame windows. The two branches are
fused via self-attention over local tube tokens and cross-attention with
the global vector as query.

Default geometry (winner of the included resolution sweep): 2x2 spatial
grid of 112x112 cells x 4-frame windows at stride 2 -> 28 tubes/seq.

Val (280 sequences, single-seed):
  default 2x2:      F1=0.978  acc=0.979  prec=0.957  rec=1.000  PR-AUC=0.994
  spatial_4x4:      F1=0.967
  spatial_8x8:      F1=0.975
  temporal_stride1: F1=0.978  PR-AUC=0.995
  temporal_len8:    F1=0.974

Matches the bbox-tube-motion-fusion DINOv2+motion baseline (F1=0.978) on
val with no precomputed motion features.

Includes:
- src/: modular global/local/fusion components + Lightning module
- scripts/: train, evaluate, package_model (writes a portable .zip
  containing the checkpoint + the exact params used to train it)
- DVC pipeline: prepare/truncate/build_tubes/build_model_input/train/
  evaluate/package + a resolution sweep matrix
- tests/: 56 unit tests covering shape, masking, gradient flow, fusion
- README.md: architecture diagram, sweep results, reproduction steps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Chouffe commented Jun 1, 2026

Collaborator

Really excited about this approach 🥳 — explicitly modelling the spatio-temporal signal at two scales (global low-frequency sequence context + local high-frequency tube motion) is exactly the kind of "how does it move" cue that should separate smoke from clouds/fog/haze, and the cross-attention fusion is a clean way to weight the two. Great direction!

Really nice PR overall too 👏🏿.
Clean modular split (global_branch / local_branch / fusion / classifier / lit_module), genuinely strong test coverage, committed reproducible metrics, and it follows repo conventions.

Findings below are mostly polish — one CI blocker.

🔴 Blocker — ruff format --check fails (CI will go red)

With the locked ruff 0.15.7, four files need reformatting:

Would reformat: scripts/benchmark.py
Would reformat: scripts/evaluate.py
Would reformat: tests/test_classifier.py
Would reformat: tests/test_local_branch.py

Fix:

make format

🟠 Should fix

1. Dead prepare stage + unused heavy deps. scripts/prepare.py downloads YOLO weights into data/01_raw/models, but nothing consumes them — build_tubes.py reads label .txt files directly and its docstring says "No YOLO inference is performed." No DVC stage depends on the prepare output. Correspondingly ultralytics, opencv-python-headless, pandas, and pydantic are declared in pyproject.toml but never imported anywhere in src/, scripts/, or tests/ (ultralytics especially is a heavy footprint). Suggest dropping the prepare stage + these deps, or wiring the detector in if it's actually intended.

2. Tube-building algo is now in the lib — this experiment forks it. The canonical tube-building implementation now lives in lib/bbox-tube-temporal/ (package bbox-tube-temporal-core): compute_iou, match_detections, build_tubes, interpolate_gaps, select_longest_tube, tube_from_record — exactly the API re-implemented here in src/tube_multiscale_fusion/tubes.py (and data.py / model_input.py / types.py look similarly duplicated). The lib version is also ahead (e.g. merge_colocated_tubes). Per "shared code goes under lib/ with proper packaging," this experiment should depend on bbox-tube-temporal-core and import from it rather than carry a fork that will drift.

3. No pyrocore integration. pyrocore is a declared dependency (with a [tool.uv.sources] path) but is never imported. The convention is for each experiment to expose a TemporalModel subclass so it plugs into the shared eval interface / temporal-model-explorer — without an adapter this model can't be compared alongside the others. At minimum worth a tracked follow-up; ideally a thin wrapper around LitTubeMultiscaleClassifier.

🟡 Minor / nits

  • Copy-paste docstrings referencing the wrong architecture:
    • tubes.py module docstring mentions "the LSTM's need for temporally-ordered features" — there's no LSTM.
    • augment.py TemporalTubeTransform mentions "pack_padded_sequence (used by the GRU head)" — no GRU / pack_padded_sequence. The prefix-compaction is correct and still needed; just the stated rationale is stale.
  • Dataset version — inconsistent + now outdated: PR body & README say pinned pyro-dataset v2.2.0, but data.py's list_sequences docstring says "nested pyro-dataset v3.0.0 layout." Which is authoritative? Also note pyro-dataset v4.0.0 is now available — worth deciding whether to adopt it (results would need a re-run) or at least documenting why this experiment pins an older version.

💡 Design suggestions (follow-up, not blocking)

The architecture is great but feels heavier than the current benchmark justifies. Three ideas, in priority order, for probing whether the spatial encoding can be much simpler:

1. Ablate the local branch before optimizing it. This is the highest-value experiment, and it's nearly free — you already have the global branch. Run global-branch-only (DINOv2 + temporal aggregator → head, no local branch, no fusion) on the same split. Two signals say the local branch may not be earning its complexity: the metrics are already near-saturated (F1 0.978, recall@FPR=5% = 1.0, 0 false negatives) on a small single-seed val set, and the resolution sweep shows 2×2 / 28 tubes beat 8×8 / 448 tubes — i.e. adding spatial granularity hurt, the opposite of what you'd expect if fine local structure were the discriminative signal. If global-only lands within noise of the full model, the entire local branch + fusion is removable. If it doesn't, inspect which sequences global-only gets wrong (presumably the fog/cloud/haze distractors) and let those concrete failure cases dictate the minimal local encoder — rather than carrying the full tube transformer on the assumption it's needed.

2. If local motion does help, reuse the DINOv2 patch tokens instead of a second encoder. The global branch already runs DINOv2 on every frame but keeps only the CLS token (global_pool="token") and throws away the patch-token grid (16×16 for ViT-S/14 @ 224). That grid is a spatial decomposition — for free, and at far better quality than a small Conv3d+transformer trained from scratch on a tiny dataset. Concretely: switch the backbone to return patch tokens (B, T, 256, 384), then capture local motion with a cheap temporal op per patch location — e.g. temporal std-dev/variance over the T axis (directly encodes "turbulent smoke vs smooth fog"), or a tiny shared temporal MLP/attention per location. This deletes local_branch.py's tube extraction and the tubelet encoder while keeping the multi-scale intuition, and it removes the awkward fact that the model currently encodes the same pixels twice (once via DINOv2, once via the raw-pixel tube encoder). The 2×2-beats-8×8 result also suggests you can pool that patch grid down aggressively without losing anything.

3. Replace cross-attention fusion with pooling + concat. Independent of the above, a single global query cross-attending ~28 tubes (FusionModule, two layers of self-attn + cross-attn + FFN) is a lot of machinery for "summarize the local evidence and combine it with the global vector." Mean- and/or max-pool the local embeddings over the (masked) tube axis and concat([global_vec, pooled_local]) → MLP head is a strong, much smaller baseline — and worth running as its own ablation against the cross-attention version to confirm the attention is actually buying accuracy. This drops fusion.py entirely.

Net "simplest viable" to try alongside the current model: global DINOv2 sequence branch + temporal-variance pooling over its own patch-token grid + concat → MLP — same two-scale story, but no tube extraction, no Conv3d encoder, no cross-attention.
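
A rough NumPy sketch of that "simplest viable" variant is below. Everything in it is an assumption for illustration: the function names are hypothetical, random arrays stand in for DINOv2 patch tokens, the standard deviation over time is averaged over patch locations (max-pooling would be an equally valid choice), and the MLP head is omitted.

```python
import numpy as np

def temporal_variance_features(patch_tokens, frame_mask):
    """Per-location temporal std, averaged over the patch grid.

    patch_tokens: (T, P, D) patch embeddings per frame
    frame_mask:   (T,) True for valid (non-padded) frames
    Returns a (D,) local-motion descriptor.
    """
    valid = patch_tokens[frame_mask]      # (T_valid, P, D)
    std_over_time = valid.std(axis=0)     # (P, D): how much each location changes
    return std_over_time.mean(axis=0)     # (D,): pooled over patch locations

def fuse_by_concat(global_vec, local_vec):
    """Concatenate global context and pooled local evidence for an MLP head."""
    return np.concatenate([global_vec, local_vec])

rng = np.random.default_rng(0)
T, P, D = 16, 256, 384                    # 16 frames, 16x16 patch grid, ViT-S width
frame_mask = np.ones(T, dtype=bool)

# A static scene repeats one frame; a "turbulent" one changes every frame.
static_tokens = np.tile(rng.normal(size=(1, P, D)), (T, 1, 1))
turbulent_tokens = rng.normal(size=(T, P, D))

static_feat = temporal_variance_features(static_tokens, frame_mask)
turbulent_feat = temporal_variance_features(turbulent_tokens, frame_mask)

global_vec = rng.normal(size=D)           # stand-in for the aggregated CLS context
fused = fuse_by_concat(global_vec, turbulent_feat)
print(fused.shape, static_feat.max(), turbulent_feat.mean())
```

The static sequence yields a near-zero descriptor while the changing one does not, which is the "turbulent vs smooth" cue the suggestion relies on. Whether it separates smoke from fog on real data is exactly what the proposed ablation would test.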

rensortino and others added 10 commits June 2, 2026 11:02
Ablate the temporal module (global DINOv2 sequence branch + cross-attention
fusion): keep only the local tube decomposition, swap the tubelet-transformer
tube encoder for a Kinetics-400 pretrained r3d_18, and mask-mean-pool the
per-tube vectors into an MLP head. Same tube geometry (2x2 / len 4 / stride 2),
schedule, and data as the full default model.

Val (280 tubes), full vs ablation:
  F1        0.978 -> 0.933
  precision 0.957 -> 0.887   (FP 6 -> 17)
  recall    1.000 -> 0.985
  R@FPR=1%  0.874 -> 0.659
  FLOPs/seq 226.5 -> 603.9 GFLOPs

Removing the global context branch hurts precision and the low-false-alarm
regime far more than recall: a local 3D-CNN still finds smoke but can no longer
reliably reject slow look-alikes (cloud/fog) without long-range context — and
costs 2.7x the FLOPs to do worse. Validates the two-branch design.

Adds local_resnet3d.py, lit_ablation.py, train_ablation.py, evaluate_ablation.py,
params block, DVC train/evaluate stages, and a README ablation section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…temporal module

The previous ablation conflated two changes — removing the global branch AND
swapping the local encoder for a torchvision r3d_18 — which ran r3d over all 28
tubes and made the ablation MORE expensive than the full model (604 vs 226
GFLOPs). That is not a valid ablation.

Replace it with a faithful one: keep the local branch and fusion module exactly
(same modules, same hyperparameters); remove only the global DINOv2 sequence
branch, substituting a learnable query token for its context vector into the
fusion cross-attention. Removing the 16 per-frame DINOv2 passes makes the
ablation strictly smaller and cheaper, as an ablation should be.

Val (280 tubes), full vs ablation (no temporal module):
  F1        0.978 -> 0.872
  precision 0.957 -> 0.801   (FP 6 -> 32)
  recall    1.000 -> 0.956
  R@FPR=1%  0.874 -> 0.193
  params    36.4M -> 11.3M
  GFLOPs    226.5 -> 30.5   (~7x cheaper)

The global temporal branch is what buys the deployment-critical low-FPR
operating point: without it the local branch still finds smoke but cannot
reject slow look-alikes (cloud/fog), so precision and recall@1%FPR collapse.

Removes local_resnet3d.py + the r3d ablation stages; adds ablation_classifier.py
(AblationNoTemporalClassifier) and rewires lit_ablation / scripts / params /
dvc.yaml / README accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two more single-component ablations alongside the existing no-temporal one,
and write up the full study in ABLATIONS.md.

New variants (same data/seed/schedule/geometry; only the named part changes):
- no_spatial:    remove the local tube branch (global DINOv2 branch only).
- weighted_mean: replace the cross-attention fusion with a learned weighted
                 mean over tubes + a learned gate between branches (no attention).

Val (280 tubes) F1 / R@FPR=1%:
  full           0.978 / 0.874
  weighted_mean  0.978 / 0.837    (ties full; higher precision, fewer FP)
  no_spatial     0.971 / 0.770
  no_temporal    0.872 / 0.193    (catastrophic — global context is essential)

Findings: the temporal/global module dominates (+0.106 F1); the spatial/local
module adds a modest gain concentrated at the strict low-FPR operating point
(+0.007 F1, R@1%FPR 0.770->0.874); the attention-based fusion is statistically
indistinguishable from a weighted mean here — a simplification candidate.

Adds WeightedMeanFusion, AblationNoSpatial/WeightedMean classifiers,
LitAblationGlobal, train/evaluate_ablation_global scripts, params + DVC stages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OMPARISON.md

Compare 6 aggregators for the global branch's temporal module, isolated in the
global-only (no_spatial) setting: Transformer (baseline) vs LSTM, GRU, MLP,
1D CNN, and Linear+weighted-average. Adds a pluggable build_aggregator factory
(global_branch.py), threads aggregator_kind through GlobalBranch /
AblationNoSpatialClassifier / LitAblationGlobal / train script, plus params
blocks and a DVC matrix (train_temporal / evaluate_temporal).

Val (280 tubes), F1 / PR-AUC / recall@1%FPR:
  LSTM         0.975 / 0.993 / 0.867   (best F1, 0 FN)
  MLP          0.975 / 0.987 / 0.519   (good F1 but poor low-FPR; most params)
  1D CNN       0.971 / 0.990 / 0.770
  Transformer  0.971 / 0.990 / 0.770   (baseline, mid-pack)
  GRU          0.967 / 0.995 / 0.889   (best ranking; fewest params 2.76M)
  linear_wavg  0.960 / 0.991 / 0.815   (worst — no temporal mixing)

Findings: compute is identical across aggregators (~196 GFLOPs, ~8.8 ms — all
dominated by the 16 DINOv2 passes), so the choice is free; every learned mixer
lands within ~0.8 F1 points; pure weighted averaging is worst (needs real
temporal mixing); GRU/LSTM edge the Transformer at the strict operating point
with fewer params. Recommend GRU as default (re-validate end-to-end before
swapping production). Single-seed/small-val caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PARISON.md

Compare 5 per-tube spatial encoders, isolated in the local-only (no_temporal)
setting: tubelet-transformer (baseline) vs 3D ResNet (Kinetics r3d_18), ViViT
(factorised attention), ConvLSTM, and TSM. Adds the encoders + build_tube_encoder
factory to local_branch.py and threads encoder_kind through LocalBranch /
AblationNoTemporalClassifier / LitAblationNoTemporal / train script; params blocks
+ DVC matrix (train_spatial / evaluate_spatial).

Val (280 tubes), F1 / PR-AUC / params / GFLOPs:
  3D ResNet     0.942 / 0.979 / 40.6M / 604   (best — only pretrained encoder)
  ViViT         0.890 / 0.924 / 14.8M / 55     (best from-scratch)
  tubelet (base)0.872 / 0.902 / 11.3M / 30
  ConvLSTM      0.858 / 0.868 /  8.7M / 73
  TSM           0.850 / 0.859 /  7.7M / 27

Findings: the spatial encoder matters far more than the temporal aggregator
(F1 spread ~9 pts vs ~0.8) because in local-only it is the whole feature
extractor. Pretraining dominates — Kinetics r3d_18 cuts the false positives to
about a third (10 vs 32) but at 20x the FLOPs. Among from-scratch encoders, ViViT's factorised
attention edges the current tubelet-transformer cheaply. Given the local branch
adds only ~0.7 F1 in the full model (see ABLATIONS.md), ViViT is the sensible
upgrade and r3d a poor compute trade unless the local branch becomes primary.
Single-seed/small-val and local-only caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OMBINED_VARIATIONS.md

Cartesian product of spatial (tubelet / ViViT / 3D ResNet) x temporal
(transformer / LSTM) in the full two-branch model — 6 cells (tubelet x
transformer = existing dinov2_multiscale, reused). Threads aggregator_kind +
encoder_kind through TubeMultiscaleClassifier / LitTubeMultiscaleClassifier /
scripts/train.py; adds 5 params blocks + a DVC matrix (train_full_combo /
evaluate_full_combo).

Val (280 tubes), F1 / PR-AUC / GFLOPs:
  tubelet  x transformer  0.978 / 0.994 / 226   (default — cheapest, top F1, 0 FN)
  resnet3d x lstm         0.978 / 0.995 / 800   (best precision/PR-AUC, 3.5x FLOPs)
  vivit    x lstm         0.971 / 0.991 / 251
  resnet3d x transformer  0.968 / 0.995 / 800
  vivit    x transformer  0.968 / 0.993 / 251
  tubelet  x lstm         0.967 / 0.995 / 226

Finding: in the full model all 6 combos tie within ~1.1 F1 points (vs 9 pts when
the spatial encoder was tested in isolation) — once the global DINOv2 branch is
present it carries the prediction and the local encoder / aggregator choices are
second-order. The cheapest default (tubelet x transformer) ties for the top F1
(single seed, so no significance claim); heavier spatial encoders don't pay off. Keep the default.

Note: scripts/train.py also carries a pre-existing working-tree edit switching
EarlyStopping from val/f1(max) to val/loss(min); the 5 new combos trained under
it (ModelCheckpoint still selects best-val/f1, so reported metrics are
best-by-f1 checkpoints). Single-seed/small-val caveats documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>