You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat: validate sequences with the temporal model before triangulation (#615)
* feat(sequences): add temporal_model_score and is_validated columns
* feat(temporal): add temporal model API client with circuit breaker
* feat(detections): gate triangulation on risk+temporal validation via background task
* test(temporal): cover temporal service and mark seeded sequences validated
* style(detections): order imports and drop redundant casts
* test(temporal): cover validation pipeline and update sequence fixtures
* fix(temporal): validation-first scheduling, scoreless guard, atomic claim
Address review: schedule validation before fallible notification tasks so a
failing webhook can't skip triangulation; raise on temporal call failure to
distinguish it from a scoreless 200 (no longer fails open on uncalibrated
responses); claim is_validated atomically so concurrent tasks can't
double-triangulate; guard the Slack call defensively.
* build: wire TEMPORAL_API_URL and TEMPORAL_MODEL_THRESHOLD into backend env
* fix(temporal): configurable timeout (default 30s) and log real error
The model infers serially, so concurrent validation calls queue and the old
10s timeout tripped the breaker under bursts. Make the timeout a setting
(TEMPORAL_API_TIMEOUT, default 30s) and log the exception repr so timeouts
aren't masked as an empty message.
* fix(temporal): bound in-flight calls with a semaphore so bursts queue
The model infers serially; many concurrent validation calls were piling up and
timing out, tripping the breaker and losing scores. Bound concurrency with a
configurable semaphore (TEMPORAL_API_MAX_CONCURRENCY, default 1) so bursts queue
and every sequence is scored. Fail-open stays only for genuine unavailability.
* refactor(temporal): never hold a DB session across the temporal call
Split validate_sequence into phases (read+gate / temporal call with no session /
persist+triangulate+notify) so queued calls don't hold a DB connection and
exhaust the pool under bursts. Also re-check the breaker inside the semaphore
(it may open while a call waits in the queue), and document the per-process
concurrency caveat for multi-worker deploys.
* fix(temporal): address review - no fail-open resurrection, isolate task failures
- _temporal_verdict drops sequences past MAX_FRAMES before the availability
check, so a model-declined sequence can't be resurrected by a later fail-open
- wrap validate_sequence in a catch-all so one task's failure doesn't abort
sibling validation/notification tasks, and a post-claim failure is logged loudly
- note the temporal API is called without credentials (assumes private network)
- test the append path schedules validation, plus the hard-drop regression
* docs(temporal): record sign-off that sub-MIN_FRAMES sequences are intentionally suppressed
* feat(temporal): send bearer token when TEMPORAL_API_TOKEN is set
The temporal API now guards /predict with a shared bearer token
(pyronear/temporal-model#35). Send Authorization: Bearer when configured;
unset matches a server with auth disabled (open /predict).
* fix(temporal): release the validation claim when alert attachment fails
If _attach_sequence_to_alert raised after claim_validation committed, the
catch-all wrapper swallowed the error and the sequence stayed validated-but-
never-alerted forever (the early is_validated check blocks any retry). Roll
the flag back on failure (fresh session) so a later detection retries the
whole pipeline; attachment is idempotent on retry.
* feat(temporal): drop the sliding window, send all frames (4-10)
Send every distinct frame of the sequence instead of the last 6; the model
accepts up to 10 frames and benefits from the full history.
* feat(temporal): scope /predict to the sequence bbox region via roi_xyxyn
Send the union envelope of the sequence's primary bboxes as roi_xyxyn
(pyronear/temporal-model#41) so the verdict can't be polluted by unrelated
activity elsewhere in the frame. Omitted when no bbox parses (full-frame).
* fix(sequences): wildfire label validates and attaches a rejected sequence
A temporal-rejected sequence stays visible in the platform listings but had no
alert, and labeling it wildfire_smoke returned without creating one - manually
correcting a temporal false negative did nothing. Treat the human confirmation
as validation: claim the sequence and run the usual alert attachment.
* Revert "fix(sequences): wildfire label validates and attaches a rejected sequence"
This reverts commit 0e045a0.
* feat(sequences): DB-backed validation job state (due marker, lease, status)
The validation queue moves into Postgres so it works with any number of
uvicorn workers: validation_due_at is the queue (one entry per sequence,
COALESCE keeps FIFO order), validation_lease_until guards a claimed job
(FOR UPDATE SKIP LOCKED + lease; an expired lease means the worker died
and the job is up for grabs), validation_status traces how validation
concluded (model / fail_open_unavailable / fail_open_stale /
window_exhausted). finish_validation_job only clears the due marker if
the frame set is unchanged, so frames that arrive during scoring trigger
a re-run instead of being lost. Job fields are excluded from API
responses (re-declared on SequenceRead to reset instrumented defaults).
* feat(temporal): breaker with exponential backoff, half-open probe, 4xx exemption
Replaces the fixed 4h blackout: the breaker now opens for 60s doubled on
each re-trip (capped at 1h, with jitter), and once the pause elapses the
next call is a half-open probe — success closes the breaker, failure
re-opens it immediately at the next tier. 4xx responses fail the call
(callers still fail open) but no longer trip the breaker: they signal a
config/input problem, not an outage. The client-side semaphore is gone —
the temporal API serializes inference server-side, and each pyro-api
process runs a single validation worker.
* feat(validation): DB-coordinated worker replaces per-detection background tasks
Each detection now just marks its sequence due (idempotent enqueue); a
per-process worker loop claims due sequences from Postgres and runs the
risk + temporal + triangulation + Slack pipeline. Multi-worker safe by
construction: SKIP LOCKED + lease prevent duplicate model calls, the
queue survives restarts, and a crashed worker's job is reclaimed when
its lease expires.
Frames are read at scoring time, so backlogged sequences are scored with
their freshest frame set. Window rules: past MAX_FRAMES with a prior
score is terminal (window_exhausted, the model declined its chances);
past MAX_FRAMES never scored sends the last 10 — a backlog delays
sequences, never silently drops them. A job queued past
TEMPORAL_VALIDATION_MAX_AGE fails open explicitly (fail_open_stale) to
bound validation latency, and job errors release the lease with a 30s
retry backoff instead of spinning.
New settings: TEMPORAL_VALIDATION_POLL_SECONDS, TEMPORAL_VALIDATION_MAX_AGE,
TEMPORAL_VALIDATION_LEASE_SECONDS; TEMPORAL_API_MAX_CONCURRENCY removed.
* feat(validation): gate webhooks and Telegram behind validation like Slack
All notification channels now sit behind the temporal validation gate
and fire exactly once per sequence (atomic claim), carrying the
sequence's latest detection at validation time. A sequence the model
declines never notifies anyone — no channel bypasses the gate anymore.
create_detection no longer notifies at sequence creation.
* fix(validation): crash-safe post-claim work, score-aware window rule, fresh reads, per-org alert lock
Addresses the final review findings:
- A worker dying after winning the validation claim (deploy cancellation,
process death) left a validated-but-due row no one would ever reclaim.
The claim query no longer filters validated rows: a validated sequence
still due means post-claim work is pending, and the worker resumes it
(triangulate + notify + finish). The due marker is the completion
record; notifications become at-least-once (documented).
- A stored above-threshold score could decay into window_exhausted on
retry past MAX_FRAMES (scored 0.9 -> attach failed -> retried at 11+
frames -> dropped). The terminal branch now requires the stored score
to be a decline; an above-threshold score validates without another
model call.
- The worker re-reads the sequence at the start of the job instead of
trusting the claim-time snapshot (stale max_conf could mis-apply the
risk gate; stale score/validated flag broke the rules above).
- Alert attachment is serialized per organization with a Postgres
advisory lock so two workers validating two sequences of the same
event can't both create an alert (race pre-existed this PR).
- Doc fixes: at-least-once delivery, risk-gate retry semantics, stale
overlap.py comment, frame-count-proxy limitation.
* refactor(validation): simplify worker plumbing
- The job pipeline takes only the sequence id and reads ALL state fresh
from the DB, making the stale-snapshot class of bug impossible by
construction (the claim-time object is no longer passed around).
- validation_status merges into finish_validation_job (one UPDATE
instead of two); set_validation_status and _conclude_terminal removed.
- A 4xx from the temporal API now closes the breaker outright (the API
answered, it is reachable) instead of only skipping the failure count.
- enqueue guard uses IS DISTINCT FROM instead of OR/IS NULL.
* fix(validation): attach failure keeps the verdict; status written at claim
A failed alert attachment no longer flips is_validated back: now that
validated-but-due rows are resumed, releasing the claim could erase a
reached verdict (a fail-open or positive-score validation retried later
could be re-decided differently and dropped). The verdict and its
validation_status are written atomically by claim_validation; a failure
in the post-claim work releases only the lease and the retry resumes
attach + notify with the verdict intact. release_validation removed.
Also documents (comment + test) that the stale fail-open deliberately
does not apply below MIN_FRAMES: a short sequence isn't waiting on the
model — if its job went stale, the sequence stopped emitting (noise).
* test(validation): cover failure paths and edge cases (100% on validation.py)
- notification channel failures (webhook, Telegram, Slack) never abort
the job; no-detection sequences notify nothing
- vanished sequence/camera at processing time complete the job quietly
- in-flight TemporalUnavailableError fails open
- lost claim (concurrent validation) completes without double attach
- unparseable bbox and degenerate (zero-area) ROI fall back to full-frame
- worker loop survives errors, idles between polls, stops on cancellation
* fix(validation): detached notifications, dead-letter cap, shared status constants
Addresses the second review round:
- Notifications (webhooks, Telegram, Slack) now fire as a detached task
AFTER job completion, off the worker's critical path: a slow or hanging
channel no longer delays the next sequence's scoring nor eats into the
lease. Accepted flip side (documented): a crash between completion and
delivery loses the notification — consistent with the channels being
best-effort, the durable artifact is the alert row.
- Errored jobs no longer retry forever: validation_attempts counts
consecutive errors (reset on every completed job) and at
MAX_VALIDATION_ATTEMPTS (5) the job dead-letters as
validation_status='failed' (terminal, never re-enqueued) so a poison
job can't starve the serial worker. The staleness fail-open could not
bound this since each retry refreshes the due time.
- validation_status values now have a single source of truth in
app.models (worker and CRUD queue guards import the same constants)
and are enforced by a DB CHECK constraint.
* fix(validation): close three state-machine races found in final review
- A loser of the validation claim no longer completes the job: the
winner may have died right after the flip (lease expired mid-model-
call, sibling reprocessed and crashed), and the due marker is the only
thing that resumes its attach/notify work. Leaving the row due is
always safe: a finished winner already cleared it.
- A never-scored sequence past MAX_FRAMES whose first (truncated) score
declines now lands in window_exhausted immediately, consistent with
the stored-score rule, instead of lingering unlabeled.
- The risk-gate finish is now conditional on the frame count (read
before the gate): a max_conf bump landing mid-job keeps the job due
so the gate re-evaluates immediately instead of losing the bump until
the next detection.
* feat(validation): store model and serving-code versions with the temporal score
The temporal API now reports a version block ({api, model}) in /predict
responses (pyronear/temporal-model#48). Persist both next to
temporal_model_score so every stored score stays attributable to the
exact model release and serving image after redeploys: combined with
the is_wildfire annotations this enables per-version offline evaluation
and threshold tuning on production traffic.
The triple is written in a single UPDATE (it moves together by
construction); columns are NULL for never-scored sequences and for
unstamped (non-release) serving builds, and older servers without the
version block degrade to NULL.
@router.post("/", status_code=status.HTTP_201_CREATED, summary="Register a new wildfire detection")
351
344
asyncdefcreate_detection(
352
-
background_tasks: BackgroundTasks,
353
345
bboxes: str=Form(
354
346
...,
355
347
description="string representation of list of detection localizations, each represented as a tuple of relative coords (max 3 decimals) in order: xmin, ymin, xmax, ymax, conf",
0 commit comments