feat: add flagger_canary_phase metric with granular phase values #1927

Open
Softer wants to merge 1 commit into fluxcd:main from Softer:feat/canary-phase-metric

@Softer Softer commented Jun 3, 2026

Motivation

flagger_canary_status collapses the 11 canary phases into only 3 values (0 running, 1 successful, 2 failed). On a Grafana state-timeline this makes it impossible to distinguish WaitingPromotion, Promoting, Finalising and Succeeded — they all map to 1.

Changing flagger_canary_status would break every existing dashboard and alert, so this PR adds a new metric instead and leaves the existing one untouched.

What this PR does

Adds a new gauge flagger_canary_phase (labels: name, namespace) that exposes each phase as a unique value via a deterministic phase-to-value map:

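| Value | Phase            | Value | Phase       |
|-------|------------------|-------|-------------|
| 0     | Initializing     | 6     | Finalising  |
| 1     | Initialized      | 7     | Succeeded   |
| 2     | Waiting          | 8     | Failed      |
| 3     | Progressing      | 9     | Terminating |
| 4     | WaitingPromotion | 10    | Terminated  |
| 5     | Promoting        |       |             |
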
  • SetStatus now also sets the new gauge (via SetPhase), so every existing call site is covered without changing the scheduler.
  • flagger_canary_status is not modified.
  • The Terminating (9) phase is recorded from the finalizer (for revertOnDeletion: true canaries), and Terminated (10) from the informer delete handler for any deletion. A deleted canary therefore keeps emitting a filterable value, so queries can exclude removed canaries with flagger_canary_phase < 9.
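
For reviewers who want to see the mapping in code form, here is a minimal, standalone sketch of the phase-to-value table above and of the `flagger_canary_phase < 9` filter. It is an illustration only, not the code in this PR: the names `canaryPhaseValues`, `phaseValue` and `isLive` are hypothetical, phases are plain strings rather than Flagger's phase type, and no Prometheus client is involved.

```go
package main

import "fmt"

// canaryPhaseValues mirrors the table in this PR description.
// Hypothetical name; the real implementation may differ.
var canaryPhaseValues = map[string]float64{
	"Initializing":     0,
	"Initialized":      1,
	"Waiting":          2,
	"Progressing":      3,
	"WaitingPromotion": 4,
	"Promoting":        5,
	"Finalising":       6,
	"Succeeded":        7,
	"Failed":           8,
	"Terminating":      9,
	"Terminated":       10,
}

// phaseValue returns the gauge value for a phase and whether the phase is known.
func phaseValue(phase string) (float64, bool) {
	v, ok := canaryPhaseValues[phase]
	return v, ok
}

// isLive applies the same cut-off as the query flagger_canary_phase < 9:
// Terminating (9) and Terminated (10) are treated as removed canaries.
func isLive(value float64) bool {
	return value < 9
}

func main() {
	for _, p := range []string{"WaitingPromotion", "Succeeded", "Terminated"} {
		v, _ := phaseValue(p)
		fmt.Printf("%s=%v live=%t\n", p, v, isLive(v))
	}
}
```

A fixed map keeps the values stable across releases, which matters because dashboards and alerts would compare against the raw numbers.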

This gives a non-breaking answer to the stale-metric problem in #1029: instead of deleting metrics on canary removal (flagged as a breaking change in #1856), the phase metric exposes a terminated sentinel that dashboards/alerts can filter on. It is also relevant to #1819 where distinguishing WaitingPromotion matters.

flagger_canary_status collapses the 11 canary phases into 3 values
(0 running, 1 successful, 2 failed), so dashboards cannot tell
WaitingPromotion, Promoting, Finalising or Succeeded apart on a
Grafana state-timeline.

Add a new flagger_canary_phase gauge that exposes each phase as a
unique value (0=Initializing ... 10=Terminated) via a deterministic
phase-to-value map. SetStatus now also sets the new gauge, so every
existing call site is covered without touching the scheduler.
flagger_canary_status is left unchanged to avoid breaking existing
dashboards and alerts.

The Terminating (9) phase is recorded from the finalizer and the
Terminated (10) phase from the informer delete handler, so deleted
canaries keep emitting a filterable value (flagger_canary_phase < 9)
instead of leaving a stale series. This addresses the stale-metric
problem from fluxcd#1029 without deleting metrics, which was flagged as a
breaking change in fluxcd#1856.

Signed-off-by: Softer <sft.nik@gmail.com>