dagrun.duration.failed metric missing run_type tag when failure caused by dagrun_timeout #64765
Description
Bug
This is a follow-up to #29076, which fixed the missing dagrun.duration.failed
metric emission on dagrun_timeout. That fix added the Stats.timing call to
the timeout path in scheduler_job_runner.py but hardcoded:
tags={"dag_id": dag_run.dag_id}
instead of using dag_run.stats_tags.
The stats_tags property on DagRun (dagrun.py:420) is defined as:
{"dag_id": self.dag_id, "run_type": self.run_type}
The normal finish path (dagrun.py:1648) correctly uses self.stats_tags,
so run_type is present on all non-timeout failures but always absent
from timeout-caused failures.
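The discrepancy between the two paths can be illustrated with a small self-contained sketch. `FakeDagRun`, `RecordingStats`, and the two `emit_*` helpers are hypothetical stand-ins written for this illustration, not Airflow's real classes or functions. Only the tag shapes are taken from this report.

```python
from dataclasses import dataclass


@dataclass
class FakeDagRun:
    """Hypothetical stand-in for DagRun; models only the fields used here."""

    dag_id: str
    run_type: str

    @property
    def stats_tags(self) -> dict:
        # Same shape as the stats_tags property quoted in this report.
        return {"dag_id": self.dag_id, "run_type": self.run_type}


class RecordingStats:
    """Hypothetical stub that records metric calls instead of emitting them."""

    def __init__(self) -> None:
        self.calls = []

    def timing(self, name, value, tags):
        self.calls.append((name, value, dict(tags)))


def emit_timeout_path(stats, dag_run, duration):
    # Tags as hardcoded on the timeout path described above.
    stats.timing("dagrun.duration.failed", duration, tags={"dag_id": dag_run.dag_id})


def emit_normal_path(stats, dag_run, duration):
    # Tags as used on the normal finish path described above.
    stats.timing("dagrun.duration.failed", duration, tags=dag_run.stats_tags)


stats = RecordingStats()
run = FakeDagRun(dag_id="example_dag", run_type="scheduled")
emit_timeout_path(stats, run, 12.5)
emit_normal_path(stats, run, 12.5)

timeout_tags = stats.calls[0][2]
normal_tags = stats.calls[1][2]
print("timeout path tags:", timeout_tags)
print("normal path tags: ", normal_tags)
```

In this sketch the same run produces two differently tagged samples of the same metric, depending only on which code path recorded the failure.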
Impact
It is impossible to filter dagrun.duration.failed by run_type in
monitoring queries when the failure was caused by dagrun_timeout. This
affects monitor accuracy for teams that want to alert only on scheduled
run failures.
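A minimal sketch of how this plays out in a monitor that filters on `run_type`. The sample records, the `cause` field, and `failed_scheduled_runs` are hypothetical and exist only for illustration. A real monitoring backend applies the equivalent tag filter in its own query language.

```python
# Hypothetical metric samples, shaped like the tags described in this report.
samples = [
    {
        "name": "dagrun.duration.failed",
        "tags": {"dag_id": "etl", "run_type": "scheduled"},
        "cause": "task failure",
    },
    {
        "name": "dagrun.duration.failed",
        "tags": {"dag_id": "etl"},  # run_type missing on the timeout path
        "cause": "dagrun_timeout",
    },
]


def failed_scheduled_runs(records):
    """Return failed-run samples whose run_type tag equals 'scheduled'."""
    return [
        r
        for r in records
        if r["name"] == "dagrun.duration.failed"
        and r["tags"].get("run_type") == "scheduled"
    ]


matched = failed_scheduled_runs(samples)
for record in matched:
    print(record["cause"])
```

Both samples could come from scheduled runs, yet the filter matches only the first, so the timeout-caused failure never reaches the alert.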
Fix
One-line change in task-sdk/src/airflow/jobs/scheduler_job_runner.py
around line 2011:
# Before
tags={"dag_id": dag_run.dag_id}
# After
tags=dag_run.stats_tags
Version
Confirmed present in 3.1.7. Likely affects all versions since #29076
was merged (2.5.2+).