dagrun.duration.failed metric missing run_type tag when failure caused by dagrun_timeout #64765

@mahirhiro

Description

Bug

This is a follow-up to #29076, which fixed the missing dagrun.duration.failed
metric emission on dagrun_timeout. That fix added the Stats.timing call to
the timeout path in scheduler_job_runner.py but hardcoded:

tags={"dag_id": dag_run.dag_id}

instead of using dag_run.stats_tags.

The stats_tags property on DagRun (dagrun.py:420) is defined as:

{"dag_id": self.dag_id, "run_type": self.run_type}

The normal finish path (dagrun.py:1648) correctly uses self.stats_tags,
so run_type is present on all non-timeout failures but always absent
from timeout-caused failures.
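
The discrepancy can be reproduced without Airflow. The sketch below is illustrative only: `FakeDagRun` and the two helper functions are hypothetical stand-ins that mirror the tag expressions quoted above, not the actual scheduler code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakeDagRun:
    """Hypothetical stand-in for DagRun with only the fields relevant here."""

    dag_id: str
    run_type: str

    @property
    def stats_tags(self) -> dict[str, str]:
        # Same shape as the stats_tags property quoted above.
        return {"dag_id": self.dag_id, "run_type": self.run_type}


def tags_on_normal_failure(dag_run: FakeDagRun) -> dict[str, str]:
    # Models the normal finish path, which uses stats_tags.
    return dag_run.stats_tags


def tags_on_timeout_failure(dag_run: FakeDagRun) -> dict[str, str]:
    # Models the timeout path, which hardcodes dag_id only.
    return {"dag_id": dag_run.dag_id}


run = FakeDagRun(dag_id="example_dag", run_type="scheduled")
missing = set(tags_on_normal_failure(run)) - set(tags_on_timeout_failure(run))
print(missing)  # {'run_type'}
```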

Impact

Samples of dagrun.duration.failed emitted on dagrun_timeout carry no
run_type tag, so any monitoring query that filters or groups by run_type
silently excludes them. Teams that alert only on scheduled-run failures
will therefore miss failures caused by dagrun_timeout.
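
As a rough illustration of the effect on a monitor, the sketch below filters a hand-written list of metric samples by run_type. The sample data and the `matches` helper are assumptions made up for this example; a real backend (StatsD, Datadog, Prometheus, etc.) would apply the equivalent tag filter in its own query language.

```python
# Two failed runs of the same scheduled DAG, as a metrics backend might see them.
samples = [
    # Ordinary failure: emitted via stats_tags, so run_type is present.
    {"metric": "dagrun.duration.failed",
     "tags": {"dag_id": "etl", "run_type": "scheduled"}},
    # Failure caused by dagrun_timeout: run_type is absent.
    {"metric": "dagrun.duration.failed",
     "tags": {"dag_id": "etl"}},
]


def matches(sample: dict, **wanted: str) -> bool:
    """Return True if the sample carries every requested tag value."""
    return all(sample["tags"].get(key) == value for key, value in wanted.items())


scheduled_failures = [s for s in samples if matches(s, run_type="scheduled")]
print(len(scheduled_failures))  # 1, although both failed runs were scheduled
```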

Fix

One-line change in airflow-core/src/airflow/jobs/scheduler_job_runner.py
around line 2011:

# Before
tags={"dag_id": dag_run.dag_id}

# After
tags=dag_run.stats_tags
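
A regression test for this could look roughly like the sketch below. It is not taken from the Airflow test suite: `emit_timeout_duration` is a hypothetical helper that models the fixed call, and the `Stats` object is replaced by a `MagicMock`, so the sketch only shows the shape of the assertion.

```python
from datetime import timedelta
from types import SimpleNamespace
from unittest.mock import MagicMock


def emit_timeout_duration(stats, dag_run, duration):
    """Hypothetical helper modelling the timeout path after the fix."""
    stats.timing("dagrun.duration.failed", duration, tags=dag_run.stats_tags)


# Stand-ins for Stats and DagRun; the real objects are not needed here.
stats = MagicMock()
dag_run = SimpleNamespace(
    dag_id="example_dag",
    stats_tags={"dag_id": "example_dag", "run_type": "scheduled"},
)

emit_timeout_duration(stats, dag_run, timedelta(seconds=30))

# The timeout path should now emit the same tags as the normal finish path.
stats.timing.assert_called_once()
emitted_tags = stats.timing.call_args.kwargs["tags"]
assert emitted_tags == dag_run.stats_tags
assert "run_type" in emitted_tags
```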

Version

Confirmed present in 3.1.7. Likely affects all versions since #29076
was merged (2.5.2+).

Metadata

Labels

area:Scheduler, area:metrics, kind:bug, priority:medium