Skip to content

cylc stop --now can wait for several minutes for job logs retrieval to take place in some cases #7018

@MetRonnie

Description

@MetRonnie

Description

When stopping a couple of big NWP workflows that had been running for ~ a dozen days, they took several minutes to actually shut down after the cylc stop --now command was received.

In the example below, the timing of the job logs retrieve command returning followed by the logging of "Workflow shutting down" is suspicious.

Logs

2025-10-06T14:34:04Z INFO - [20251006T0900Z/enuk_um_fcst_013/01:running] => succeeded
2025-10-06T14:34:04Z INFO - [20251006T2100Z/enuk_um_fcst_013:waiting(runahead)] => waiting
2025-10-06T14:34:14Z INFO - Command "stop" received. ID=63ffa69c-1ebd-4615-9f37-8788724fc693
    stop(mode=REQUEST(NOW))
2025-10-06T14:34:14Z INFO - Command "stop" actioned. ID=63ffa69c-1ebd-4615-9f37-8788724fc693
2025-10-06T14:34:35Z INFO - [20251006T1100Z/ukv_surf_jules_1/01:running] => succeeded
2025-10-06T14:34:35Z INFO - [20251006T1100Z/ukv_surf_ekf:waiting(runahead)] => waiting
2025-10-06T14:34:49Z INFO - [20251006T1100Z/ukv_surf_jules_0/01:running] => succeeded
2025-10-06T14:34:49Z INFO - [20251006T0900Z/enuk_um_fcst_014/01:running] => succeeded
2025-10-06T14:34:49Z INFO - [20251006T0900Z/enuk_um_fcst_end:waiting] => waiting(queued)
2025-10-06T14:34:49Z INFO - [20251006T2100Z/enuk_um_fcst_014:waiting(runahead)] => waiting
2025-10-06T14:36:52Z INFO - Command "set_verbosity" received. ID=741f94bc-c2d6-4f14-a53f-d433df32a62e
    set_verbosity(level=10)
2025-10-06T14:36:53Z DEBUG - Processing 1 queued command(s)
2025-10-06T14:36:53Z INFO - Command "set_verbosity" actioned. ID=741f94bc-c2d6-4f14-a53f-d433df32a62e
2025-10-06T14:37:37Z DEBUG - P30D inactivity timer starts NOW
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:running] (received)succeeded
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:running] => succeeded
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:succeeded] Queueing succeeded handler: os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting(runahead)] added to the n=0 window
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting(runahead)] => waiting
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_ops_odb_to_odb2_radarz:waiting] => waiting(queued)
2025-10-06T14:37:37Z INFO - [20251006T1000Z/ukv_user_hook:waiting] => waiting(queued)
2025-10-06T14:37:37Z DEBUG - [20251006T1000Z/ukv_ops_process_analysis_radarz/01:succeeded] removed from the n=0 window: completed
2025-10-06T14:37:37Z DEBUG - Runahead: base point 20251006T0900Z
2025-10-06T14:37:37Z DEBUG - Runahead: base point 20251006T0900Z
2025-10-06T14:37:37Z DEBUG - 20251006T1000Z/ukv_ops_process_analysis_radarz/01 handler:job-logs-retrieve for task event:succeeded will run after PT10S (after 2025-10-06T14:37:47Z)
2025-10-06T14:37:38Z DEBUG - os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
2025-10-06T14:37:38Z DEBUG - [(('event-handler-00', 'succeeded'), 1) cmd] os_event_hook --mode=para 'succeeded' 'os47/uk' '20251006T1000Z/ukv_ops_process_analysis_radarz' 'job succeeded'
    [(('event-handler-00', 'succeeded'), 1) ret_code] 0
    [(('event-handler-00', 'succeeded'), 1) out] [INFO] 4 arguments - task event
2025-10-06T14:37:50Z DEBUG - [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') cmd] rsync ... --include=/20251006T1000Z/ukv_ops_process_analysis_radarz ... <platform-login-node>:cylc-run/os47/uk/log/job/ ~/cylc-run/os47/uk/log/job/
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') ret_code] 0
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') out]
    receiving incremental file list
    20251006T1000Z/ukv_ops_process_analysis_radarz/
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.err
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.out
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.stats
    20251006T1000Z/ukv_ops_process_analysis_radarz/01/job.status
    sent 356 bytes  received 247,288 bytes  165,096.00 bytes/sec
2025-10-06T14:37:55Z DEBUG - [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') cmd] rsync ... --include=/20251006T0900Z/enuk_um_fcst_013 ... <platform-login-node>:cylc-run/os47/uk/log/job/ ~/cylc-run/os47/uk/log/job/
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') ret_code] 0
    [TaskJobLogsRetrieveContext(key='job-logs-retrieve', platform_name='exa', max_size='32M') out]
    receiving incremental file list
    20251006T0900Z/enuk_um_fcst_013/01/
    20251006T0900Z/enuk_um_fcst_013/01/job.err
    20251006T0900Z/enuk_um_fcst_013/01/job.out
    sent 221 bytes  received 16,410,397 bytes  10,940,412.00 bytes/sec
2025-10-06T14:37:55Z INFO - Workflow shutting down - REQUEST(NOW)
2025-10-06T14:37:55Z WARNING - Orphaned tasks:
    * 20251006T1000Z/enuk_um_fcst_016 (running)
    * 20251006T1000Z/enuk_um_fcst_017 (running)
    * 20251006T1000Z/enuk_um_fcst_015 (running)
    * 20251006T1100Z/ukv_ops_process_background_radarz (running)
2025-10-06T14:37:55Z DEBUG - stopping zmq replier...
2025-10-06T14:37:55Z DEBUG - ...stopped
2025-10-06T14:37:55Z DEBUG - stopping zmq publisher...
2025-10-06T14:37:55Z DEBUG - ...stopped
2025-10-06T14:37:55Z DEBUG - auth received API command b'TERMINATE'
2025-10-06T14:37:55Z INFO - platform: <platform-login-node> - remote tidy (on <platform-login-node>)
2025-10-06T14:37:55Z INFO - platform: excd - remote tidy (on <platform-login-node>)
2025-10-06T14:37:57Z DEBUG - Removing authentication keys from scheduler
2025-10-06T14:37:57Z DEBUG - os_event_hook --mode=para 'shutdown' 'os47/uk' 'REQUEST(NOW)'
2025-10-06T14:37:58Z INFO - [('workflow-event-handler-00', 'shutdown') cmd] os_event_hook --mode=para 'shutdown' 'os47/uk' 'REQUEST(NOW)'
    [('workflow-event-handler-00', 'shutdown') ret_code] 0
    [('workflow-event-handler-00', 'shutdown') out] [INFO] 3 arguments - workflow event
2025-10-06T14:37:58Z INFO - DONE

Reproducible Example

Edit: I've managed to come up with a repro:

# global.cylc
[platforms]
    [[login.remote.hpc]]
        retrieve job logs = True
        retrieve job logs retry delays = PT40S
# flow.cylc
[scheduling]
    [[graph]]
        R1 = foo => bar

[runtime]
    [[foo]]
        platform = login.remote.hpc
    [[bar]]
        script = sleep 60
        [[[events]]]
            started handlers = cylc stop %(workflow)s --now

There is an issue here where handlers that are due to be run in the future hold up the shutdown. Whereas I think the --now option x1 should only wait for already-running handlers to finish.

Metadata

Metadata

Assignees

No one assigned

    Labels

    could be betterNot exactly a bug, but not ideal.questionFlag this as a question for the next Cylc project meeting.
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions