Automatically replace GETKF analysis and rerun forecast for members whose forecast has failed. by AmandaBack-NOAA · Pull Request #1463 · NOAA-EMC/rrfs-workflow

AmandaBack-NOAA · 2026-04-23T23:33:39Z

This PR introduces the capability to automatically replace an ensemble member's analysis with the pre-GETKF background if one of the forecasts fails. It is intended only for the real-time system so it doesn't need a babysitter around the clock. For retros this is not recommended--the added tasks that only run if needed will prevent cycles from completing and hence later cycles from starting. In addition, tasks with a dependency on the fcst task will not automatically start if the fallback fcst runs instead (this doesn't affect real-time, as the only dependent task--save_for_next--uses a datadep rather than a taskdep). Note that workarounds for these limitations do exist in rocoto and can be added to rrfs-workflow if desired, but (my opinion) people running retros should probably be attentive to their failed tasks.
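One possible form of the rocoto workaround mentioned above is a dependency that is satisfied by either the main forecast or its fallback. The sketch below only builds that XML fragment. The task names `fcst_m###` and `resilient_fcst_m###` follow the naming pattern in this PR but are assumptions here, and this is not code from rrfs-workflow.

```python
import xml.etree.ElementTree as ET


def either_fcst_dependency(member: str) -> str:
    """Build a rocoto dependency met by the main OR the fallback forecast."""
    dep = ET.Element("dependency")
    either = ET.SubElement(dep, "or")
    # Assumed task names; adjust to whatever the workflow actually defines.
    ET.SubElement(either, "taskdep", task=f"fcst_m{member}")
    ET.SubElement(either, "taskdep", task=f"resilient_fcst_m{member}")
    return ET.tostring(dep, encoding="unicode")


print(either_fcst_dependency("001"))
```

A downstream task carrying this dependency would start once either forecast succeeds, which is the behavior a plain taskdep on the fcst task does not give.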

The PR also fixes 2 minor bugs that needed to be corrected to allow my testing:

clean.py currently tries to do math with a string; the string is now passed to the function in question wrapped in int()
ensmean upp has a taskdep on a task that does not exist. The taskdep has been revised to reflect the correct task name.
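The first bug is the usual "arithmetic on a string" error. A minimal illustration follows; the function and variable names are hypothetical and are not taken from clean.py.

```python
def cutoff_hour(current_hour: int, retention) -> int:
    """Return the hour before which data may be cleaned."""
    # int() accepts both "24" and 24, so callers may pass either form.
    return current_hour - int(retention)


retention = "24"  # values read from the environment or a config file are strings
try:
    100 - retention  # the failing pattern: int minus str
except TypeError as err:
    print(f"without int(): {err}")

print(cutoff_hour(100, retention))  # 76
```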

DESCRIPTION OF CHANGES:

Added variable RESILIENT_RESTART which, when set to true, will add tasks to the enkf workflow to run only in the event that a forecast task dies.
Tasks resilient_prep_ic_m## and resilient_fcst_m## are defined using the existing scripts for prep_ic and fcst to ensure consistency between the main tasks and the restart replacements. The script for prep_ic is lightly modified to be a metatask over members ONLY for the resilient restart version. Node and walltime for resilient_fcst are added to the sample workflows.
sideload/clean.py corrected to stop doing math with strings (1-line change)
rocoto_funcs/upp.py writes the correct ensmean mpassit task name when defining the ensmean upp task.
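A rough sketch of the gating described above: fallback tasks are generated per member only when the switch is on. The trigger ordering (dead forecast, then fallback prep_ic, then fallback forecast) and the main task name `fcst_m###` are assumptions made for illustration. In rocoto, the "only if the forecast died" condition could presumably be expressed with a taskdep whose state attribute is set to Dead, but that wiring is not shown here.

```python
def resilient_tasks(members, env):
    """Map each fallback task name to the task that triggers it."""
    # Same exact-match test as the shell check: unset or anything but TRUE is off.
    if env.get("RESILIENT_RESTART", "FALSE") != "TRUE":
        return {}
    tasks = {}
    for mem in members:
        # Assumed: the fallback prep_ic is triggered by the dead main forecast ...
        tasks[f"resilient_prep_ic_m{mem}"] = f"fcst_m{mem}"
        # ... and the fallback forecast runs after the fallback prep_ic.
        tasks[f"resilient_fcst_m{mem}"] = f"resilient_prep_ic_m{mem}"
    return tasks


print(resilient_tasks(["001", "002"], {"RESILIENT_RESTART": "TRUE"}))
print(resilient_tasks(["001", "002"], {}))  # {} when the variable is not defined
```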

TESTS CONDUCTED:

Tested on Gaea with conus12km cycled ensembles

Confirmed the system is unaffected as long as the variable RESILIENT_RESTART is false or never defined
Confirmed that when RESILIENT_RESTART=TRUE the desired tasks are added and will run automatically when a forecast fails. Furthermore, the system runs normally when forecasts do not fail.

Machines/Platforms:

WCOSS2
- Cactus/Dogwood
- Acorn
RDHPCS
- Hera
- Jet
- Orion
- Hercules

Test cases:

ISSUE:

#1462

CONTRIBUTORS (optional):

guoqing-noaa · 2026-04-23T23:45:04Z

@AmandaBack-NOAA FYI, this has been fixed in PR #1457. Thanks!

guoqing-noaa · 2026-04-24T15:21:07Z

 export NODES_IC="<nodes>1:ppn=40</nodes>"
 export NODES_LBC="<nodes>1:ppn=40</nodes>"
 export NODES_FCST="<nodes>1:ppn=40</nodes>"
+export NODES_RESILIENT_FCST="<nodes>1:ppn=40</nodes>"


If we name the task with a leading fcst, like fcst_resilient_xxxx, then we don't need a separate NODES_RESILIENT_FCST. The get_cascade_env function will use NODE_FCST automatically.
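To illustrate why the prefix matters, here is a minimal sketch of a cascading lookup of the kind described above. It is not the workflow's actual function; `cascade_lookup` and its behavior are assumptions for illustration only.

```python
def cascade_lookup(prefix, task_id, env):
    """Return the most specific setting for task_id, or None if none is set."""
    parts = task_id.upper().split("_")
    # For fcst_resilient_m001 try NODES_FCST_RESILIENT_M001,
    # then NODES_FCST_RESILIENT, then NODES_FCST.
    for end in range(len(parts), 0, -1):
        key = f"{prefix}_{'_'.join(parts[:end])}"
        if key in env:
            return env[key]
    return None


env = {"NODES_FCST": "<nodes>1:ppn=40</nodes>"}
print(cascade_lookup("NODES", "fcst_resilient_m001", env))  # falls back to NODES_FCST
print(cascade_lookup("NODES", "resilient_fcst_m001", env))  # None: prefix is RESILIENT
```

With this kind of lookup, a task named resilient_fcst never reaches the fcst settings, which is why it needs its own variable.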

guoqing-noaa · 2026-04-24T15:28:20Z

  mem_list=("000") # if determinitic
 fi

+if [[ "${RESILIENT_RESTART:-"FALSE"}" == "TRUE" ]]; then


If RESILIENT_RESTART applies only to ensembles, it would be better to include ENS in the variable name.
RESTART also has a special meaning in the MPAS-based rrfs-workflow: we usually interpret it as relating to restart.nc or config_do_restart.

So we would appreciate a different name for this variable.
For example: FCST_ENS_RERUN, FCST_ENS_RESILIENT, etc.

Fixed to RESILIENT_ENSEMBLE. Thanks!

guoqing-noaa · 2026-04-24T15:29:44Z

 ${cpreq} "${EXECrrfs}"/rank_run.x .
-${MPI_RUN_CMD} ./rank_run.x "${DATA}/script_prep_ic_*.sh"
+if [[ "${RESILIENT_RESTART:-"FALSE"}" == "TRUE" ]]; then
+  ${MPI_RUN_CMD} ./rank_run.x "${DATA}/script_prep_ic_${pid}.sh"


So this is only to run one script?
If so, we don't need to do ${MPI_RUN_CMD} ./rank_run.x .
We can run ${DATA}/script_prep_ic_${pid}.sh directly

We can't run it directly unless we update its execute permissions, since the scripts this script creates only have r/w enabled. rank_run.x works around that and seems harmless to me.
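For reference, the two generic ways around a missing execute bit are to hand the file to an interpreter or to add the bit. The sketch below demonstrates both on a throwaway script. The file name is hypothetical and this says nothing about how rank_run.x itself launches its scripts.

```python
import os
import stat
import subprocess
import tempfile


def run_rw_only_script(body: str) -> str:
    """Write a shell script with r/w permissions only and run it through sh."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical name, modeled on the script_prep_ic_${pid}.sh pattern.
        script = os.path.join(tmp, "script_prep_ic_demo.sh")
        with open(script, "w") as fh:
            fh.write(body)
        os.chmod(script, stat.S_IRUSR | stat.S_IWUSR)  # rw-------, no execute bit

        # Option 1: pass the file to an interpreter; no execute bit is needed.
        result = subprocess.run(
            ["sh", script], capture_output=True, text=True, check=True
        )

        # Option 2: add the owner execute bit so the file could be run directly.
        os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)
        assert os.stat(script).st_mode & stat.S_IXUSR
        return result.stdout


print(run_rw_only_script("echo prep_ic ran\n"), end="")
```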

guoqing-noaa · 2026-04-26T01:07:28Z

+        meta_end = ""
+        ensindexstr = ""
+    else:
+        meta_id = 'resilient_prep_ic'


Line 62: suggest changing meta_id to prep_ic_resilient so that it can inherit any resource settings configured for prep_ic, such as NODE_PREP_IC, etc.

We don't need the nodes from NODE_PREP_IC since PREP_IC for ensemble runs several things simultaneously (1:ppn=40 while the rerun just needs 1:ppn=1, which it gets automatically in the current definition). However, I added a line to the exp.ens_conus* sample setup files to set the walltime for the rerun task to 10 minutes like the main PREP_IC requests.

@AmandaBack-NOAA
Thanks for mentioning that the resilient runs use different NODE settings. In this situation, we can define NODE_PREP_IC_RESILIENT explicitly in config_resources/config.base.

We have more resource definitions, such as WALLTIME, ACCOUNT, QUEUE, PARTITION, RESERVATION, NATIVE, etc., although they take default values most of the time.

But there are situations where prep_ic may configure some of the resources differently from the defaults.
Using a consistent cascading scheme (i.e., prep_ic_resilient) lets this task inherit changes made to prep_ic. Thanks!

We don't want it to inherit RESERVATION and it shouldn't use a START_TIME from PREP_IC either. Probably best to evaluate these settings on a case-by-case basis instead of making a broad rule of inheriting.

@AmandaBack-NOAA Thanks for the discussion. If possible, we would like to be consistent unless there are any unresolvable technical challenges. Most users will automatically classify prep_ic and its resilient counterpart into the same category.

The version of the PR originally submitted used the default resources for RESILIENT_PREP_IC and specified resources for RESILIENT_FCST. The specified resources for RESILIENT_FCST were removed at your request @guoqing-noaa but as you've since pointed out there could be some issues arising from that.

For realtime runs, we need more tests to make sure a rerun of ensemble forecasts can be done in time without a reservation.

That's an interesting opinion. Right now, the procedure as delineated by @hu5970 is that someone should notice that real-time runs have stopped, should find the member whose forecast failed, manually replace that member's GETKF analysis with the background, and then reboot the forecast task. This PR automates that procedure.
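The manual step being automated (replace the failed member's analysis with its background before rerunning) amounts to a guarded file copy. The sketch below shows that idea only. The directory layout and the file names `background.nc` and `analysis.nc` are hypothetical and do not reflect the actual rrfs-workflow paths.

```python
import shutil
import tempfile
from pathlib import Path


def fall_back_to_background(member_dir: Path,
                            background: str = "background.nc",
                            analysis: str = "analysis.nc") -> Path:
    """Overwrite a member's analysis with its pre-GETKF background."""
    src = member_dir / background
    dst = member_dir / analysis
    if not src.is_file():
        # Refuse to continue rather than rerun the forecast from a missing IC.
        raise FileNotFoundError(f"no background to fall back on: {src}")
    shutil.copy2(src, dst)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    mem = Path(tmp) / "mem001"  # hypothetical member directory
    mem.mkdir()
    (mem / "background.nc").write_text("background state")
    (mem / "analysis.nc").write_text("analysis that crashed the forecast")
    print(fall_back_to_background(mem).read_text())  # background state
```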

For realtime runs, we need more tests to make sure a rerun of ensemble forecasts can be done in time without a reservation.

I mean the fcst task may stay in the queue for a long time without a reservation, and hence may not resolve the issue in time.
Manual rebooting uses the same reservation.

I may have misunderstood. Do you want to use reservations or not for the fcst rerun?
Also, if this PR is to address the realtime issue, we expect to modify the file exp/rt_ursa/exp.rrfsv2x_ens.

Ah. In that case it's right for the backup fcst task to inherit the reservation. Thanks!

guoqing-noaa · 2026-04-27T22:18:09Z

@AmandaBack-NOAA Sorry that I was on leave early last week and missed any related discussions.
I think I understand the context more now.
This is a good supplementary capability, but it is only used for RRFSv2X realtime runs in some specific situations (i.e., when we have frequent ensemble forecast crashes).

Retros don't need this, and operations will not need it either.
So I would suggest we hold this PR for the moment.

Another possible alternative for addressing occasional ensemble forecast crashes is to handle them at the automatic monitoring level instead of modifying the workflow. We can monitor for dead ensemble forecasts and, if a crash is DA-related, re-stage the background and then reboot the associated task.

When I ran the realtime AR-PS (Prediction System), I had a monitoring script that detected any hanging fcst jobs and then ran scancel on them and rebooted them (/scratch4/BMC/zrtrr/gge/ARPS/PEAR/exp/rrfsdet/monitor_12hrly_workflow.sh). We may be able to do something similar here. Thanks!
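A monitoring-level approach could look roughly like the sketch below: pick out dead forecast tasks from a task-to-state mapping and build the reboot command for each. How the states are obtained, the check for a DA-related crash, and the re-staging step are all left out. The task names, paths, and cycle value are hypothetical, and the rocotoboot options shown (-w, -d, -c, -t) should be checked against the installed rocoto version. The commands are only constructed, never executed.

```python
def dead_fcst_tasks(states):
    """Return the names of forecast tasks whose state is DEAD, sorted."""
    return sorted(
        task for task, state in states.items()
        if task.startswith("fcst") and state.upper() == "DEAD"
    )


def reboot_command(workflow_xml, database, cycle, task):
    """Build (but do not run) a rocotoboot invocation for one task."""
    return ["rocotoboot", "-w", workflow_xml, "-d", database,
            "-c", cycle, "-t", task]


# Hypothetical snapshot of task states for one cycle.
states = {"fcst_m001": "SUCCEEDED", "fcst_m002": "DEAD", "prep_ic": "SUCCEEDED"}
for task in dead_fcst_tasks(states):
    print(" ".join(reboot_command("rrfs.xml", "rrfs.db", "202604231200", task)))
```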

Amanda Back added 4 commits April 23, 2026 18:43

Correct error trying to do math with number of com errors to clean formatted as a string

313aaad

Fix error that the task upp_ensmean_g## waits for the task mpassit_g##_ensmean to succeed, but the task that is defined is named mpassit_ensmean_g##

72aa1d6

Workflow changes to enable adding tasks to the ensemble workflow that will run in the event a forecast fails, replacing its IC with the pre-GETKF background and rerunning the forecast. This is intended for use in the real-time parallel.

ddafd63

linter

7525f6f

AmandaBack-NOAA requested review from BenjaminBlake-NOAA, MatthewPyle-NOAA and ShunLiu-NOAA as code owners April 23, 2026 23:33

guoqing-noaa reviewed Apr 23, 2026

Comment thread workflow/sideload/clean.py

guoqing-noaa reviewed Apr 24, 2026

guoqing-noaa requested review from chunhuazhou and hu5970 April 24, 2026 15:25

guoqing-noaa reviewed Apr 24, 2026

MatthewPyle-NOAA and others added 2 commits April 24, 2026 14:52

Merge branch 'rrfs-mpas-jedi' into resilient_pr

7122d4e

Variable naming changes requested by Guoqing

f82a507

guoqing-noaa reviewed Apr 26, 2026

resource request for the prep_ic task rerun

7398625

MatthewPyle-NOAA added 7 commits April 29, 2026 12:37

Merge branch 'rrfs-mpas-jedi' into resilient_pr

08158c5

Merge branch 'rrfs-mpas-jedi' into resilient_pr

a8d3233

Merge branch 'rrfs-mpas-jedi' into resilient_pr

28ef67f

Merge branch 'rrfs-mpas-jedi' into resilient_pr

9a06c3a

Merge branch 'rrfs-mpas-jedi' into resilient_pr

5f2ea6c

Merge branch 'rrfs-mpas-jedi' into resilient_pr

8f2e998

Merge branch 'rrfs-mpas-jedi' into resilient_pr

8597efb

3 participants
