Automatically replace GETKF analysis and rerun forecast for members whose forecast has failed. #1463

Open
AmandaBack-NOAA wants to merge 14 commits into NOAA-EMC:rrfs-mpas-jedi from AmandaBack-NOAA:resilient_pr

@AmandaBack-NOAA

Contributor

This PR introduces the capability to automatically replace an ensemble member's analysis with the pre-GETKF background if that member's forecast fails. It is intended only for the real-time system, so that it doesn't need a babysitter around the clock. It is not recommended for retros: the added tasks, which run only if needed, will prevent cycles from completing and hence later cycles from starting. In addition, tasks with a dependency on the fcst task will not start automatically if the fallback fcst runs instead (this doesn't affect real-time, as the only dependent task, save_for_next, uses a datadep rather than a taskdep). Note that workarounds for these limitations do exist in rocoto and can be added to rrfs-workflow if desired, but (my opinion) people running retros should probably be attentive to their failed tasks.

The PR also fixes 2 minor bugs that needed to be corrected to allow my testing:

  • clean.py currently tries to do math with a string; the string is now passed to the function in question wrapped in int()
  • ensmean upp has a taskdep on a task that does not exist. The taskdep has been revised to reflect the correct task name.
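
The clean.py fix follows a common pattern: values read from the environment or a config file arrive as strings, and arithmetic on them misbehaves until they are converted. A minimal sketch of that pattern, assuming a hypothetical `cutoff_hours` helper and a hypothetical `CLEAN_BACK_DAYS` setting (neither name is taken from clean.py):

```python
def cutoff_hours(back_days: int) -> int:
    """Convert a retention period in days to hours (hypothetical helper)."""
    return back_days * 24


# Stand-in for os.environ: every value is a string, even if it looks numeric.
config = {"CLEAN_BACK_DAYS": "2"}  # hypothetical setting name
raw = config["CLEAN_BACK_DAYS"]

# Without conversion, "2" * 24 repeats the string and "2" - 1 raises TypeError.
# Wrapping the value in int() at the call site makes the arithmetic numeric.
hours = cutoff_hours(int(raw))
print(hours)
```

Converting at the call site keeps the change to one line, which matches the scope of the fix described above.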

DESCRIPTION OF CHANGES:

  • Added variable RESILIENT_RESTART which, when set to true, will add tasks to the enkf workflow to run only in the event that a forecast task dies.
  • Tasks resilient_prep_ic_m## and resilient_fcst_m## are defined using the existing scripts for prep_ic and fcst to ensure consistency between the main tasks and the restart replacements. The script for prep_ic is lightly modified to be a metatask over members ONLY for the resilient restart version. Node and walltime settings for resilient_fcst are added to the sample workflows.
  • sideload/clean.py corrected to stop doing math with strings (1-line change)
  • rocoto_funcs/upp.py writes the correct ensmean mpassit task name when defining the ensmean upp task.
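
To illustrate the "run only if a forecast task dies" behavior, here is a rough sketch of how a fallback task's dependency could be generated. It is not the PR's code: the function name, the `fcst_m###` task naming, and the three-digit member format are assumptions, and it relies on Rocoto's `state="Dead"` option for `taskdep`.

```python
def resilient_dependency(member: int, enabled: bool) -> str:
    """Return a Rocoto <dependency> block for one fallback task.

    Returns an empty string when the feature is disabled, so the
    workflow XML is unchanged in that case.
    """
    if not enabled:
        return ""
    mem = f"{member:03d}"  # assumed member numbering, e.g. 001
    # Trigger only once the primary forecast for this member is marked Dead.
    return (
        "<dependency>\n"
        f'  <taskdep task="fcst_m{mem}" state="Dead"/>\n'
        "</dependency>"
    )


print(resilient_dependency(1, enabled=True))
```

Because the fallback task never becomes eligible when the primary forecast succeeds, it stays unsubmitted in a healthy cycle, which is the reason given above for not recommending the feature in retros.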

TESTS CONDUCTED:

Tested on Gaea with conus12km cycled ensembles

  • Confirmed the system is unaffected as long as the variable RESILIENT_RESTART is false or never defined
  • Confirmed that when RESILIENT_RESTART=TRUE the desired tasks are added and will run automatically when a forecast fails. Furthermore, the system runs normally when forecasts do not fail.

Machines/Platforms:

  • WCOSS2
    • Cactus/Dogwood
    • Acorn
  • RDHPCS
    • Hera
    • Jet
    • Orion
    • Hercules

Test cases:

  • Engineering tests
    • Non-DA engineering test
    • DA engineering test
      • Retro
      • Ensemble
      • Parallel
  • RRFS fire weather
  • RRFS_A:
  • RRFS_B:
  • RTMA:
  • Others:

ISSUE:

#1462

CONTRIBUTORS (optional):

Amanda Back added 4 commits April 23, 2026 18:43
…#_ensmean to succeed, but the task that is defined is named mpassit_ensmean_g##
… will run in the event a forecast fails, replacing its IC with the pre-GETKF background and rerunning the forecast. This is intended for use in the real-time parallel.

Contributor

@AmandaBack-NOAA FYI, this has been fixed in PR #1457. Thanks!

Comment thread workflow/exp/exp.ens_conus12km Outdated
export NODES_IC="<nodes>1:ppn=40</nodes>"
export NODES_LBC="<nodes>1:ppn=40</nodes>"
export NODES_FCST="<nodes>1:ppn=40</nodes>"
export NODES_RESILIENT_FCST="<nodes>1:ppn=40</nodes>"

Contributor

If we name the task with a leading fcst, like fcst_resilient_xxxx, then we don't need a separate NODES_RESILIENT_FCST. The get_cascade_env function will use NODES_FCST automatically.
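
The cascading lookup described in this comment can be pictured as a prefix match on the task name. The workflow's actual implementation is not shown in this thread, so the sketch below is only an assumed behavior: `cascade_lookup` and its arguments are hypothetical names.

```python
def cascade_lookup(env: dict, prefix: str, task_id: str, default: str = "") -> str:
    """Try PREFIX_<TASK_ID>, then drop trailing name parts until a key matches."""
    parts = task_id.upper().split("_")
    while parts:
        key = f"{prefix}_{'_'.join(parts)}"
        if key in env:
            return env[key]
        parts.pop()  # fall back to the shorter, more general name
    return env.get(prefix, default)


env = {"NODES_FCST": "<nodes>1:ppn=40</nodes>"}

# A name that starts with "fcst" falls back to NODES_FCST.
print(cascade_lookup(env, "NODES", "fcst_resilient"))

# A name that starts with "resilient" never reaches NODES_FCST.
print(repr(cascade_lookup(env, "NODES", "resilient_fcst")))
```

Under this assumption, the order of the words in the task name decides which settings are inherited, which is why the reviewer asks for the `fcst` prefix to come first.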

Contributor Author

Fixed

Comment thread scripts/exrrfs_prep_ic.sh Outdated
mem_list=("000") # if deterministic
fi

if [[ "${RESILIENT_RESTART:-"FALSE"}" == "TRUE" ]]; then

Contributor

If RESILIENT_RESTART applies only to ensembles, it would be preferable to include ENS in the variable name.
RESTART also has a special meaning in the MPAS-based rrfs-workflow: we would usually interpret it as related to restart.nc or config_do_restart.

So we would appreciate choosing a different name for this variable.
For example: FCST_ENS_RERUN, FCST_ENS_RESILIENT, etc.

Contributor Author

Fixed to RESILIENT_ENSEMBLE. Thanks!

Comment thread scripts/exrrfs_prep_ic.sh
${cpreq} "${EXECrrfs}"/rank_run.x .
${MPI_RUN_CMD} ./rank_run.x "${DATA}/script_prep_ic_*.sh"
if [[ "${RESILIENT_RESTART:-"FALSE"}" == "TRUE" ]]; then
${MPI_RUN_CMD} ./rank_run.x "${DATA}/script_prep_ic_${pid}.sh"

@guoqing-noaa Apr 24, 2026

Contributor

So this is only to run one script?
If so, we don't need to do ${MPI_RUN_CMD} ./rank_run.x .
We can run ${DATA}/script_prep_ic_${pid}.sh directly

Contributor Author

We can't run it directly unless we update its execute permissions, since the scripts this script creates only have r/w enabled. rank_run.x works around that and seems harmless to me.
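
The permission point can be checked in isolation: a file without the execute bit cannot be launched directly, but an interpreter that is handed the file only needs to read it. The sketch below demonstrates that general behavior with `sh`. How rank_run.x itself launches the scripts is not shown in this thread, and the file name is made up for the example.

```python
import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "script_demo.sh")  # hypothetical file name
    with open(script, "w") as fh:
        fh.write("echo prep_ic ran\n")
    os.chmod(script, 0o600)  # read/write only, like the generated scripts

    # Direct execution would be refused: no execute bit is set.
    directly_executable = os.access(script, os.X_OK)

    # Passing the file to an interpreter only requires read permission.
    result = subprocess.run(
        ["sh", script], capture_output=True, text=True, check=True
    )
    output = result.stdout.strip()

print(directly_executable, output)
```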

meta_end = ""
ensindexstr = ""
else:
meta_id = 'resilient_prep_ic'

Contributor

Line 62: suggest changing meta_id to prep_ic_resilient so that it can inherit any resource settings configured for prep_ic, such as NODE_PREP_IC, etc.

Contributor Author

We don't need the nodes from NODE_PREP_IC: PREP_IC for ensembles runs several things simultaneously (1:ppn=40), while the rerun just needs 1:ppn=1, which it gets automatically in the current definition. However, I added a line to the exp.ens_conus* sample setup files to set the walltime for the rerun task to 10 minutes, like the main PREP_IC requests.

Contributor

@AmandaBack-NOAA
Thanks for mentioning that the resilient runs use different NODE settings. In this situation, we can define NODE_PREP_IC_RESILIENT explicitly in config_resources/config.base.

We have more resource definitions, such as WALLTIME, ACCOUNT, QUEUE, PARTITION, RESERVATION, NATIVE, etc., although they take default values most of the time.

But there are situations where prep_ic may configure some of these resources differently from the defaults.
Using a consistent cascading name (i.e., prep_ic_resilient) lets this task inherit changes made to prep_ic. Thanks!

Contributor Author

We don't want it to inherit RESERVATION and it shouldn't use a START_TIME from PREP_IC either. Probably best to evaluate these settings on a case-by-case basis instead of making a broad rule of inheriting.

Contributor

@AmandaBack-NOAA Thanks for the discussion. If possible, we would like to be consistent unless there are unresolvable technical challenges. Most users will automatically classify prep_ic and its resilient counterpart into the same category.

Contributor Author

The version of the PR originally submitted used the default resources for RESILIENT_PREP_IC and specified resources for RESILIENT_FCST. The specified resources for RESILIENT_FCST were removed at your request, @guoqing-noaa, but as you've since pointed out, there could be some issues arising from that.

Contributor

For realtime runs, we need more tests to make sure a rerun of ensemble forecasts can be done in time without a reservation.

@AmandaBack-NOAA Apr 27, 2026

Contributor Author

That's an interesting opinion. Right now, the procedure as delineated by @hu5970 is that someone should notice that real-time runs have stopped, should find the member whose forecast failed, manually replace that member's GETKF analysis with the background, and then reboot the forecast task. This PR automates that procedure.

@guoqing-noaa Apr 27, 2026

Contributor

For realtime runs, we need more tests to make sure a rerun of ensemble forecasts can be done in time without a reservation.

I mean, the fcst task may sit in the queue for a long time without a reservation, and hence may not resolve the issue in time.
Manual rebooting uses the same reservation.

I may have misunderstood. Do you want to use reservations or not for the fcst rerun?
Also, if this PR is to address the realtime issue, we would expect it to modify the file exp/rt_ursa/exp.rrfsv2x_ens.

Contributor Author

Ah. In that case it's right for the backup fcst task to inherit the reservation. Thanks!

@guoqing-noaa

Contributor

@AmandaBack-NOAA Sorry that I was on leave early last week and missed any related discussions.
I think I understand the context more now.
This is a good supplementary capability, but it is only used for RRFSv2X realtime runs in some specific situations (i.e., when we have frequent ensemble forecast crashes).

Retros don't need this, and operations will not need it either.
So I would suggest we hold this PR for the moment.

A possible alternative for handling occasional ensemble forecast crashes is to act at the automatic monitoring level instead of modifying the workflow. We can monitor for dead ensemble forecasts and, if the crash is DA-related, re-stage the background and then reboot the associated task.

When I ran the realtime AR-PS (Prediction System), I had a monitoring script that detects any hanging fcst jobs and then scancels and reboots them (/scratch4/BMC/zrtrr/gge/ARPS/PEAR/exp/rrfsdet/monitor_12hrly_workflow.sh). We may be able to do a similar thing here. Thanks!
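
As a rough illustration of the monitoring-level alternative, the decision step could be a small function that maps task states to recovery actions. This is only a sketch: the `TaskStatus` fields, the task naming, and the action labels are hypothetical, and it does not reproduce the monitoring script mentioned above. The comment does not say what to do with crashes that are not DA-related, so the sketch simply flags them.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    name: str         # e.g. "fcst_m003" (hypothetical naming)
    state: str        # state reported by the workflow manager, e.g. "DEAD"
    da_related: bool  # whether the crash was traced to the analysis


def plan_recovery(tasks):
    """Return (action, task name) pairs for dead ensemble forecasts."""
    actions = []
    for task in tasks:
        if task.state != "DEAD" or not task.name.startswith("fcst_"):
            continue
        if task.da_related:
            # Replace the analysis with the background, then retry the task.
            actions.append(("restage_background", task.name))
            actions.append(("reboot", task.name))
        else:
            # Assumption: anything else is left for a person to inspect.
            actions.append(("flag_for_operator", task.name))
    return actions


statuses = [
    TaskStatus("fcst_m001", "SUCCEEDED", False),
    TaskStatus("fcst_m003", "DEAD", True),
    TaskStatus("fcst_m007", "DEAD", False),
]
for action in plan_recovery(statuses):
    print(action)
```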
