Automatically replace GETKF analysis and rerun forecast for members whose forecast has failed. #1463
AmandaBack-NOAA wants to merge 14 commits into
…rmatted as a string
…#_ensmean to succeed, but the task that is defined is named mpassit_ensmean_g##
… will run in the event a forecast fails, replacing its IC with the pre-GETKF background and rerunning the forecast. This is intended for use in the real-time parallel.
@AmandaBack-NOAA FYI, this has been fixed in PR #1457. Thanks!
export NODES_IC="<nodes>1:ppn=40</nodes>"
export NODES_LBC="<nodes>1:ppn=40</nodes>"
export NODES_FCST="<nodes>1:ppn=40</nodes>"
export NODES_RESILIENT_FCST="<nodes>1:ppn=40</nodes>"
If we name the task with a leading fcst, like fcst_resilient_xxxx, then we don't need a separate NODES_RESILIENT_FCST. The get_cascade_env function will use NODES_FCST automatically.
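To illustrate the kind of prefix-based cascading described here, the following is a minimal sketch, not the workflow's actual function. The helper name cascade_lookup, its arguments, and the fallback order are assumptions made for illustration; the real implementation in rrfs-workflow may differ.

```python
import os


def cascade_lookup(prefix, task_id, env=None):
    """Return the most specific setting found for a task, or None.

    Hypothetical helper, not the workflow's own function. For
    prefix="NODES" and task_id="fcst_resilient" it tries
    NODES_FCST_RESILIENT, then NODES_FCST, then NODES.
    """
    env = os.environ if env is None else env
    parts = task_id.upper().split("_")
    while parts:
        key = f"{prefix}_{'_'.join(parts)}"
        if key in env:
            return env[key]
        parts.pop()  # drop the last name component and retry
    return env.get(prefix)


# Only the parent task's setting is defined here.
settings = {"NODES_FCST": "<nodes>1:ppn=40</nodes>"}

# A name that starts with "fcst" falls back to NODES_FCST.
print(cascade_lookup("NODES", "fcst_resilient", settings))

# A name that starts with "resilient" finds nothing and yields None.
print(cascade_lookup("NODES", "resilient_fcst", settings))
```

Under this assumed scheme, the order of the name components decides which parent task's settings are picked up, which is why the naming matters.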
    mem_list=("000") # if determinitic
fi

if [[ "${RESILIENT_RESTART:-"FALSE"}" == "TRUE" ]]; then
If RESILIENT_RESTART applies only to ensembles, it would be better to include ENS in the variable name.
RESTART has a special meaning in the MPAS-based rrfs-workflow: we usually interpret it as referring to restart.nc or config_do_restart.
So we would appreciate a different name for this variable.
For example: FCST_ENS_RERUN, FCST_ENS_RESILIENT, etc.
Fixed to RESILIENT_ENSEMBLE. Thanks!
${cpreq} "${EXECrrfs}"/rank_run.x .
${MPI_RUN_CMD} ./rank_run.x "${DATA}/script_prep_ic_*.sh"
if [[ "${RESILIENT_RESTART:-"FALSE"}" == "TRUE" ]]; then
  ${MPI_RUN_CMD} ./rank_run.x "${DATA}/script_prep_ic_${pid}.sh"
So this runs only one script?
If so, we don't need ${MPI_RUN_CMD} ./rank_run.x.
We can run ${DATA}/script_prep_ic_${pid}.sh directly.
We can't run it directly unless we update its execute permissions, since the scripts this script creates only have r/w enabled. rank_run.x works around that and seems harmless to me.
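For reference, a script without the execute bit can still be run by passing it to the interpreter explicitly, since that only requires read access. The sketch below demonstrates the principle on a throwaway file; the file name is a hypothetical stand-in, and this is not something the PR does.

```python
import os
import stat
import subprocess
import tempfile

# Write a throwaway script with read/write permission only (no execute bit),
# mimicking the permissions of the generated scripts described above.
with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "script_prep_ic_demo.sh")  # hypothetical name
    with open(script, "w") as fh:
        fh.write("echo prep_ic ran\n")
    os.chmod(script, stat.S_IRUSR | stat.S_IWUSR)

    # Handing the file to the shell as an argument needs read access only.
    result = subprocess.run(
        ["sh", script], capture_output=True, text=True, check=True
    )
    output = result.stdout.strip()

print(output)
```

Whether this is preferable to going through rank_run.x is a judgment call; both avoid changing the generated scripts' permissions.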
    meta_end = ""
    ensindexstr = ""
else:
    meta_id = 'resilient_prep_ic'
Line 62: suggest changing meta_id to prep_ic_resilient so that it can inherit any resource settings configured for prep_ic, such as NODE_PREP_IC.
We don't need the nodes from NODE_PREP_IC, since PREP_IC for ensembles runs several things simultaneously (1:ppn=40), while the rerun needs only 1:ppn=1, which it gets automatically under the current definition. However, I added a line to the exp.ens_conus* sample setup files that sets the walltime for the rerun task to 10 minutes, matching what the main PREP_IC requests.
@AmandaBack-NOAA
Thanks for mentioning that the resilient runs use different NODE settings. In this situation, we can define NODE_PREP_IC_RESILIENT explicitly in config_resources/config.base.
We have more resource definitions, such as WALLTIME, ACCOUNT, QUEUE, PARTITION, RESERVATION, NATIVE, etc., although they take default values most of the time.
But there are situations where prep_ic configures some of these resources differently from the defaults.
Using a consistent cascading name (i.e., prep_ic_resilient) lets this task inherit changes made to prep_ic. Thanks!
We don't want it to inherit RESERVATION and it shouldn't use a START_TIME from PREP_IC either. Probably best to evaluate these settings on a case-by-case basis instead of making a broad rule of inheriting.
@AmandaBack-NOAA Thanks for the discussion. If possible, we would like to be consistent unless there are unresolvable technical challenges. Most users will automatically classify prep_ic and its resilient counterpart into the same category.
The version of the PR originally submitted used the default resources for RESILIENT_PREP_IC and specified resources for RESILIENT_FCST. The specified resources for RESILIENT_FCST were removed at your request, @guoqing-noaa, but as you've since pointed out, that could cause some issues.
For realtime runs, we need more tests to make sure a rerun of ensemble forecasts can be done in time without a reservation.
That's an interesting opinion. Right now, the procedure as delineated by @hu5970 is that someone should notice that real-time runs have stopped, should find the member whose forecast failed, manually replace that member's GETKF analysis with the background, and then reboot the forecast task. This PR automates that procedure.
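As a rough sketch of the manual steps listed above (find the failed member, overwrite its analysis with the background, rerun the forecast), assuming a simplified directory layout: the file names analysis.nc and background.nc, the mem### directories, and the rerun callback are all hypothetical placeholders, not the workflow's real paths or task interface.

```python
import shutil
import tempfile
from pathlib import Path


def restore_background(member_dir, analysis="analysis.nc", background="background.nc"):
    """Overwrite a member's analysis file with its pre-GETKF background.

    The file names are hypothetical placeholders; the real workflow's
    paths and naming are not shown in this thread.
    """
    member_dir = Path(member_dir)
    shutil.copyfile(member_dir / background, member_dir / analysis)


def recover_failed_members(run_dir, failed_members, rerun):
    """Apply the manual procedure to each failed member, then rerun it."""
    for mem in failed_members:
        restore_background(Path(run_dir) / f"mem{mem}")
        rerun(mem)  # stand-in for rebooting the forecast task


# Tiny demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    mdir = Path(tmp) / "mem003"
    mdir.mkdir()
    (mdir / "background.nc").write_text("background")
    (mdir / "analysis.nc").write_text("analysis")
    rerun_log = []
    recover_failed_members(tmp, ["003"], rerun_log.append)
    restored = (mdir / "analysis.nc").read_text()

print(restored, rerun_log)
```

In the PR itself these steps are expressed as additional workflow tasks rather than as a standalone script.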
For realtime runs, we need more tests to make sure a rerun of ensemble forecasts can be done in time without a reservation.
I mean, the fcst task may sit in the queue for a long time without a reservation and hence may not resolve the issue in time.
Manual rebooting uses the same reservation.
I may have misunderstood. Do you want to use reservations for the fcst rerun or not?
Also, if this PR is to address the realtime issue, we expect to modify the file exp/rt_ursa/exp.rrfsv2x_ens.
Ah. In that case it's right for the backup fcst task to inherit the reservation. Thanks!
@AmandaBack-NOAA Sorry that I was on leave early last week and missed any related discussions. Retros don't need this, and operations will not need it either. An alternative way to address occasional ensemble forecast crashes is to handle them at the automatic monitoring level instead of modifying the workflow. We can monitor for dead ensemble forecasts, and if a crash is DA-related, we can re-stage the background and then reboot the associated task. When I ran the realtime AR-PS (Prediction System), I had a monitoring script that detected any hanging fcst jobs and then ran scancel on them and rebooted them (…
This PR introduces the capability to automatically replace an ensemble member's analysis with the pre-GETKF background if that member's forecast fails. It is intended only for the real-time system, so that it doesn't need a babysitter around the clock. For retros this is not recommended: the added tasks, which run only if needed, will prevent cycles from completing and hence later cycles from starting. In addition, tasks with a dependency on the fcst task will not automatically start if the fallback fcst runs instead (this doesn't affect real-time, as the only dependent task, save_for_next, uses a datadep rather than a taskdep). Note that workarounds for these limitations do exist in rocoto and can be added to rrfs-workflow if desired, but (my opinion) people running retros should probably be attentive to their failed tasks.
The PR also fixes 2 minor bugs that needed to be corrected to allow my testing:
DESCRIPTION OF CHANGES:
TESTS CONDUCTED:
Tested on Gaea with conus12km cycled ensembles
Machines/Platforms:
Test cases:
ISSUE:
#1462
CONTRIBUTORS (optional):