Automatically replace GETKF analysis and rerun forecast for members whose forecast has failed. #1463
base: rrfs-mpas-jedi
Changes from 6 commits
313aaad
72aa1d6
ddafd63
7525f6f
7122d4e
f82a507
7398625
08158c5
a8d3233
28ef67f
9a06c3a
5f2ea6c
8f2e998
8597efb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,10 +3,10 @@ | |
| import textwrap | ||
| from rocoto_funcs.base import xml_task, get_cascade_env | ||
|
|
||
| # begin of fcst -------------------------------------------------------- | ||
| # begin of prep_ic -------------------------------------------------------- | ||
|
|
||
|
|
||
| def prep_ic(xmlFile, expdir, do_ensemble=False, spinup_mode=0): | ||
| def prep_ic(xmlFile, expdir, do_ensemble=False, spinup_mode=0, resilient=False): | ||
| # spinup_mode: | ||
| # 0 = no parallel spinup cycles in the experiment | ||
| # 1 = a spinup cycle | ||
|
|
@@ -52,6 +52,28 @@ def prep_ic(xmlFile, expdir, do_ensemble=False, spinup_mode=0): | |
|
|
||
| if "global" in os.getenv("MESH_NAME"): | ||
| dcTaskEnv['cpreq'] = "ln -snf" | ||
| if not (do_ensemble and resilient): | ||
| metatask = False | ||
| meta_id = "" | ||
| meta_bgn = "" | ||
| meta_end = "" | ||
| ensindexstr = "" | ||
| else: | ||
| meta_id = 'resilient_prep_ic' | ||
|
Contributor
Line 62: suggest changing
Contributor
Author
We don't need the nodes from NODE_PREP_IC: PREP_IC for the ensemble runs several things simultaneously (1:ppn=40), while the rerun needs only 1:ppn=1, which it gets automatically in the current definition. However, I added a line to the exp.ens_conus* sample setup files to set the walltime for the rerun task to 10 minutes, matching what the main PREP_IC requests.
Contributor
@AmandaBack-NOAA We have more resource definitions, such as 'WALLTIME'. But there are situations where prep_ic may configure some of the resources differently from the defaults.
Contributor
Author
We don't want it to inherit RESERVATION and it shouldn't use a START_TIME from PREP_IC either. Probably best to evaluate these settings on a case-by-case basis instead of making a broad rule of inheriting.
Contributor
@AmandaBack-NOAA Thanks for the discussion. If possible, we would like to be consistent unless there are any unresolvable tech challenges. Most users will automatically classify
Contributor
Author
The version of the PR originally submitted used the default resources for RESILIENT_PREP_IC and specified resources for RESILIENT_FCST. The specified resources for RESILIENT_FCST were removed at your request, @guoqing-noaa, but as you've since pointed out, there could be some issues arising from that.
Contributor
For realtime runs, we need more tests to make sure a rerun of ensemble forecasts can be done in time without a reservation.
Contributor
Author
That's an interesting opinion. Right now, the procedure as delineated by @hu5970 is that someone should notice that real-time runs have stopped, should find the member whose forecast failed, manually replace that member's GETKF analysis with the background, and then reboot the forecast task. This PR automates that procedure.
Contributor
I mean, the fcst task may stay in the queue for a long time without a reservation and hence may not resolve the issue in time. I may have misunderstood. Do you want to use reservations for the fcst rerun or not?
Contributor
Author
Ah. In that case it's right for the backup fcst task to inherit the reservation. Thanks! |
||
| metatask = True | ||
| task_id = f'{meta_id}_m#ens_index#' | ||
| dcTaskEnv['ENS_INDEX'] = "#ens_index#" | ||
| dcTaskEnv['RESILIENT_ENSEMBLE'] = "TRUE" | ||
| meta_bgn = "" | ||
| meta_end = "" | ||
| ens_size = int(os.getenv('ENS_SIZE', '2')) | ||
| ens_indices = ''.join(f'{i:03d} ' for i in range(1, int(ens_size) + 1)).strip() | ||
| meta_bgn = f''' | ||
| <metatask name="{meta_id}"> | ||
| <var name="ens_index">{ens_indices}</var>''' | ||
| meta_end = f'\ | ||
| </metatask>\n' | ||
| ensindexstr = "_m#ens_index#" | ||
| dcTaskEnv['KEEPDATA'] = get_cascade_env(f"KEEPDATA_{task_id}".upper()).upper() | ||
| # dependencies | ||
| coldhrs = coldhrs.split(' ') | ||
|
|
@@ -192,6 +214,14 @@ def prep_ic(xmlFile, expdir, do_ensemble=False, spinup_mode=0): | |
| </or> | ||
| </and> | ||
| </dependency>''' | ||
|
|
||
| # if this is a re-run because fcst died, that is the lone dependency. | ||
| if resilient: | ||
| dependencies = f''' | ||
| <dependency> | ||
| <taskdep state="Dead" task="fcst{ensindexstr}"/> | ||
| </dependency>''' | ||
| # | ||
| xml_task(xmlFile, expdir, task_id, cycledefs, dcTaskEnv, dependencies, command_id="PREP_IC") | ||
| # end of fcst -------------------------------------------------------- | ||
| xml_task(xmlFile, expdir, task_id, cycledefs, dcTaskEnv, dependencies, | ||
| metatask, meta_id, meta_bgn, meta_end, "PREP_IC") | ||
| # end of prep_ic -------------------------------------------------------- | ||
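
For readers following the diff above, here is a minimal standalone sketch of how the resilient metatask strings expand. It is not the PR's code: the helper name build_resilient_metatask is hypothetical, and only the string formats (zero-padded member indices, the metatask wrapper, and the lone Dead-state dependency on the matching fcst member) are taken from the diff.

```python
def build_resilient_metatask(ens_size, meta_id="resilient_prep_ic"):
    """Return (ens_indices, meta_bgn, meta_end, dependency) as plain strings.

    Hypothetical helper for illustration; the PR builds these inline.
    """
    # Zero-padded, space-separated member indices, e.g. "001 002 003".
    ens_indices = " ".join(f"{i:03d}" for i in range(1, ens_size + 1))

    # Opening and closing pieces of the metatask wrapper.
    meta_bgn = (
        f'<metatask name="{meta_id}">\n'
        f'<var name="ens_index">{ens_indices}</var>'
    )
    meta_end = "</metatask>\n"

    # The rerun depends only on the matching forecast member being Dead.
    dependency = (
        "<dependency>\n"
        '  <taskdep state="Dead" task="fcst_m#ens_index#"/>\n'
        "</dependency>"
    )
    return ens_indices, meta_bgn, meta_end, dependency


if __name__ == "__main__":
    indices, bgn, end, dep = build_resilient_metatask(3)
    print(bgn)
    print(dep)
    print(end)
```

Keeping #ens_index# as a literal placeholder means the workflow manager, not Python, substitutes the member number for each task in the metatask.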
So this is only to run one script?
If so, we don't need to do ${MPI_RUN_CMD} ./rank_run.x. We can run ${DATA}/script_prep_ic_${pid}.sh directly.

We can't run it directly unless we update its execute permissions, since the scripts this script creates only have r/w enabled. rank_run.x works around that and seems harmless to me.
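
To make the permissions point concrete, here is a small sketch of two generic ways to run a script that has only r/w permissions. It is an illustration under stated assumptions, not a description of how rank_run.x works: the function names and the throwaway demo script are hypothetical, and it assumes an sh interpreter is on the PATH.

```python
import os
import stat
import subprocess
import tempfile


def run_with_interpreter(path):
    """Run a script by handing it to sh, which only needs read permission."""
    done = subprocess.run(["sh", path], capture_output=True, text=True, check=True)
    return done.stdout


def make_executable(path):
    """Add the owner execute bit so the script could be invoked directly."""
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)


def write_demo_script():
    """Create a throwaway script with r/w permissions only (mode 0600)."""
    fd, path = tempfile.mkstemp(suffix=".sh")
    with os.fdopen(fd, "w") as handle:
        handle.write("echo prep_ic rerun\n")
    os.chmod(path, 0o600)
    return path


if __name__ == "__main__":
    script = write_demo_script()
    print(run_with_interpreter(script), end="")  # works without the execute bit
    make_executable(script)                      # alternative: fix permissions first
    os.remove(script)
```

Either approach avoids the permission error; which one fits here depends on how the generated scripts are meant to be launched.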