Fix parentless spawning.#7237
Conversation
a485c53 to
b39e758
Compare
b39e758 to
0dcfc96
Compare
5b3b179 to
fa6accf
Compare
|
Branch base is master, but milestone is 8.6.x. This changes some quite fundamental stuff, so it might make sense to leave this on master? |
|
Yeah, that's why I put it on master initially, but I'm not entirely sure - it still fixes an important bug and does not add any new features. (And stating the obvious, but the fundamental stuff needed changing because over time it had become a right mess). |
|
Ok, will continue reviewing against master for now... |
|
|
||
| self.pool.compute_runahead() | ||
| self.pool.release_runahead_tasks() | ||
| await self.workflow_shutdown() |
There was a problem hiding this comment.
Perhaps take this opportunity to rename workflow_shutdown to e.g. set_stop_mode
There was a problem hiding this comment.
Can we punt that as off-topic? From a quick look it doesn't just set the stop mode, it might also shut the scheduler down.
There was a problem hiding this comment.
It's a little confusing that we would attempt shutdown so soon after the start of main loop... Could you explain why this has been moved here?
At the least, I think a comment would be useful:
| await self.workflow_shutdown() | |
| # If applicable, set stop mode or shutdown on task failure: | |
| await self.workflow_shutdown() |
Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
|
All review comments addressed @MetRonnie |
Close #7235
Close #5730
Premature shutdown or stall can result if a parented instance of a sometimes parentless task ends up at the runahead limit.
See #7235 (comment)
The correct way to do this is: for any task that is parentless in one or more recurrences, always spawn the next parentless instance - which may lie beyond multiple parented instances and/or beyond the runahead limit.
In effect, a parentless instance should always spawn the next parentless instance, at runahead release time.
NOTE my first attempt at this bug fix ran into trouble because the task pool spawning logic has gradually become too complicated to follow easily, so I bit the bullet and tried to rethink it.
As a result: this PR is a significant simplification of the scheduler core:. E.g. calls to
compute_runahead()in the code: ~10 down to ~1; andgit diff master cylc/flow: 114 insertions(+), 218 deletions(-)On master: every time anything happens to spawn a task, the task pool recomputes the runahead limit and spawns and releases and queues any parentless instances of that task all the way to the limit (which is recursive, within a single main loop iteration).
On this branch: the task pool only spawns the one instance, not downstream consequences of it. The main loop then computes the limit, releases instances below the limit, and (on release) spawns the single next parentless instance (if there is one).
However, I do a single spawn-to-rh-limit at startup (not necessary, but many current integration tests expect that).Note zero functional tests broke despite the many changes to the scheduler guts here.
The only consequences to be aware of are:
await schd._main_loop()and/orschd.pool.spawn_to_runahead_limit()before checking downstream consequences (i.e. beyond immediate spawn) of operations such astriggerandset(note this actually didn't break very many existing integration tests, and they were easily fixed)Check List
CONTRIBUTING.mdand added my name as a Code Contributor.setup.cfg(andconda-environment.ymlif present).?.?.xbranch.