Add memory per retry | Refactor GPU parameters #38

Open
RhettRautsaw wants to merge 2 commits into miniwdl-ext:develop from RhettRautsaw:develop

@RhettRautsaw

Contributor

Changes:

  • Refactor the Slurm account, partition, and QOS handling logic for cleaner code.
  • Increase the requested Slurm memory by 1.5x on each task retry.
    • Most tasks that fail do so because of out-of-memory errors, so a retry with the same memory request is likely to fail again.
  • As an alternative to the [task_runtime]-level changes, I've included an example of how to use the dynamic partition for GPU and non-GPU tasks instead.

@RhettRautsaw

Contributor Author

Added a commit that increases memory only when an OOM exit code (137 or 253) is encountered. Also added a log line to let users know when memory is being increased on a retry.
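
For readers skimming the thread, the retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `next_memory_mb`, the constants, the logger name, and the choice to compound the 1.5x factor on top of the previous request (rather than the original one) are all assumptions.

```python
import logging
import math
from typing import Optional

logger = logging.getLogger("slurm_retry_sketch")  # hypothetical logger name

# Exit codes treated as out-of-memory, as listed in the comment above.
OOM_EXIT_CODES = frozenset({137, 253})
# Scale factor from the PR description.
MEMORY_FACTOR = 1.5


def next_memory_mb(current_mb: int, exit_code: Optional[int]) -> int:
    """Return the memory (in MB) to request for the next attempt.

    Hypothetical helper: memory grows only when the previous attempt
    ended with an OOM exit code; any other failure keeps the request as is.
    """
    if exit_code not in OOM_EXIT_CODES:
        return current_mb
    # Assumption: the factor is applied to the previous request, so it compounds.
    increased = math.ceil(current_mb * MEMORY_FACTOR)
    logger.info(
        "Exit code %s looks like OOM; raising memory from %d MB to %d MB",
        exit_code, current_mb, increased,
    )
    return increased
```

Under these assumptions, a task that starts at 4000 MB and fails twice with exit code 137 would be resubmitted with 6000 MB and then 9000 MB, while a failure with any other exit code leaves the request unchanged.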
