Add scripts to test GPUs in the cluster#16

Draft
lauraporta wants to merge 1 commit into main from cluster-gpu-tests

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
This pull request introduces a set of generalized scripts for running an arbitrary Python script on every available GPU node in an HPC cluster. You can specify which conda environment to use and which module to use when testing your script.

A few parameters specific to our cluster are hardcoded, such as the partition names.

I originally wrote these scripts to test Cellpose across GPUs. I sometimes see erratic behavior on some GPUs, so it is useful to know which ones to exclude when submitting an array of SLURM jobs.
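
For illustration only, here is a minimal sketch of how per-node test results could be turned into an exclusion flag for a later array submission. The `results` mapping and the node names are hypothetical and not part of this PR; `--exclude` is the standard `sbatch` option for skipping specific nodes.

```python
def exclude_flag(results):
    """Build an sbatch --exclude flag from per-node test results.

    `results` maps a node name to True (test passed) or False (test failed).
    Returns an empty string when every node passed.
    """
    # Sort so the flag is stable regardless of dictionary order.
    failed = sorted(node for node, passed in results.items() if not passed)
    if not failed:
        return ""
    return "--exclude=" + ",".join(failed)


# Hypothetical results collected after running the test on each node.
results = {"gpu-a": True, "gpu-b": False, "gpu-c": False}
flag = exclude_flag(results)
```

The resulting string could then be passed along with the array options, for example `sbatch --array=0-9 --exclude=gpu-b,gpu-c job.sh`, where `job.sh` is a placeholder.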

What does this PR do?

  • Added gpu_node_runner.sh, a robust SBATCH script for running Python scripts on a single GPU node.
  • Added run_on_all_gpu_nodes.sh, a wrapper script that discovers all available GPU nodes across multiple partitions and submits jobs to each node, tracking job submissions and failures.
  • Created README.md with usage instructions and argument descriptions.
  • Added example_gpu_test.py, a template Python script that performs basic GPU checks using PyTorch, intended for use with the runner scripts.
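
The discover-then-submit flow described above can be sketched roughly as follows. This is not the content of `run_on_all_gpu_nodes.sh`; it is a Python illustration that assumes node information comes from `sinfo -N -h -o "%N %P %G"` (node name, partition, generic resources) and that `gpu_node_runner.sh` accepts the Python script path as its first argument. The node and partition names in the sample are made up.

```python
def parse_gpu_nodes(sinfo_output):
    """Return unique (node, partition) pairs for nodes that advertise a GPU.

    Expects one node per line in the form "<node> <partition> <gres>",
    as printed by `sinfo -N -h -o "%N %P %G"`.
    """
    seen = set()
    nodes = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        node, partition, gres = fields[0], fields[1], fields[2]
        if "gpu" not in gres:
            continue  # node has no GPU resource listed
        if node in seen:
            continue  # a node can appear once per partition; submit once
        seen.add(node)
        # sinfo marks the default partition with a trailing "*".
        nodes.append((node, partition.rstrip("*")))
    return nodes


def build_sbatch_command(node, partition, python_script):
    """Build the argument list that pins one job to one node."""
    return [
        "sbatch",
        f"--partition={partition}",
        f"--nodelist={node}",
        "gpu_node_runner.sh",  # assumed to take the script path as argument
        python_script,
    ]


# Hypothetical sinfo output for demonstration.
sample = """gpu-node1 gpu* gpu:a100:4
gpu-node2 gpu* gpu:rtx5000:2
cpu-node1 cpu (null)
gpu-node1 fast gpu:a100:4
"""

commands = [
    build_sbatch_command(node, partition, "example_gpu_test.py")
    for node, partition in parse_gpu_nodes(sample)
]
```

Each entry in `commands` could be passed to `subprocess.run` on a login node; recording the return code of each call is one way to track which submissions failed.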

How has this PR been tested?

Run on our cluster.

Is this a breaking change?

No

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit
