[single_stage_detector] Updated run_and_time.sh for customizing number of GPUs on single node by jameslai-dev · Pull Request #808 · mlcommons/training

jameslai-dev · 2025-07-28T08:32:20Z

Dear MLCommons team,

I appreciate your work, which has helped me verify the training performance after applying hardware resource virtualization to our bare-metal server.

I want to validate the model training performance in a reasonable time on both virtualized and non-virtualized environments. However, this is very challenging with the default Open Images dataset and our relatively small GPUs in a single node. Thus, I made some changes to the performance evaluation script, which may help others with similar use cases.

This pull request modifies run_and_time.sh for SSD training and introduces the following changes:

1. Added a DATASET environment variable for dataset customization. This allows the script to run training with datasets other than the default Open Images dataset. (For example, I use the COCO dataset.)
   (Reverted; see the review comment in #808.)
2. Changed --nproc_per_node=1 in the torchrun command line to --nproc_per_node=${DGXNGPU}. This enables the script to train with more than one GPU on a single node.
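
For readers who want to see the effect of the second change, here is a minimal sketch of how the launcher line could be assembled. It is not the actual contents of run_and_time.sh: the function name `build_launch_cmd`, the fallback of 1 GPU when DGXNGPU is unset, and the bare `train.py` argument list are assumptions made for illustration. The torchrun flags themselves (`--standalone`, `--nnodes`, `--nproc_per_node`) are real.

```shell
#!/bin/bash
# Sketch only: prints the torchrun command instead of executing it,
# so it can be tried on a machine without GPUs or PyTorch installed.

build_launch_cmd() {
  # Assumption: fall back to a single process when DGXNGPU is unset or empty.
  local ngpu="${DGXNGPU:-1}"
  # One worker process per GPU on a single node.
  echo "torchrun --standalone --nnodes=1 --nproc_per_node=${ngpu} train.py"
}

build_launch_cmd
```

Running it as `DGXNGPU=4 bash sketch.sh` (a hypothetical file name) would print a command that starts four worker processes, one per GPU, while leaving DGXNGPU unset keeps the previous single-process behavior.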

github-actions · 2025-07-28T08:32:29Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

jameslai-dev · 2025-07-31T00:55:46Z

recheck

jameslai-dev · 2025-08-05T00:58:50Z

recheck

ShriyaRishab · 2025-08-08T15:32:42Z

 EVALBATCHSIZE=${EVALBATCHSIZE:-${BATCHSIZE}}
 NUMEPOCHS=${NUMEPOCHS:-30}
 LOG_INTERVAL=${LOG_INTERVAL:-20}
+DATASET=${DATASET:-"openimages-mlperf"}


MLPerf requires us to use the same dataset as the reference so that results are comparable across submissions, so we should not make it a changeable parameter.

jameslai-dev · 2025-08-09

Understood. I've reverted the dataset customization commit.

This reverts commit c37ae36.
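
The hunk quoted in the review relies on bash's `${VAR:-default}` expansion, which uses the default when the variable is unset or empty. The sketch below mirrors that pattern for the three context lines of the hunk; the function name `resolve_settings` and the placeholder batch size of 32 are assumptions for illustration, not values taken from the script.

```shell
#!/bin/bash
# Sketch only: shows how the defaulting pattern in the quoted hunk resolves.

resolve_settings() {
  # Assumption: 32 is a placeholder; the hunk does not show BATCHSIZE's default.
  local batchsize="${BATCHSIZE:-32}"
  # The eval batch size falls back to the training batch size.
  local evalbatchsize="${EVALBATCHSIZE:-${batchsize}}"
  local numepochs="${NUMEPOCHS:-30}"
  local log_interval="${LOG_INTERVAL:-20}"
  echo "batch=${batchsize} eval_batch=${evalbatchsize} epochs=${numepochs} log_interval=${log_interval}"
}

resolve_settings
```

Because every setting is read from the environment first, a caller can override one value (for example `NUMEPOCHS=2`) without editing the script, which is the same mechanism the reverted DATASET line would have used.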

jameslai-dev added 2 commits July 28, 2025 15:30

Added support for dataset customization

c37ae36

Added support for setting GPUs on single node training

5238041

jameslai-dev requested a review from a team as a code owner July 28, 2025 08:32

ShriyaRishab reviewed Aug 8, 2025

Revert "Added support for dataset customization"

5e3e7c0

This reverts commit c37ae36.

jameslai-dev changed the title ~~[single_stage_detector] Updated run_and_time.sh for customizing datasets and number of GPUs~~ [single_stage_detector] Updated run_and_time.sh for customizing number of GPUs Aug 9, 2025

jameslai-dev changed the title ~~[single_stage_detector] Updated run_and_time.sh for customizing number of GPUs~~ [single_stage_detector] Updated run_and_time.sh for customizing number of GPUs on single node Aug 9, 2025

ShriyaRishab approved these changes Aug 11, 2025

ShriyaRishab merged commit c7b2283 into mlcommons:master Aug 11, 2025
1 check passed

github-actions Bot locked and limited conversation to collaborators Aug 11, 2025

ShriyaRishab merged 3 commits into mlcommons:master from jameslai-dev:dev_ssd_custom_dataset_gpus
