use more robust validation, defined at dataset level#806

Merged
ShriyaRishab merged 4 commits into
mlcommons:masterfrom
CarlosGomes98:flux/robust_validation
Jul 25, 2025

@CarlosGomes98

Contributor

This PR changes the validation to make it more uniform across submitters and harder to get wrong. The changes are as follows:

Background

During a model forward step, the model attempts to denoise a latent. To do this, we take a latent from an image (this will be our ground truth) and add noise to it. The amount of noise we add depends on the timestep, a value from 0 to 1. Naturally, the more noise we add, the harder the denoising task, and so the larger the loss we should expect.
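
The description above does not spell out the noising formula. As a rough illustration only, the sketch below assumes a linear blend between the clean latent and Gaussian noise (a rectified-flow-style interpolation). `add_noise` and the toy values are hypothetical and are not taken from the reference implementation.

```python
import random

def add_noise(latent, noise, t):
    """Blend a clean latent with noise (assumed linear interpolation).

    t = 0 returns the clean latent, t = 1 returns pure noise.
    """
    return [(1.0 - t) * x + t * n for x, n in zip(latent, noise)]

rng = random.Random(0)
latent = [0.5, -1.0, 2.0]                      # toy "ground truth" latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]  # Gaussian noise of the same shape

for t in (0.0, 0.5, 0.875):
    noisy = add_noise(latent, noise, t)
    # Distance from the clean latent grows with t, so denoising gets harder.
    distance = sum((a - b) ** 2 for a, b in zip(noisy, latent)) ** 0.5
    print(f"t={t}: distance from clean latent = {distance:.3f}")
```

Under this assumption the distance from the clean latent scales linearly with `t`, which matches the intuition that larger timesteps should produce larger losses.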

Validation

For validation, we follow the Flux paper. We use 8 equally spaced timesteps from [0, 1), i.e. (0, 1/8, 2/8, 3/8, 4/8, 5/8, 6/8, 7/8), and try to sample equally from them.
This means we have to select one of these timesteps for each validation sample.

Current approach

Currently, this is done dynamically at train time. If we let each sample have a # which corresponds to its order, its timestep will be (# % 8) / 8. Basically, we cycle from 0 to 7 over and over.
While this is fine for standard training, I realized there are a few edge cases that make it less than ideal for a benchmark:
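
A minimal sketch of the cycling rule described above. The function name `dynamic_timestep` is hypothetical; only the formula `(# % 8) / 8` comes from this description.

```python
NUM_TIMESTEPS = 8  # eight equally spaced timesteps in [0, 1)

def dynamic_timestep(sample_index: int) -> float:
    """Timestep for the sample at position `sample_index` in evaluation order."""
    return (sample_index % NUM_TIMESTEPS) / NUM_TIMESTEPS

timesteps = [dynamic_timestep(i) for i in range(10)]
print(timesteps)
# [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 0.0, 0.125]
```

Note that the result depends only on the position of a sample in the evaluation order, so anything that changes that order or the number of samples processed also changes the assignment.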

  1. There might be some unusual combinations of batch size and number of devices that don't evenly divide the validation dataset. This might mean some timesteps get slightly more samples than others, so folks would calculate slightly different validation losses.
  2. We rely on folks to correctly implement the same logic as the reference. If they don't, they might calculate a different validation loss.
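
To make the first edge case concrete, here is a small sketch with made-up numbers. It assumes a hypothetical data loader that either pads the dataset up to a multiple of the global batch size or drops the incomplete last batch; neither behavior is taken from the reference, and the dataset size of 64 is purely illustrative.

```python
from collections import Counter

NUM_TIMESTEPS = 8

def timestep_counts(num_processed: int) -> Counter:
    """Count how many processed samples land on each timestep under the cycling rule."""
    return Counter((i % NUM_TIMESTEPS) / NUM_TIMESTEPS for i in range(num_processed))

num_samples = 64          # illustrative validation set size
global_batch = 5 * 3      # hypothetical: batch size 5 on 3 devices

# Hypothetical loader behaviors when 64 is not a multiple of 15.
padded = -(-num_samples // global_batch) * global_batch   # round up   -> 75
dropped = (num_samples // global_batch) * global_batch    # round down -> 60

for label, n in (("exact", num_samples), ("padded", padded), ("drop-last", dropped)):
    counts = timestep_counts(n)
    print(label, [counts[k / NUM_TIMESTEPS] for k in range(NUM_TIMESTEPS)])
# exact     [8, 8, 8, 8, 8, 8, 8, 8]
# padded    [10, 10, 10, 9, 9, 9, 9, 9]
# drop-last [8, 8, 8, 8, 7, 7, 7, 7]
```

Because the expected loss grows with the timestep, any such imbalance shifts the averaged validation loss slightly, which is the discrepancy the first item describes.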

Proposed solution

Rather than doing this dynamically, I propose that, at validation dataset creation time, we associate each sample with a timestep. Each sample will then always be evaluated with the same timestep, regardless of parallelism or framework. This ensures everyone calculates exactly the same validation metric.
The order used for this would be exactly the same as the one the reference currently generates dynamically, so there would be no need to regenerate RCPs. (I did verify this anyway, and the convergence is unchanged.)
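
The idea can be sketched as follows. This is an illustration under assumptions, not the reference code: the record layout (`id`, `timestep`), the helper names, and the strided sharding scheme are all hypothetical.

```python
NUM_TIMESTEPS = 8

def attach_timesteps(sample_ids):
    """Fix a timestep per sample once, at dataset creation time.

    Uses the same cycling order as the dynamic rule, (index % 8) / 8.
    """
    return [
        {"id": sid, "timestep": (i % NUM_TIMESTEPS) / NUM_TIMESTEPS}
        for i, sid in enumerate(sample_ids)
    ]

def shard(records, rank, world_size):
    """Hypothetical strided sharding of the dataset across devices."""
    return records[rank::world_size]

records = attach_timesteps([f"img_{i:03d}" for i in range(20)])

def pairs_seen(world_size):
    """Collect every (sample, timestep) pair evaluated across all ranks."""
    seen = []
    for rank in range(world_size):
        seen.extend((r["id"], r["timestep"]) for r in shard(records, rank, world_size))
    return sorted(seen)

reference = pairs_seen(1)
same = all(pairs_seen(w) == reference for w in (2, 3, 7))
print(same)  # True: the pairing does not depend on the number of devices
```

Since the timestep travels with the sample, a submitter only needs to read it from the dataset; the per-sample pairing no longer depends on evaluation order, batch size, or device count.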

@CarlosGomes98 CarlosGomes98 requested a review from a team as a code owner July 23, 2025 12:58
@github-actions

github-actions Bot commented Jul 23, 2025


MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Comment thread text_to_image/README.md

@ShriyaRishab ShriyaRishab Jul 23, 2025

Contributor

To avoid confusion, can we update the pseudocode in the Quality metric section so that t is not computed but is obtained directly from the dataset? It will help submitters to see pseudocode for what they need to implement instead of how the validation dataset was originally generated.

You can add an appendix at the end with the pseudocode used to generate the timesteps for the validation dataset, as an FYI.

@ShriyaRishab ShriyaRishab merged commit c627b4a into mlcommons:master Jul 25, 2025
1 check passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 25, 2025

2 participants