use more robust validation, defined at dataset level #806
Merged
ShriyaRishab merged 4 commits on Jul 25, 2025
MLCommons CLA bot: All contributors have signed the MLCommons CLA.
Contributor
To avoid confusion, can we update the pseudocode in the Quality metric section so that t is not computed but is read directly from the dataset? Submitters would then see pseudocode for what they need to implement, rather than for how the validation dataset was originally generated.
You can add an appendix at the end with the pseudocode used to generate the timestamps for the validation dataset as an FYI.
ShriyaRishab approved these changes on Jul 25, 2025
This PR changes the validation to make it more uniform across submitters and to make errors harder to introduce. The changes are as follows:
Background
During a model forward step, the model attempts to denoise a latent. To do this, we take a latent from an image (this will be our ground truth) and add noise to it. The amount of noise we add depends on the timestep, a value from 0 to 1. Naturally, the more noise we add, the harder the denoising task, and so the larger the loss we should expect.
Validation
For validation, we follow the Flux paper. We use 8 equally spaced timesteps in [0, 1), namely (0, 1/8, 2/8, 3/8, 4/8, 5/8, 6/8, 7/8), and try to sample equally from them.
This means we have to select one of these timesteps for each validation sample.
Current approach
Currently, this is done dynamically at train time. If each sample has a number # that corresponds to its position in the evaluation order, its timestep is (# % 8) / 8. In other words, we cycle through the numerators 0 to 7 over and over.
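As a rough illustration of the rule above, here is a minimal Python sketch. The function name `dynamic_timestep` and the 0-based `sample_index` are assumptions made for this example, not names taken from the reference implementation.

```python
NUM_TIMESTEPS = 8  # eight equally spaced values in [0, 1)


def dynamic_timestep(sample_index: int) -> float:
    """Return (# % 8) / 8 for the sample at 0-based position `sample_index`.

    Hypothetical helper: the result depends only on the order in which
    samples are enumerated at evaluation time.
    """
    return (sample_index % NUM_TIMESTEPS) / NUM_TIMESTEPS


# The first ten samples cycle through 0, 1/8, ..., 7/8 and then wrap around.
timesteps = [dynamic_timestep(i) for i in range(10)]
print(timesteps)
```

Because the value is derived from the enumeration order, any change in that order changes which timestep a given sample is evaluated with.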
While this is fine for standard training, I realized there are a few edge cases that make it less than ideal for a benchmark:
Proposed solution
Rather than doing this dynamically, I propose that, at validation dataset creation time, we associate each sample with a timestep. That sample will always be evaluated with the same timestep, regardless of parallelism or framework. This ensures everyone calculates exactly the same validation metric.
The order used for this would be exactly the same as the one the reference currently generates dynamically, so there would be no need to regenerate RCPs (I verified this anyway, and the convergence is unchanged).
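The proposal can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (`"sample"` and `"timestep"` keys), the helper names, and the round-robin sharding are hypothetical and do not describe the actual dataset format or any particular framework.

```python
NUM_TIMESTEPS = 8  # eight equally spaced values in [0, 1)


def attach_timesteps(samples):
    """At dataset creation time, store a fixed timestep with each sample.

    Uses the same (index % 8) / 8 order the reference generates dynamically.
    """
    return [
        {"sample": s, "timestep": (i % NUM_TIMESTEPS) / NUM_TIMESTEPS}
        for i, s in enumerate(samples)
    ]


def shard(dataset, rank, world_size):
    """Hypothetical round-robin split across data-parallel ranks."""
    return dataset[rank::world_size]


dataset = attach_timesteps([f"latent_{i}" for i in range(16)])

# Each rank reads the stored timestep instead of recomputing it from a
# local counter, so the value does not depend on how the data is split.
seen = {}
for rank in range(4):
    for record in shard(dataset, rank, 4):
        seen[record["sample"]] = record["timestep"]

full = {r["sample"]: r["timestep"] for r in dataset}
print(seen == full)
```

With the timestep stored alongside the sample, evaluation code only needs to look it up, which is also what the review comment asks the pseudocode to show.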