Fix sample_rate calculation with WeightedRandomSampler #814
MN-NR wants to merge 2 commits into
When a DataLoader uses WeightedRandomSampler, sample_rate was incorrectly computed from len(data_loader) instead of the true dataset size, causing the privacy budget to burn 100x-1000x faster than expected. The fix captures the dataset size and batch_size before _prepare_data_loader() replaces the sampler, so that privacy accounting uses the actual dataset size. Fixes #813
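To make the size of the error concrete, here is a minimal sketch of the two calculations in plain Python. It does not call the library or PyTorch. The helper names are hypothetical, and it assumes that `len(data_loader)` equals `ceil(num_samples / batch_size)`, which is the case for a loader built with `drop_last=False`. The input values are the example configuration quoted in this PR (100k dataset, `num_samples=128`, `batch_size=16`).

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Batch count that len(data_loader) reports, assuming drop_last=False."""
    return math.ceil(num_samples / batch_size)


def rate_from_batch_count(num_batches: int) -> float:
    """The calculation this PR describes as wrong: 1 / len(data_loader)."""
    return 1 / num_batches


def rate_from_dataset(batch_size: int, dataset_size: int) -> float:
    """The calculation this PR describes as correct: batch_size / dataset size."""
    return batch_size / dataset_size


# Example configuration quoted in this PR.
dataset_size = 100_000  # true number of examples in the dataset
num_samples = 128       # draws per epoch requested from WeightedRandomSampler
batch_size = 16

num_batches = batches_per_epoch(num_samples, batch_size)
wrong_rate = rate_from_batch_count(num_batches)
right_rate = rate_from_dataset(batch_size, dataset_size)
inflation = wrong_rate / right_rate  # how much the sampling rate is overstated

print(f"batches per epoch:      {num_batches}")
print(f"rate from batch count:  {wrong_rate}")
print(f"rate from dataset size: {right_rate}")
print(f"inflation factor:       {inflation:.0f}x")
```

With these inputs the batch-count formula overstates the sampling rate by a factor of several hundred, because the sampler's `num_samples` is much smaller than the dataset it draws from.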
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D100263104. (Because this pull request was imported automatically, there will not be any future comments.)
Types of changes
Motivation and Context / Related issue
Fixes #813
When `DataLoader` uses `WeightedRandomSampler`, privacy accounting was silently broken. The `sample_rate` was computed from `len(data_loader)` (number of batches) instead of the true dataset size, causing the privacy budget to burn 100x-1000x faster than expected.
Root Cause
- `sample_rate = 1 / len(data_loader)` was computed AFTER `_prepare_data_loader()` replaced the sampler
- `WeightedRandomSampler` with `num_samples=128` and `batch_size=16` on a 100k dataset:
  - Before the fix: `sample_rate = 1/8 = 0.125` (8 batches)
  - After the fix: `sample_rate = 16/100000 = 0.00016`
Fix
Capture `true_dataset_size` and `original_batch_size` BEFORE `_prepare_data_loader()` modifies the sampler, ensuring privacy accounting uses the actual dataset size.
How Has This Been Tested
- `0.00016` instead of `0.125`
- `16` instead of `12500`
Checklist