Replies: 3 comments
Converting to discussion. This is likely not a bug, but caused by your (quite unusual) choice of training parameters.
Hey @dxqb, thanks for your helpful reply! Admittedly, the cause may have been my poor mapping of Kohya configuration settings to the .json file, which I did by hand in these cases. That's why I also tried the sample "#SDXL 1.0" profile that comes with OneTrainer in the repo, but that also led to the NaN problem. I then looked for sample configurations and pulled a couple of them down from Reddit; both led to the same NaN issue. I also tried different base models, optimizers, etc. VRAM usage was at about half of the card's capacity when the NaN message occurred.
EDIT: So, even though I've used my original dataset with other tools to train SDXL, Pony and Wan LoRAs (with minor modifications for each), for some reason it's causing a problem for OneTrainer.
Update 2: The NaN occurs when: Number of Concept (image+text file) pairs >= Training-Local Batch Size
So, with Batch size = 4, it trains fine with three of my image+text pairs in the Concept. But if I add another pair, the NaN occurs when attempting to train. Changing the Local Batch Size value doesn't change that condition: with Batch size = 6, it trains with up to five image+text pairs in the Concept, but the sixth pair leads to the NaN condition.
Looking into the images themselves...
After more experimenting, it seems that the underlying issue was related to the image files in my dataset. The files were PNGs enhanced in Topaz Photo, with sizes ranging from 1MB to 6MB. I re-saved those originals as PNG with a different program, which reduced the range of sizes to 0.9MB - 1.7MB and stripped all extra metadata. And now OneTrainer no longer throws NaN errors for my dataset. Was it their sizes or something unexpected in the PNG format that was added by Topaz Photo? Not sure, but I could experiment more to check, if curious.
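One way to narrow down the open question above (file size vs. something extra in the PNG container) would be to compare the chunk layout of an original file with its re-saved copy, and to test a copy where only the extra chunks are removed and the pixel data is left untouched. The following is a minimal sketch using only the Python standard library. It is not part of OneTrainer, and the names `iter_chunks`, `chunk_summary` and `strip_ancillary` are made up for illustration. It also assumes well-formed PNG files and does not verify existing CRCs.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Chunk types kept by strip_ancillary(): the four critical PNG chunks plus
# tRNS, which is ancillary but carries transparency information.
# Assumption: everything else (text, EXIF, ICC profile, gamma, ...) is "extra".
# Note that dropping colour chunks such as iCCP or gAMA can change how a
# viewer interprets colours, even though the stored pixel values are the same.
KEEP = {b"IHDR", b"PLTE", b"IDAT", b"IEND", b"tRNS"}


def iter_chunks(data: bytes):
    """Yield (chunk_type, chunk_payload) for each chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        # Each chunk is: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        yield ctype, payload
        pos += 12 + length
        if ctype == b"IEND":
            break


def chunk_summary(data: bytes) -> dict:
    """Return {chunk type: total payload bytes}, in order of first appearance."""
    summary = {}
    for ctype, payload in iter_chunks(data):
        name = ctype.decode("ascii")
        summary[name] = summary.get(name, 0) + len(payload)
    return summary


def strip_ancillary(data: bytes) -> bytes:
    """Rebuild the PNG keeping only chunk types in KEEP (no pixel re-encoding)."""
    out = [PNG_SIGNATURE]
    for ctype, payload in iter_chunks(data):
        if ctype in KEEP:
            crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
            out.append(struct.pack(">I", len(payload)) + ctype + payload
                       + struct.pack(">I", crc))
    return b"".join(out)
```

For example, printing `chunk_summary(Path(p).read_bytes())` for an original and a re-saved file shows which chunk types differ between them. Writing `strip_ancillary(...)` output to a copy of the dataset and training on that copy would indicate whether the extra chunks alone trigger the NaN, since the compressed image data (and therefore most of the file size) stays identical.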
What happened?
Hi, first-time OneTrainer user, installed using the install.bat on this system:
Windows 10, 64GB RAM, RTX-3090 (two GPUs, but only use one for OneTrainer)
I attempted to train a simple Illustrious SDXL character LoRA, reusing a pre-existing image/caption folder that had been used with other training tools. The OneTrainer UI started as expected and I selected the "#SDXL 1.0" configuration, modifying some options for this run with Illustrious.
After the "Start Training" button is selected, what appears to be pre-training setup looks OK, but then a fatal error occurs seconds after the first Step is started:
RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
I tried downgrading PyTorch to 2.7.1+cu128 and experienced the same symptom. Also tried to use the mostly default "#SDXL 1.0" configuration settings, but still experienced the same error result.
My guess is that I've done something fundamentally wrong here, but I wanted to report the bug just in case it is more general.
(startup+error log output pasted into the below field, with debug+config files attached)
What did you expect would happen?
I expected the step and epoch progress bars to move forward to completion, with the trained per-epoch files as a result.
Relevant log output
Generate and upload debug_report.log
config.json
config_diff.txt
debug_report.log