What happened?
Hi, I'm a first-time OneTrainer user. I installed it using install.bat on this system:
Windows 10, 64 GB RAM, RTX 3090 (two GPUs installed, but only one is used for OneTrainer)
I attempted to train a simple Illustrious SDXL character LoRA. I reused a pre-existing image/caption folder that had been used with other training tools. The OneTrainer UI starts as expected, and I selected the "#SDXL 1.0" configuration, modifying some options for this run with Illustrious.
After I click the "Start Training" button, what appears to be the pre-training setup looks OK, but a fatal error occurs seconds after the first step starts:
RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
I tried downgrading PyTorch to 2.7.1+cu128 and saw the same error. I also tried the mostly default "#SDXL 1.0" configuration settings, with the same result.
My guess is that I've done something fundamentally wrong here, but I wanted to report the bug in case it is a more general problem.
(startup+error log output pasted into the below field, with debug+config files attached)
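For reference, since the error message mentions "precision issues", here is a minimal sketch of one way a loss value can become NaN. It is not OneTrainer's code, and I don't know whether this is the cause in my run. The names `to_fp16_range` and `check_loss` are hypothetical, and the half-precision overflow is only crudely imitated with ordinary Python floats.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16_range(x: float) -> float:
    """Crudely imitate fp16 overflow: magnitudes above FP16_MAX become infinity."""
    if math.isnan(x):
        return x
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x


def check_loss(loss: float) -> None:
    """Hypothetical guard, similar in spirit to the check that raised the error."""
    if math.isnan(loss):
        raise RuntimeError("Training loss became NaN.")


a = to_fp16_range(70000.0)  # exceeds the fp16 range, becomes +inf
b = to_fp16_range(80000.0)  # exceeds the fp16 range, becomes +inf
loss = a - b                # inf - inf is NaN

try:
    check_loss(loss)
    raised = False
except RuntimeError:
    raised = True
```

If something like this is happening, the NaN would first appear wherever an intermediate value overflows, not necessarily in the loss computation itself.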
What did you expect would happen?
I expected the step and epoch progress bars to advance to completion, producing the trained per-epoch files.
Relevant log output
activating venv D:\OneTrainer\venv
Using Python "D:\OneTrainer\venv\Scripts\python.exe" -X utf8
HF_HUB_DISABLE_XET=1
NOTE: Xet disabled, to enable it set as 0 before launch
Checking Python version...
Python 3.11.9
Starting UI...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931588.588876 84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931590.939255 84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[INFO] Default settings file saved to E:\temp\tmpyukzbxje\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip
Clearing cache directory D:/OneTrainer/workspace-cache/run! You can disable this if you want to continue using the same cache.
D:\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931705.443690 79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931707.931350 79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Selected layers: 722
Deselected layers: 72
Note: Enable Debug mode to see the full list of layer names
enumerating sample paths: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 199.80it/s]
enumerating sample paths: 0%| | 0/1 [00:00<?, ?it/s]Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all | 0/35 [00:00<?, ?it/s]
TensorBoard 2.20.0 at http://localhost:6006/ (Press CTRL+C to quit)
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00, 5.64it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 22.52it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 30/30 [00:10<00:00, 2.79it/s]
step: 0%| | 0/35 [01:24<?, ?it/s]
epoch: 0%| | 0/10 [01:33<?, ?it/s]
Traceback (most recent call last):
File "D:\OneTrainer\modules\ui\TrainUI.py", line 719, in __training_thread_function
trainer.train()
File "D:\OneTrainer\modules\trainer\GenericTrainer.py", line 796, in train
raise RuntimeError("Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.")
RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
[INFO] Default settings file saved to C:\temp\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip
Generate and upload debug_report.log
config.json
config_diff.txt
debug_report.log