[Bug]: "RuntimeError: Training loss became NaN" #1380

Description

@ooofest

What happened?

Hi, I'm a first-time OneTrainer user. I installed it using install.bat on this system:

Windows 10, 64 GB RAM, RTX 3090 (two GPUs, but only one is used for OneTrainer)

I attempted to train a simple Illustrious SDXL character LoRA. I reused a pre-existing image/caption folder that had been used with other training tools. The OneTrainer UI starts as expected, and I selected the "#SDXL 1.0" configuration, modifying some options for this run with Illustrious.

After I click the "Start Training" button, the pre-training setup appears to complete normally, but a fatal error occurs seconds after the first step starts:

RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
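
For readers unfamiliar with this kind of failure, the error is the sort a trainer raises when it checks each step's loss and finds it is not a number. The following is only a minimal sketch of such a guard, not OneTrainer's actual code; `check_loss` and the sample `losses` list are hypothetical names and values made up for illustration.

```python
import math


def check_loss(loss_value: float, step: int) -> None:
    """Hypothetical guard: stop training as soon as the loss is NaN."""
    if math.isnan(loss_value):
        raise RuntimeError(f"Training loss became NaN at step {step}.")


# Made-up loss values; the third one simulates a diverged step.
losses = [0.12, 0.11, float("nan")]

failed_step = None
for step, loss in enumerate(losses):
    try:
        check_loss(loss, step)
    except RuntimeError:
        # Record where training would have stopped.
        failed_step = step
        break
```

In the run reported here the check fired during the very first step, so no earlier healthy loss values were logged.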

I tried downgrading PyTorch to 2.7.1+cu128 and saw the same symptom. I also tried the mostly default "#SDXL 1.0" configuration settings, but got the same error.
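
The error message lists precision issues as one possible cause. As a rough illustration of how that can happen in general (an assumption about the mechanism, not a diagnosis of this run), the sketch below uses plain Python floats: a value that overflows the representable range becomes infinity, and subtracting infinity from infinity yields NaN. Half precision behaves the same way but overflows much sooner, since the largest finite float16 value is 65504.

```python
import math

# A value near the top of the float64 range.
big = 1e308

# Multiplying past the representable range overflows to infinity.
overflowed = big * 10

# inf - inf has no defined value, so the result is NaN.
diff = overflowed - overflowed

# Once a NaN appears, it propagates through later arithmetic,
# for example when averaging per-sample losses.
mean_loss = (0.1 + diff) / 2
```

Because NaN propagates, a single overflowed intermediate value is enough to turn the reported loss into NaN.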

My guess is that I've done something fundamentally wrong here, but I wanted to report the bug in case the problem is more general.
(The startup and error log output is pasted into the field below, with the debug and config files attached.)

What did you expect would happen?

I expected the step and epoch progress bars to move forward to completion, with the trained per-epoch files as a result.

Relevant log output

activating venv D:\OneTrainer\venv
Using Python "D:\OneTrainer\venv\Scripts\python.exe" -X utf8
HF_HUB_DISABLE_XET=1

NOTE: Xet disabled, to enable it set as 0 before launch
Checking Python version...
Python 3.11.9

Starting UI...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931588.588876   84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931590.939255   84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[INFO] Default settings file saved to E:\temp\tmpyukzbxje\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip
Clearing cache directory D:/OneTrainer/workspace-cache/run! You can disable this if you want to continue using the same cache.
D:\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931705.443690   79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931707.931350   79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Selected layers: 722
Deselected layers: 72
Note: Enable Debug mode to see the full list of layer names
enumerating sample paths: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 199.80it/s]
enumerating sample paths:   0%|                                                                  | 0/1 [00:00<?, ?it/s]Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all       | 0/35 [00:00<?, ?it/s]
TensorBoard 2.20.0 at http://localhost:6006/ (Press CTRL+C to quit)
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.64it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 22.52it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 30/30 [00:10<00:00,  2.79it/s]
step:   0%|                                                                                     | 0/35 [01:24<?, ?it/s]
epoch:   0%|                                                                                    | 0/10 [01:33<?, ?it/s]
Traceback (most recent call last):
  File "D:\OneTrainer\modules\ui\TrainUI.py", line 719, in __training_thread_function
    trainer.train()
  File "D:\OneTrainer\modules\trainer\GenericTrainer.py", line 796, in train
    raise RuntimeError("Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.")
RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
[INFO] Default settings file saved to C:\temp\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip

Generate and upload debug_report.log

config.json
config_diff.txt
debug_report.log
