[Bug]: "RuntimeError: Training loss became NaN" #1384

ooofest · 2026-03-19T15:58:16Z

What happened?

Hi, first-time OneTrainer user, installed using the install.bat on this system:

Windows 10, 64GB RAM, RTX-3090 (two GPUs, but only use one for OneTrainer)

I attempted to train a simple Illustrious SDXL character LoRA. I reused a pre-existing image/caption folder that had been used with other training tools. The OneTrainer UI starts as expected and I selected the "#SDXL 1.0" configuration, modifying some options for this run with Illustrious.

After I click the "Start Training" button, the pre-training setup appears to complete normally, but a fatal error occurs seconds after the first step starts:

RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.

I tried downgrading PyTorch to 2.7.1+cu128 and saw the same symptom. I also tried the mostly default "#SDXL 1.0" configuration settings, but still got the same error.

My guess is that I've done something fundamentally wrong here, but I wanted to report the bug in case it is a more general problem.
(Startup and error log output is pasted into the field below, with debug and config files attached.)

What did you expect would happen?

I expected the step and epoch progress bars to move forward to completion, with the trained per-epoch files as a result.

Relevant log output

activating venv D:\OneTrainer\venv
Using Python "D:\OneTrainer\venv\Scripts\python.exe" -X utf8
HF_HUB_DISABLE_XET=1

NOTE: Xet disabled, to enable it set as 0 before launch
Checking Python version...
Python 3.11.9

Starting UI...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931588.588876   84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931590.939255   84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[INFO] Default settings file saved to E:\temp\tmpyukzbxje\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip
Clearing cache directory D:/OneTrainer/workspace-cache/run! You can disable this if you want to continue using the same cache.
D:\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931705.443690   79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931707.931350   79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Selected layers: 722
Deselected layers: 72
Note: Enable Debug mode to see the full list of layer names
enumerating sample paths: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 199.80it/s]
enumerating sample paths:   0%|                                                                  | 0/1 [00:00<?, ?it/s]Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all       | 0/35 [00:00<?, ?it/s]
TensorBoard 2.20.0 at http://localhost:6006/ (Press CTRL+C to quit)
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.64it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 22.52it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 30/30 [00:10<00:00,  2.79it/s]
step:   0%|                                                                                     | 0/35 [01:24<?, ?it/s]
epoch:   0%|                                                                                    | 0/10 [01:33<?, ?it/s]
Traceback (most recent call last):
  File "D:\OneTrainer\modules\ui\TrainUI.py", line 719, in __training_thread_function
    trainer.train()
  File "D:\OneTrainer\modules\trainer\GenericTrainer.py", line 796, in train
    raise RuntimeError("Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.")
RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
[INFO] Default settings file saved to C:\temp\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip

Generate and upload debug_report.log

config.json
config_diff.txt
debug_report.log

dxqb · 2026-03-19T20:56:01Z

Collaborator

Converting to discussion: this is likely not a bug but caused by your (quite unusual) choice of training parameters.
You can continue to discuss those here, or join our Discord and ask in #help.

ooofest · 2026-03-20T01:58:30Z

Author

Hey @dxqb, thanks for your helpful reply! Admittedly, the cause may have been my poor hand-mapping of Kohya configuration settings to the .json config in those cases. That's why I also tried the sample #SDXL 1.0 profile that ships with OneTrainer in the repo, but that also led to the NaN problem.

I also looked for sample configurations and pulled a couple of them down from Reddit. Both led to the same NaN issue.

I also tried different base models, optimizers, etc. VRAM usage was at about half of the card's capacity when the NaN message occurred.

EDIT:
Update 1:
High-level issue found . . .
I pulled down some sample Concept training data from a training article, also for a character LoRA, and hit no NaN problems on this system. Switching back to my original dataset, the NaN error occurs once again.

So, even though I've used my original dataset with other tools to train SDXL, Pony and Wan LoRAs (with minor modifications for each), for some reason it's causing a problem for OneTrainer.

Update 2:
Recaptioning the images made no difference. The NaN can be reproduced if:

Number of Concept (image+text file) pairs >= Training-Local Batch Size

So, with Batch size = 4, it would train fine with three of my image+text pairs in the Concept. But if I added another pair, the NaN occurred when attempting to train. Changing the Local Batch Size value doesn't change that condition: if I set Batch size = 6, it will train with up to five image+text pairs in the Concept, but the sixth pair leads to the NaN condition. Looking into the images themselves . . .

ooofest · 2026-03-20T05:53:36Z

Author

After more experimenting, it seems that the underlying issue was related to the image files in my dataset.

The files were PNGs that had been enhanced in Topaz Photo, with sizes ranging from 1MB to 6MB.

I resaved those originals as PNG with a different program, which reduced the size range to 0.9MB - 1.7MB and stripped all extra metadata.
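
For anyone who wants to try only the metadata-stripping part of that re-save without a GUI tool, here is a minimal standard-library sketch. It is an assumption-laden approximation, not the procedure used above: the re-save described here also re-encoded the image data, which this code does not do. The function name `strip_png_metadata` and the `KEEP` set are hypothetical. It copies the critical PNG chunks (plus `tRNS`) verbatim and drops everything else, so pixel values are unchanged, although removing colour chunks such as `iCCP` or `gAMA` can change how a viewer interprets them.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Chunks to keep: the critical ones, plus tRNS because it carries transparency.
# Everything else (tEXt, iTXt, eXIf, iCCP, gAMA, ...) is dropped.
KEEP = {b"IHDR", b"PLTE", b"IDAT", b"IEND", b"tRNS"}


def strip_png_metadata(data: bytes) -> bytes:
    """Return a copy of a PNG byte string with non-essential chunks removed."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    out = [PNG_SIGNATURE]
    pos = 8
    # Each chunk is: 4-byte length, 4-byte type, payload, 4-byte CRC.
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        end = pos + 12 + length
        if ctype in KEEP:
            out.append(data[pos:end])  # copied verbatim, CRC included
        pos = end
        if ctype == b"IEND":
            break
    return b"".join(out)
```

Writing the result to a new file (for example with `pathlib.Path.write_bytes`) and training on a copy of the dataset would show whether metadata alone makes a difference, keeping the originals untouched.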

And now OneTrainer no longer throws NaN errors for my dataset. Was it the file sizes, or something unexpected that Topaz Photo added to the PNG format? I'm not sure, but I could experiment more to check if anyone is curious.
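
One way to narrow that question down would be to compare the structure of an original file against its re-saved copy. The sketch below uses only the Python standard library and reports the bit depth, colour type and chunk types of a PNG; the names `iter_png_chunks` and `describe_png` are hypothetical. Whether any of these properties actually triggers the NaN is unknown. It only surfaces differences (for example 16-bit depth, or ancillary chunks such as `iCCP`, `eXIf` or `tEXt`) that could then be tested one at a time.

```python
import struct
from collections import Counter

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(data: bytes):
    """Yield (chunk_type, payload) for each chunk in a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    # Each chunk is: 4-byte length, 4-byte type, payload, 4-byte CRC.
    while pos + 12 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype.decode("latin-1"), data[pos + 8:pos + 8 + length]
        pos += 12 + length
        if ctype == b"IEND":
            break


def describe_png(data: bytes) -> dict:
    """Summarise header fields and chunk types of a PNG byte string."""
    counts = Counter()
    info = {}
    for ctype, payload in iter_png_chunks(data):
        counts[ctype] += 1
        if ctype == "IHDR":
            width, height, bit_depth, color_type = struct.unpack(">IIBB", payload[:10])
            info.update(width=width, height=height,
                        bit_depth=bit_depth, color_type=color_type)
    info["chunks"] = dict(counts)
    # A lowercase first letter marks a chunk as ancillary (not needed to decode).
    info["ancillary"] = sorted(name for name in counts if name[0].islower())
    return info
```

Calling `describe_png(pathlib.Path(name).read_bytes())` on a failing original and on its working re-save, then diffing the two dictionaries, would be a reasonable first experiment.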
