[Bug]: "RuntimeError: Training loss became NaN" #1380

### What happened?

Hi, first-time OneTrainer user here. I installed using install.bat on this system:

Windows 10, 64GB RAM, RTX-3090 (two GPUs, but only one is used for OneTrainer)

I attempted to train a simple Illustrious SDXL character LoRA, reusing an existing image/caption folder that had already been used with other training tools. The OneTrainer UI starts as expected. I selected the "#SDXL 1.0" configuration and modified some options for this run with Illustrious.

After I click the "Start Training" button, what appears to be the pre-training setup looks OK, but a fatal error occurs seconds after the first step starts:

**RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.**

I tried downgrading PyTorch to 2.7.1+cu128 and saw the same symptom. I also tried the mostly default "#SDXL 1.0" configuration settings, with the same error.

My guess is that I've done something fundamentally wrong here, but I wanted to report the bug in case it is a more general problem.
_(startup and error log output pasted into the field below, with debug and config files attached)_
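
For anyone triaging, here is a minimal sketch of the kind of guard that produces this error, plus a helper for finding the first non-finite value in a list of numbers. It is not OneTrainer's actual code: `check_loss` and `first_non_finite` are hypothetical names, and the overflow example is only one common route to NaN that may or may not apply to this run.

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN/inf entry, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


def check_loss(loss, step):
    """Hypothetical guard: raise when the loss is NaN, otherwise pass it through."""
    if math.isnan(loss):
        raise RuntimeError(f"Training loss became NaN at step {step}")
    return loss


# One common route to NaN: a value overflows to inf, then inf - inf is computed.
overflowed = float("inf")
nan_loss = overflowed - overflowed

try:
    check_loss(nan_loss, step=0)
except RuntimeError as err:
    print(err)
```

If something like this applied here, the first thing to check would be whether the inputs to the loss are already non-finite before the first step.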

### What did you expect would happen?

I expected the step and epoch progress bars to move forward to completion, with the trained per-epoch files as a result.

### Relevant log output

```shell
activating venv D:\OneTrainer\venv
Using Python "D:\OneTrainer\venv\Scripts\python.exe" -X utf8
HF_HUB_DISABLE_XET=1

NOTE: Xet disabled, to enable it set as 0 before launch
Checking Python version...
Python 3.11.9

Starting UI...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931588.588876   84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931590.939255   84084 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[INFO] Default settings file saved to E:\temp\tmpyukzbxje\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip
Clearing cache directory D:/OneTrainer/workspace-cache/run! You can disable this if you want to continue using the same cache.
D:\OneTrainer\venv\Lib\site-packages\tensorboard\default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931705.443690   79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1773931707.931350   79756 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Selected layers: 722
Deselected layers: 72
Note: Enable Debug mode to see the full list of layer names
enumerating sample paths: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 199.80it/s]
enumerating sample paths:   0%|                                                                  | 0/1 [00:00<?, ?it/s]Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all       | 0/35 [00:00<?, ?it/s]
TensorBoard 2.20.0 at http://localhost:6006/ (Press CTRL+C to quit)
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.64it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 22.52it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 30/30 [00:10<00:00,  2.79it/s]
step:   0%|                                                                                     | 0/35 [01:24<?, ?it/s]
epoch:   0%|                                                                                    | 0/10 [01:33<?, ?it/s]
Traceback (most recent call last):
  File "D:\OneTrainer\modules\ui\TrainUI.py", line 719, in __training_thread_function
    trainer.train()
  File "D:\OneTrainer\modules\trainer\GenericTrainer.py", line 796, in train
    raise RuntimeError("Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.")
RuntimeError: Training loss became NaN. This may be due to invalid parameters, precision issues, or a bug in the loss computation.
[INFO] Default settings file saved to C:\temp\default_settings.json
[INFO] Debug package saved to OneTrainer_debug_report.zip
```

### Generate and upload debug_report.log

[config.json](https://github.com/user-attachments/files/26120888/config.json)
[config_diff.txt](https://github.com/user-attachments/files/26120886/config_diff.txt)
[debug_report.log](https://github.com/user-attachments/files/26120887/debug_report.log)
