New model: Microsoft Lens by dxqb · Pull Request #1510 · Nerogar/OneTrainer

dxqb · 2026-06-06T12:33:40Z

Test in preview branch: https://github.com/Nerogar/OneTrainer/tree/preview

Summary

This PR adds Microsoft Lens: https://huggingface.co/microsoft/Lens

I'll leave this here as a draft for anyone who wants to experiment with this model, because it'll take a while until merge.

Lens is not part of diffusers yet: [New Model] Add LensPipeline （microsoft/Lens） huggingface/diffusers#13837
It uses Microsoft's upstream code instead: https://github.com/microsoft/Lens
That currently requires transformers v5. Transformers v5 breaks other models: Upgrade transformers to 5.9 #1506

Open points:

code review incomplete, there is still unchecked AI code
Claude: Open point — timestep-shift schedule doesn't match the paper's training μ (§2.3)

Includes #1506 and #1509

Test plan

pre-commit run --all-files passes
Launched the affected UI or script and exercised the change
Tested with at least one real preset / config when relevant (note which: Lens)

AI assistance

Early AI prototype — opened for discussion, not ready for review

…model composition in ModelType

- Gradient checkpointing and layer offloading are now configured per component (text encoder, transformer, VAE) rather than globally
- ModelType centralizes model composition and training method associations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Introduces OnDemandModule, a persistent delegating proxy for text encoders that must be loaded on demand and freed after use rather than parked on the CPU temp device. Adds load_on_demand per-component config and four text_encoder_N_on_demand() resolvers in TrainConfig.

BaseModel.to(device) is removed as an abstract method; release() is now the sole abstract method for parking a model. Each concrete model reads self.train_config.temp_device directly. Call sites in modelSetup, dataLoader, trainer, and SampleWindow are updated to model.release().

Co-Authored-By: dxqb <183307934+dxqb@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
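
To illustrate what a persistent delegating proxy with load-on-demand and release() semantics could look like, here is a minimal sketch. It is not OneTrainer's OnDemandModule: the names `OnDemandProxy`, `loader`, `FakeTextEncoder`, and `load_encoder` are hypothetical, and plain Python objects stand in for real torch modules.

```python
from typing import Any, Callable, Optional


class OnDemandProxy:
    """Hypothetical sketch: keeps a loader around instead of the module itself."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader
        self._module: Optional[Any] = None

    @property
    def is_loaded(self) -> bool:
        return self._module is not None

    def load(self) -> Any:
        # Load lazily and keep the instance until release() is called.
        if self._module is None:
            self._module = self._loader()
        return self._module

    def release(self) -> None:
        # Drop the reference so the wrapped object can be freed.
        self._module = None

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal attribute lookup fails.
        # Private names are never delegated, which avoids recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.load(), name)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.load()(*args, **kwargs)


class FakeTextEncoder:
    """Stand-in for a large text encoder (assumption for this sketch)."""

    hidden_size = 8

    def __call__(self, prompt: str) -> list:
        return [len(prompt)] * self.hidden_size


loads = []  # records every time the loader actually runs


def load_encoder() -> FakeTextEncoder:
    loads.append("loaded")
    return FakeTextEncoder()


encoder = OnDemandProxy(load_encoder)
embedding = encoder("a photo")  # first use triggers the load
encoder.release()               # free it again after use
```

The proxy object itself stays alive for the whole run, so callers can keep a reference to it; only the wrapped object comes and goes.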


Silvicultor · 2026-06-07T09:37:39Z

Did a bit of testing with this PR:
What works:

Basic LoRA training.
Sampling.

What doesn’t work:

torch.compile: unlike a similar bug in the Anima PR, this one is already triggered by compiling the transformer blocks (not the optimizer). The error log is attached below.
Loading created LoRA weights in ComfyUI.

Loss starts at a fairly high level (~0.6). A longer run would be needed to see whether that is a problem. Before doing a longer run, though, I would want common inference tools to support the LoRA format.

Traceback (most recent call last):
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1883, in _compile
    raise_unimplemented_cache_limit_exceeded()
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1867, in raise_unimplemented_cache_limit_exceeded
    unimplemented(
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_dynamo/exc.py", line 657, in unimplemented
    raise Unsupported(msg, gb_type, skip_frame)
torch._dynamo.exc.Unsupported: Dynamo recompile limit exceeded
  Explanation: Dynamo attempted to recompile the code object too many times, exceeding the recompile_limit cache size limit (currently set to 8). Excessive recompilations can degrade performance due to the compilation overhead of each recompilation.
  Hint: To monitor recompilations, enable TORCH_LOGS=recompiles. If recompilations are expected, consider increasing torch._dynamo.config.recompile_limit to an appropriate value.
  Hint: See https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html for tips on dealing with recompilations.

  Developer debug context: Limit type: recompile_limit

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0039.html

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/AI/OT_Lens/OneTrainer/modules/ui/TrainUI.py", line 739, in __training_thread_function
    trainer.train()
  File "/home/user/AI/OT_Lens/OneTrainer/modules/trainer/GenericTrainer.py", line 756, in train
    loss.backward()
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward
    torch.autograd.backward(
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 379, in backward
    _engine_run_backward(
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 882, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 314, in backward
    outputs = ctx.run_function(*detached_inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/modules/util/checkpointing_util.py", line 124, in __checkpointing_forward
    output = self.orig_forward(*args) if self.checkpoint is None else self.checkpoint(*args)
                                                                      ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1047, in compile_wrapper
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 2474, in __call__
    result = self._torchdynamo_orig_backend(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 736, in __call__
    result = _compile(
             ^^^^^^^^^
  File "/home/user/AI/OT_Lens/OneTrainer/venv/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1890, in _compile
    raise FailOnRecompileLimitHit(
torch._dynamo.exc.FailOnRecompileLimitHit: Hard failure due to fullgraph=True
Creating Backup workspace/run/backup/2026-06-07_11-12-09-backup-3-0-3
Saving models/model.safetensors

dxqb · 2026-06-07T11:34:29Z

> Did a bit of testing with this PR

torch.compile: that's a torch bug, introduced in torch 2.12: pytorch/pytorch#186537. It will need a workaround for as long as we use torch 2.12. You could downgrade to torch 2.11 in the meantime.
Comfy: Lens is new. When I wrote this code, Comfy didn't support the model at all yet. I will take care of that when the PR is close to merge.
High loss: that's normal for the Flux2 VAE, which Lens uses.

Since torch 2.12, dynamo config overrides are stored in ContextVars (pytorch/pytorch#173568), which threads don't inherit. The recompile limit raised by init_compile() therefore no longer applied to the autograd worker threads, which recompute checkpointed modules during the backward pass - training crashed with FailOnRecompileLimitHit once enough input shapes accumulated, despite the raised limit.

Call init_compile() in OffloadCheckpointLayer.__checkpointing_forward so the recomputation thread sets its own config. The non-offload CheckpointLayer is unaffected: it is compiled as a whole, so its recomputation runs inside the compiled backward without re-entering dynamo. The Mod.eval patch is applied once at import time instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
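
The thread behaviour described in this commit message can be reproduced with the standard library alone. The sketch below uses a stand-in ContextVar named `recompile_limit` with illustrative values (8 and 64); it does not touch torch, and `raise_limit` is a hypothetical analogue of a per-thread setup call. It assumes a standard CPython build, where a new thread starts with an empty context.

```python
import contextvars
import threading

# Stand-in for a config value kept in a ContextVar (illustrative only).
recompile_limit = contextvars.ContextVar("recompile_limit", default=8)


def raise_limit() -> None:
    # Hypothetical analogue of a setup call that raises the limit.
    recompile_limit.set(64)


def worker(results: dict, key: str, setup=None) -> None:
    # Optionally repeat the setup inside the thread, then record what it sees.
    if setup is not None:
        setup()
    results[key] = recompile_limit.get()


raise_limit()  # raised in the main thread only
results = {"main": recompile_limit.get()}

threads = [
    threading.Thread(target=worker, args=(results, "thread_without_setup")),
    threading.Thread(target=worker, args=(results, "thread_with_setup", raise_limit)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results)
```

The thread that does not repeat the setup sees the default value, while the thread that calls the setup itself sees the raised value. That is the same shape as the workaround: redo the configuration in the thread that performs the recomputation.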


dxqb · 2026-06-07T12:30:42Z

> torch.compile: that's a torch bug, introduced in torch 2.12: pytorch/pytorch#186537 will need a workaround as long as we use torch 2.12. You could downgrade to torch 2.11 in the meantime.

Merged the bug workaround branch into this branch: #1511
I didn't see the error even before this workaround, so if you could re-test, that would be useful. I'm quite sure it's fixed, though.

Silvicultor · 2026-06-07T12:58:09Z

> > torch.compile: that's a torch bug, introduced in torch 2.12: pytorch/pytorch#186537 will need a workaround as long as we use torch 2.12. You could downgrade to torch 2.11 in the meantime.
>
> Merged the bug workaround branch into this branch #1511 I didn't see the error even before this workaround, so if you could re-test that would be useful. Quite sure it's fixed though.

Yes, the workaround helps. Now both the transformer blocks and the optimizer can be compiled.

Move the per-callsite checkpointing_or_offloading_enabled() guard into enable_checkpointing() itself, so every Base*Setup can call enable_checkpointing_for_* unconditionally. Also extend the central gate to allow a compile-only path (no checkpointing/offloading, but still per-layer torch.compile wrapping) when config.compile is set. Three direct diffusers enable_gradient_checkpointing() calls (SD/SDXL unet, Wuerstchen v2 prior) keep their explicit guard since they bypass this central mechanism.
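
As a rough illustration of moving a guard from every call site into one central function, including a compile-only path, consider the sketch below. The `LayerConfig` fields, `plan_layer_wrapping`, and the string "layers" are assumptions made for this example; the real enable_checkpointing() in modules/util/checkpointing_util.py has its own signature and wraps actual modules.

```python
from dataclasses import dataclass


@dataclass
class LayerConfig:
    """Hypothetical config; field names are assumptions for this sketch."""

    gradient_checkpointing: bool = False
    layer_offloading: bool = False
    compile: bool = False


def plan_layer_wrapping(config: LayerConfig) -> str:
    """Central gate: decide once which kind of wrapping applies."""
    if config.gradient_checkpointing or config.layer_offloading:
        return "checkpoint"
    if config.compile:
        # No checkpointing or offloading, but layers still get wrapped
        # so they can be compiled individually.
        return "compile_only"
    return "none"


def enable_checkpointing(layers: list, config: LayerConfig) -> list:
    """Safe to call unconditionally: returns the layers untouched if nothing applies."""
    mode = plan_layer_wrapping(config)
    if mode == "none":
        return list(layers)
    # Strings stand in for wrapped modules in this sketch.
    return [f"{mode}({layer})" for layer in layers]


blocks = ["block0", "block1"]
untouched = enable_checkpointing(blocks, LayerConfig())
compiled = enable_checkpointing(blocks, LayerConfig(compile=True))
checkpointed = enable_checkpointing(blocks, LayerConfig(gradient_checkpointing=True))
```

With the decision in one place, each setup class can call the function without its own `if` check, and a new path such as compile-only needs to be added only once.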


Mirrors Nerogar#1520 (Ernie): tokens_mask/text_encoder_hidden_state are fixed at PROMPT_MAX_LENGTH after EncodeLensText's crop, so caching wastes disk space on padding for shorter prompts. Prune before caching, pad back to PROMPT_MAX_LENGTH on load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
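
The prune-then-pad round trip can be sketched with plain Python lists. Everything here is an assumption for illustration: the value of `PROMPT_MAX_LENGTH`, the helper names `prune_for_cache` and `pad_after_load`, right-side padding, and zero-filled padding rows. The actual code works on tensors and may differ in all of these details.

```python
PROMPT_MAX_LENGTH = 8  # illustrative value; the real constant is not stated here


def prune_for_cache(hidden_state: list, tokens_mask: list) -> list:
    """Keep only the rows of real tokens (assumes a right-padded 1/0 mask)."""
    length = sum(tokens_mask)
    return hidden_state[:length]


def pad_after_load(pruned: list, hidden_size: int) -> tuple:
    """Restore the fixed length and rebuild the mask from the pruned length."""
    length = len(pruned)
    missing = PROMPT_MAX_LENGTH - length
    padding = [[0.0] * hidden_size for _ in range(missing)]
    tokens_mask = [1] * length + [0] * missing
    return pruned + padding, tokens_mask


# Three real tokens with hidden size 2, padded to PROMPT_MAX_LENGTH.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]] + [[0.0, 0.0] for _ in range(5)]
mask = [1, 1, 1, 0, 0, 0, 0, 0]

pruned = prune_for_cache(hidden, mask)            # what would be written to the cache
restored, restored_mask = pad_after_load(pruned, hidden_size=2)
```

In this example only 3 of 8 rows are stored, and the mask does not need to be stored at all because it can be rebuilt from the pruned length.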

Audit fixes applied during merge:

- BaseLensSetup.py: 3/4-arg autocast helpers (drop stale weight_list arg)
- Lens{FineTune,LoRA}Setup.py: latent_caching -> image_caching/text_caching
- LensBaseDataLoader.py: latent_caching -> text_caching
- BaseModelTabView/BaseTrainingTabView/TopBarController/BaseConvertModelUIView: wire up Lens UI (no text-encoder layer offloading, GPT-OSS is always on-demand/MXFP4)
- test/run_lora_presets.sh: add Lens LoRA preset
- requirements-global.txt: pin mgds to combined anima+lens+ideogram preview branch (6f6518a)

dxqb and others added 8 commits May 25, 2026 18:11

Merge branch 'master' into split-offload

5a41835

Revert "Revert "Upgrade transformers to 5.9 and huggingface-hub to 1.…

2f0620b

…16 (Nerogar#147…" This reverts commit 574ec55.

Merge branch 'upstream' into split-offload

0b4ddc4

Merge branch 'revert-1504-revert-transformers-v5' of https://github.c…

3a66eca

…om/Nerogar/OneTrainer into lens_base

Merge branch 'ondemand-base' into lens_base

cd122f4

Add Lens model (LoRA + Fine-Tune training + sampling)

0fffc53

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dxqb and others added 3 commits June 7, 2026 14:12

Refine comments on Mod.eval negative number handling

31285e8

Updated comments regarding the handling of negative numbers in Mod.eval and related issues.

Merge branch 'recompile-limit' into lens

92f612e

dxqb added the "preview" label (merged in the preview branch) on Jun 13, 2026

dxqb added 4 commits June 13, 2026 17:35

Merge branch 'upstream' into split-offload

15f40a8

Merge branch 'split-offload' into lens

18d6231

Make BaseLensSetup checkpointing call unconditional, matching other m…

cf5b106

…odels The gating now lives inside enable_checkpointing() after the split-offload merge, so the per-callsite guard here is redundant.

dxqb added a commit that referenced this pull request Jun 14, 2026

Merge PR #1510 (New model: Microsoft Lens) into preview

90a7245

dxqb mentioned this pull request Jun 14, 2026

Upgrade transformers to 5.9 #1506

Closed

dxqb and others added 8 commits June 18, 2026 01:20

Merge remote-tracking branch 'Nerogar/master' into lens_base

c8a267b

# Conflicts:
#	modules/modelSetup/BaseErnieSetup.py
#	modules/modelSetup/BaseWuerstchenSetup.py
#	modules/util/checkpointing_util.py
#	training_presets/#flux2 LoRA 8GB.json

Merge branch 'lens_base' into lens

9073a01

# Conflicts:
#	modules/modelSetup/BaseErnieSetup.py
#	modules/modelSetup/BaseWuerstchenSetup.py

Use decorator form for factory.register() in Lens model files

58879e8

Mirrors upstream commit 75a44d2, which converted the rest of the codebase from the trailing factory.register() call to the @factory.register decorator form.
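
For readers unfamiliar with the two registration styles, the following sketch contrasts a trailing call with the decorator form. The `Factory` class, its `register(key)` signature, and the `Example*` class names are hypothetical; OneTrainer's real factory takes different arguments.

```python
class Factory:
    """Hypothetical registry; not OneTrainer's actual factory API."""

    def __init__(self):
        self._registry = {}

    def register(self, key: str):
        # Returns a decorator so the same method serves both styles.
        def decorator(cls):
            self._registry[key] = cls
            return cls

        return decorator

    def create(self, key: str, *args, **kwargs):
        return self._registry[key](*args, **kwargs)


factory = Factory()


# Trailing-call style: the class is defined first and registered afterwards,
# so the registration can drift away from the definition it belongs to.
class ExampleFineTuneSetup:
    pass


factory.register("example_fine_tune")(ExampleFineTuneSetup)


# Decorator style: registration sits directly on the class definition.
@factory.register("example_lora")
class ExampleLoRASetup:
    pass


lora_setup = factory.create("example_lora")
fine_tune_setup = factory.create("example_fine_tune")
```

Both styles end up with the same registry contents; the decorator form mainly makes it harder to forget or misplace the registration.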

Merge branch 'master' into lens

fec3db0

Merge remote-tracking branch 'Nerogar/master' into lens_base

9511826

# Conflicts:
#	modules/modelLoader/mixin/HFModelLoaderMixin.py
#	requirements-global.txt

Merge branch 'lens_base' into lens

4970f46

Merge remote-tracking branch 'origin/lens' into lens

f27ce27

dxqb wants to merge 23 commits into Nerogar:master from dxqb:lens
