Two non-blocking issues with stable diffusion v2 pre-training template (AKS) #493

@mascharkh

There are two non-blocking issues with the above template; both seem to be addressable in other parts of the stack.

  1. In the 01_preprocessing notebook, when running the preprocess.py script, two progress bars stay at 0% even though the job completes and reports Finished in 94.77364039421082 seconds. It would be better to suppress this misleading output.
Parquet Files Sample 0: 0%
0/1 [00:02<?, ?it/s]
2025-10-20 09:21:06,687	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-10-20_08-27-54_702051_216/logs/ray-data
2025-10-20 09:21:06,688	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> LimitOperator[limit=5] -> ActorPoolMapOperator[MapBatches(SDTransformer)] -> ActorPoolMapOperator[MapBatches(SDLatentEncoder)] -> TaskPoolMapOperator[Write]
(_MapWorker pid=826, ip=10.0.192.52) The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]id=826, ip=10.0.192.52) 
(_MapWorker pid=826, ip=10.0.192.52) Initialized SDLatentEncoder.
(_MapWorker pid=826, ip=10.0.192.52) Device: cpu
(_MapWorker pid=826, ip=10.0.192.52) Resolution: 512
Running: 1/24.0 CPU, 0/3.0 GPU, 44.2MB/17.8GB object_store_memory: 0%
0/1 [01:14<?, ?it/s]
  2. In the 02_train notebook, when running train.py, the job finishes and says it's writing checkpoints, and then we see this message:
✓ Successfully uploaded checkpoints to abfss://...
But then ... TypeError: object NoneType can't be used in 'await' expression
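
For reference, a minimal sketch of how this kind of TypeError typically arises: `await` is applied to the return value of a plain (non-async) function, which is None. This is not taken from train.py; the names `upload_checkpoints`, `upload_checkpoints_async`, and `maybe_await` are hypothetical and only illustrate the pattern and one possible guard.

```python
import asyncio
import inspect


def upload_checkpoints():
    """Hypothetical synchronous upload hook; like any plain function with no
    explicit return value, it returns None."""
    return None


async def upload_checkpoints_async():
    """Hypothetical async variant of the same hook."""
    return "uploaded"


async def finish_buggy():
    # The call itself succeeds, but awaiting its None result raises TypeError.
    await upload_checkpoints()


async def maybe_await(result):
    """Await `result` only if it is awaitable; otherwise return it unchanged."""
    if inspect.isawaitable(result):
        return await result
    return result


async def finish_guarded(hook):
    # Works whether `hook` is a sync function or a coroutine function.
    return await maybe_await(hook())


def buggy_raises_type_error():
    """Return True if the unguarded path fails with TypeError."""
    try:
        asyncio.run(finish_buggy())
    except TypeError:
        return True
    return False
```

If the failure in train.py follows this pattern, the upload has already completed by the time the exception is raised, which would match the success message appearing first. Guarding the `await` (or dropping it for a synchronous call) would be one way to silence it.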