Enhancing Video Colorization with Deep Learning: A Comprehensive Analysis of Training Loss Functions
This repository contains research on the use of deep neural networks for the automatic colorization of black-and-white videos. The approach extends image colorization techniques to video by employing an autoencoder with a U-Net-based architecture to predict denoised and colorized frames. Training was conducted on the DAVIS dataset, where several loss function combinations were tested to identify the optimal configuration for preserving object and structural integrity while achieving high-quality colorization.
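The loss combinations studied in the paper are weighted sums of individual terms. As a minimal sketch (using numpy rather than the repository's PyTorch code, and with illustrative weights), a two-term objective such as MSE + MAE can be combined like this:

```python
import numpy as np

def mse_loss(pred, target):
    # Mean squared error between predicted and ground-truth frames
    return float(np.mean((pred - target) ** 2))

def mae_loss(pred, target):
    # Mean absolute error between predicted and ground-truth frames
    return float(np.mean(np.abs(pred - target)))

def combined_loss(pred, target, weights=(1.0, 1.0)):
    # Weighted sum of two losses; equal weights here are an assumption,
    # not the configuration used in the paper
    return weights[0] * mse_loss(pred, target) + weights[1] * mae_loss(pred, target)

# Toy 2x2 "frames" in [0, 1]
pred = np.full((2, 2), 0.5)
target = np.full((2, 2), 0.7)
print(round(combined_loss(pred, target), 4))  # 0.04 (MSE) + 0.2 (MAE) = 0.24
```

The same pattern extends to three-term objectives (e.g. MSE + SSIM + Perceptual) by adding further weighted terms to the sum.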
The article can be found here: Enhancing Video Colorization with Deep Learning: A Comprehensive Analysis of Training Loss Functions
```
torch >= 1.13
torchvision >= 0.4
cuda >= 11.6
vit-pytorch >= 0.40.2
```
This section displays qualitative results through images of various loss function combinations, highlighting visual quality and artifacts. Quantitative results are summarized with metrics like SSIM, PSNR, and LPIPS, providing a detailed performance evaluation of each configuration.
| Gray Frame input | MAE + SSIM | MSE + SSIM | MAE + Content | MSE + Content |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
| Reference Frame | MAE + LPIPS | MSE + LPIPS | MAE + Perceptual | MSE + Perceptual |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
| Number of Loss Functions | Training Loss Functions | SSIM ↑ | PSNR ↑ | LPIPS ↓ |
|---|---|---|---|---|
| Single | MSE | 0.970 | 45.460 | 0.021 |
| Single | MAE | 0.970 | 44.225 | 0.023 |
| Two | MAE + SSIM | 0.967 | 41.944 | 0.024 |
| Two | MSE + SSIM | 0.973 | 48.099 | 0.021 |
| Two | MAE + Content | 0.972 | 46.999 | 0.022 |
| Two | MSE + Content | 0.953 | 33.981 | 0.041 |
| Two | MSE + LPIPS | 0.962 | 40.570 | 0.028 |
| Two | MAE + LPIPS | 0.965 | 42.585 | 0.026 |
| Two | MAE + Perceptual | 0.967 | 43.570 | 0.023 |
| Two | MSE + Perceptual | 0.970 | 46.496 | 0.020 |
| Three | MAE + SSIM + Perceptual | 0.975 | 49.532 | 0.026 |
| Three | MSE + SSIM + Perceptual | 0.975 | 49.532 | 0.026 |
| Three | MSE + LPIPS + SSIM | 0.974 | 49.132 | 0.024 |
| Three | MAE + LPIPS + SSIM | 0.968 | 44.883 | 0.023 |
| Three | MAE + LPIPS + Content | 0.961 | 39.772 | 0.030 |
| Three | MSE + LPIPS + Content | 0.970 | 47.040 | 0.019 |
| Three | MAE + SSIM + LPIPS | 0.966 | 40.183 | 0.026 |
| Three | MSE + SSIM + LPIPS | 0.974 | 48.691 | 0.024 |
| Three | MAE + SSIM + Style | 0.971 | 43.463 | 0.025 |
| Three | MSE + SSIM + Style | 0.974 | 47.653 | 0.023 |
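The PSNR values in the table follow directly from the MSE between the predicted and reference frames. A minimal sketch of the computation (assuming frames normalized to [0, 1]; the repository may use a different range or library implementation):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the reference
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(20 * np.log10(max_val) - 10 * np.log10(mse))

# Two toy frames differing by a constant offset of 0.1 -> MSE = 0.01
pred = np.zeros((4, 4))
target = np.full((4, 4), 0.1)
print(round(psnr(pred, target), 2))  # 20.0 dB
```

SSIM and LPIPS are typically computed with library implementations (e.g. `skimage.metrics.structural_similarity` and the `lpips` package) rather than by hand.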
The DAVIS 2017 (Densely Annotated VIdeo Segmentation) dataset is used both to train the model weights and to validate the results of the model.
The input for colorization inference must be a monochromatic video and an example frame (preferably from the same video).
The code resizes and normalizes the frames before predicting the color. At the end, the colorized video is saved in the videos_output folder.
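The resize-and-normalize step can be sketched as follows. This is a minimal illustration, not the repository's code: the 224×224 target size, the nearest-neighbour interpolation, and the [0, 1] normalization range are all assumptions.

```python
import numpy as np

def preprocess_frame(frame, size=(224, 224)):
    # Nearest-neighbour resize followed by [0, 1] normalization.
    # The target size and normalization range are assumptions; the
    # repository may use different values and interpolation.
    h, w = frame.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    resized = frame[rows][:, cols]
    return resized.astype(np.float32) / 255.0

# DAVIS-like grayscale frame (480p)
frame = np.random.randint(0, 256, (480, 854), dtype=np.uint8)
out = preprocess_frame(frame)
print(out.shape)  # (224, 224)
```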
To evaluate the model, execute main.py and set the variable str_dt to one of the model names in the trained_models folder.
Also, if you want to train your own model using the loss combinations, just run loop_train_all_losses.py; this script executes train.py for each loss combination defined in the criterions list.
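The loop over loss combinations can be sketched as below. The pairing of one pixel-wise loss (MSE or MAE) with one structural/perceptual loss mirrors the table above, but the `train` function here is a placeholder, not the repository's train.py:

```python
# Base losses from the paper's two-loss experiments
pixel_losses = ["MSE", "MAE"]
extra_losses = ["SSIM", "Content", "Perceptual", "LPIPS", "Style"]

def train(criterions):
    # Placeholder for launching train.py with a given loss combination;
    # here it just returns the combination's label
    return " + ".join(criterions)

runs = []
for pixel in pixel_losses:
    for extra in extra_losses:
        runs.append(train([pixel, extra]))

print(len(runs))  # 10 two-loss combinations
```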
```bibtex
@InProceedings{stival2024enhancing,
  author="Stival, Leandro and da Silva Torres, Ricardo and Pedrini, Helio",
  editor="Arai, Kohei",
  title="Enhancing Video Colorization with Deep Learning: A Comprehensive Analysis of Training Loss Functions",
  booktitle="Intelligent Systems and Applications",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="496--509",
  isbn="978-3-031-66329-1"
}
```