339 commits
e0f9de8
Move script methods into script
Jul 16, 2025
2449db4
Cleanup eval script a bit
Jul 16, 2025
9aff703
Fully separate CRPS and MAE plots for long time ranges
Jul 16, 2025
20a2ad8
Minor improvements to plotting
Jul 16, 2025
734279f
Update paths to environment for training scripts
Jul 21, 2025
a618c2f
Make label opt argument
Jul 21, 2025
5fc44e6
Add test scripts/configs for regression, diffusion, and generation
Jul 21, 2025
284f778
Update CI to include regression test script
Jul 21, 2025
995e888
Test script should not be srun if it's going to work with ci/cd
Jul 24, 2025
ffc1bed
Try to directly run from cscs.yml
Jul 24, 2025
ebc1043
plot diurnal cycles of precip amount and wet-hour frequency over time…
leuty Jul 24, 2025
7470c99
add mlflow logging
Jul 25, 2025
df360fb
Merge remote-tracking branch 'origin/corr-diff' into corr-diff
Jul 25, 2025
0f3b0c6
mlflow init bug fix
Jul 25, 2025
3bc4cae
add option to continue same run mlflow
Jul 25, 2025
6ff06da
add separate visualization frequency setting
Jul 25, 2025
06c185f
add plots of temperature and windspeed
leuty Jul 25, 2025
02ab40a
Add script to plot the 99th all-hour percentile
leuty Jul 25, 2025
c8a3f77
add visualization logging option for mlflow
Jul 25, 2025
3d1347a
enable mlflow logging to remote server
Jul 28, 2025
678fbe9
fix logging bug for average loss
Jul 28, 2025
e7d6c6b
update readme
Jul 28, 2025
6a63bcd
fix indents in readme
Jul 28, 2025
f9b8c87
simplify cscs.yml to try to get ci/cd to run
Jul 28, 2025
a72bbfa
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Jul 28, 2025
79e3112
logs
leuty Jul 29, 2025
7a11785
mask out sea points
leuty Jul 30, 2025
e234528
Convert to xarray, still buggy
leuty Jul 30, 2025
12702b6
comment
leuty Jul 31, 2025
b537630
fix bug in std
leuty Jul 31, 2025
1873b58
add patched diffusion configs
Jul 31, 2025
3c6853b
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Jul 31, 2025
79c6f6c
small fixes
leuty Jul 31, 2025
001d5e2
New script to plot maps (#18)
leuty Aug 4, 2025
8efcedd
use more xarray
leuty Aug 4, 2025
99a0661
add histograms
leuty Aug 4, 2025
477140b
add percentile lines
leuty Aug 5, 2025
f79f95f
add plot for probability of exceedance
leuty Aug 5, 2025
e631235
label
leuty Aug 5, 2025
bc40b00
Merge remote-tracking branch 'upstream/corr-diff' into dcycles
leuty Aug 5, 2025
4904a53
rename
leuty Aug 5, 2025
58c6607
plot map of the 99th all-hour percentile
leuty Aug 5, 2025
588681a
plot mean
leuty Aug 5, 2025
3ca921c
Update model compilation
Aug 6, 2025
5212ef3
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Aug 6, 2025
f445342
turn on optimizations
Aug 6, 2025
062c8d8
add some common utility functions
leuty Aug 6, 2025
392637d
use get_channel_indices function
leuty Aug 6, 2025
4dbaae3
remove slurm account variables
Aug 6, 2025
fd8434e
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Aug 6, 2025
63bb3cb
put more constants in plotting, use land-mask from plotting
leuty Aug 6, 2025
6236e7e
re-use concat_and_group_diurnal
leuty Aug 6, 2025
489d0a6
remove unneeded function
leuty Aug 6, 2025
e2b1aec
cleanup dcycles
leuty Aug 6, 2025
3928d6e
cleanup
leuty Aug 6, 2025
0738df2
rename
leuty Aug 7, 2025
ecb686f
unify maps in one script and more stats
leuty Aug 7, 2025
622a36c
point to store
leuty Aug 7, 2025
42250be
conserve memory and cleanup
leuty Aug 7, 2025
df1bf72
Merge pull request #19 from leuty/dcycles
marymcglo Aug 13, 2025
66f429e
Merge branch 'main' into corr-diff
Aug 20, 2025
5990c98
Merge branch 'main' into corr-diff
Aug 20, 2025
2729654
remove dependencies from pyproject
Aug 25, 2025
8b53777
change conversion factor to get mm instead of cm
Aug 25, 2025
9fe0b69
add __init__ package for input_data
Aug 25, 2025
630086a
Clean up copernicus processing
Aug 25, 2025
a7a541f
Generalize copernicus script to work with grib/netcdf
Aug 26, 2025
4a963ee
add copy anemoi script for copying from catalogue
Aug 26, 2025
b984129
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Aug 26, 2025
fcef8fc
move anemoi script to input_data directory
Aug 26, 2025
41208b2
Set training parameters in configs; add input/output channel names se…
Aug 26, 2025
c6c52e7
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Aug 26, 2025
d28a8be
Enhance evaluation scripts to include regression predictions
Aug 26, 2025
5a5412e
one-off script to regrid copernicus data
Aug 27, 2025
7871876
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Aug 27, 2025
f1d7101
Add input_data dependencies to container
Aug 27, 2025
3642980
add grid variable
Aug 27, 2025
fe621ac
more on processing copernicus data
Sep 1, 2025
4aecb65
removing old file
Sep 1, 2025
5d12c4c
Log config into training script logs
Sep 1, 2025
0a28801
Make InfiniteSampler start from where it left off when resuming train…
Sep 9, 2025
19cc78b
Move model compilation after checkpoint loading and fix checkpoint to…
Sep 10, 2025
1d4865f
Refactor image_batching and image_fuse to handle input tensor dtype c…
Sep 10, 2025
4b574e8
Validate max_patch_per_gpu against batch_size_per_gpu
Sep 10, 2025
40d1792
Add method to load static variables
Sep 16, 2025
4d6b9cb
Add static variable config
Sep 16, 2025
ef4f82a
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Sep 16, 2025
bfe2b35
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Sep 16, 2025
3ed02fa
Add a script to check input data for missing/corrupt/nan data
Sep 18, 2025
9d858eb
Enhance ERA5_COSMO dataset class with static channels option and Box-…
Sep 19, 2025
caaa424
Add lead time label param in RegressionLoss for consistency and refac…
Sep 19, 2025
c2e3160
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Sep 19, 2025
7ecc3d9
fix bug in dataset initialization
Sep 19, 2025
b9a843e
Add a less heavyweight Dockerfile that doesn't use nvidia stuff, for …
Sep 24, 2025
3a1ae7c
Modifications to input data tests to optionally load the torch files
Sep 24, 2025
e77c61a
Scripts used to reprocess data for various runs
Sep 24, 2025
66e3fc9
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Sep 24, 2025
88dcc01
Add reprocessing script
Oct 2, 2025
425050f
add functionality to have hour of the day and month of the year embed…
Oct 2, 2025
d8090c3
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Oct 2, 2025
6a09fb6
update config to latest features
Oct 2, 2025
ba08752
remove scoringrules package, not installed
leuty Oct 8, 2025
febbfe8
put snapshot maps into own sbatch script
leuty Oct 8, 2025
778f52b
Eval More 10m winds (#23)
leuty Oct 9, 2025
9783e10
first draft of regridding REA-L-CH1
Oct 20, 2025
2d0d0f8
remove unused function
Oct 20, 2025
2aaa067
add conversion to geo coords
Oct 20, 2025
608b988
use plotting function from interpolate_baisic
Oct 21, 2025
705e652
pip install meteodata-lab
Oct 21, 2025
3a54d18
config file for REA-L-CH1
Oct 21, 2025
7e06e91
skeleton for realch1 interpolation task
Oct 21, 2025
3dcb088
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Oct 21, 2025
1829e93
Enhance generation process with randomized sampler over time.
Oct 22, 2025
019c3a6
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Oct 22, 2025
ae6d8ff
Wind eval streaming (#24)
leuty Oct 27, 2025
68c3ae8
some updates to regridding (still not working)
Oct 28, 2025
f5a4f79
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Oct 28, 2025
603f3d8
regridding is actually working now.
Oct 28, 2025
87c5206
skeleton of full regrid script
Oct 28, 2025
dd79f5e
Regridding REA-L-CH1 script complete
Oct 28, 2025
a5acccd
Add trim_edge functionality to regridding realch1
Oct 29, 2025
93b5a5f
do rea-l-ch1 regridding in batches to improve performance/memory usage
Oct 30, 2025
9f42baa
small refactoring and cleanup so it is easier to re-use era5-cosmo in…
Oct 30, 2025
7b9364f
bit more skeleton to new interpolation
Nov 3, 2025
d2ca5e5
working script
Nov 4, 2025
d0b9c1f
changes to copernicus regrid
Nov 5, 2025
c9b0f37
config updates
Nov 5, 2025
72ecbde
updates to CRPS eval
Nov 7, 2025
0047a68
fix shading
leuty Nov 10, 2025
eb6bf2c
remove unused libraries
Nov 11, 2025
01e367c
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Nov 11, 2025
b59425a
fix mlflow logging of learning rate
Nov 14, 2025
b227057
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Nov 14, 2025
aa1e1a0
hdf5
Nov 17, 2025
c34ea08
cleanup hdf5
Nov 17, 2025
886949f
lmdb
Nov 17, 2025
b9a45aa
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Nov 18, 2025
e07ec16
corrected dimensions in hdf db (and new path)
Nov 18, 2025
50e209a
un-comment the write db commands
Nov 18, 2025
6b34c32
Make cosmo dataset optional.
Nov 18, 2025
853fca0
update era location
Nov 18, 2025
e4af6a5
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Nov 18, 2025
3fd375d
correct dims for output grid
Nov 18, 2025
e91ad57
add bunch of configs for batches
Nov 20, 2025
249bab5
correct some things
Nov 20, 2025
e845276
remove unused script
Nov 20, 2025
a2245f1
add interpolate script
Nov 20, 2025
564a028
oops commented line in the middle of the script isn't good
Nov 20, 2025
4e85117
refactor interpolate_basic script to be less complex
Nov 20, 2025
4c5bee5
add balfrin scripts
Nov 20, 2025
fe4f920
Merge branch 'interpolate' of https://github.com/MeteoSwiss/HiRAD-Gen…
Nov 20, 2025
b33ae25
Add comments and make interpolate_basic a bit more generalizable.
Nov 25, 2025
88a92df
remove ensemble dimension from processing
Nov 25, 2025
8480de4
separate out era5-cosmo specific main method into its own script
Nov 25, 2025
a45c4d7
correct error in anemoi open dataset
Nov 26, 2025
1c80ed8
wrong arg name
Nov 26, 2025
6654769
small fixes, script runs now
Nov 26, 2025
bc44d94
add methods for netcdf interpolation (draft)
Nov 26, 2025
054bce7
mostly working netcdf regridding refactor. However, there is an off-b…
Nov 27, 2025
66b6dba
fixed off-by-one error
Nov 27, 2025
823663c
cleanup and comments
Nov 27, 2025
fbc4eb5
add logging, and small refactor of regrid_realc1 to more closely matc…
Dec 4, 2025
334bd54
a few processing scripts
Dec 4, 2025
ea07ee6
start of era5->realch1 processing script
Dec 4, 2025
f208f45
add a bunch of configs
Dec 4, 2025
526a30a
fix bug in channel indexing in dataset
Dec 5, 2025
ed71054
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Dec 5, 2025
b7b6bcc
add dependencies (attempt)
Dec 8, 2025
64b0982
attempted fix
Dec 9, 2025
7310bda
regridding dependencies and bug fixes
Dec 11, 2025
fb7bc9e
stash some stuff
Dec 16, 2025
37aa3fa
add static and stats calc to ralch1
Dec 17, 2025
836f223
reorg files
Dec 17, 2025
6f1b0ed
adapt snapshots to be able to select channels to plot
Dec 19, 2025
42eed90
fix bug in snapshots
Dec 19, 2025
a8e3345
update dataset for realch1
Dec 19, 2025
f1b678a
update plotting to use new channel metadata to string functions
Dec 19, 2025
44b33ac
add script to make video from snapshots
Dec 23, 2025
aa888eb
add script to calculate datasets stats after transforming data
Dec 23, 2025
858020d
enable torch compile during distributed training
Dec 24, 2025
8a3f63e
enable time logging of data fetching
Dec 24, 2025
baf645f
add realch1 dataset training configs
Dec 24, 2025
a7f1671
Refactor evaluation scripts for improved configuration handling and u…
Jan 5, 2026
bbb8dea
changed config files
Jan 5, 2026
f191b5c
modify dataset configs
Jan 5, 2026
21f6234
clean up configs further
Jan 5, 2026
9a706f0
add config for cosmo model evaluation
Jan 6, 2026
2d88c82
Merge branch 'interpolate' into corr-diff. Most of this will be obsol…
Jan 14, 2026
643b9c8
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Jan 15, 2026
3300354
update base image to the one Fabian gave us that has the correct entr…
Mar 5, 2026
7f93f07
update base image to the CSCS one with correct entrypoint
Mar 5, 2026
a3d6216
add variables so runner can conform to launch constraints listed at h…
Mar 5, 2026
7e1e943
Merge pull request #26 from MeteoSwiss/hirad-50
marymcglo Mar 9, 2026
ba58584
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Mar 10, 2026
6a24532
add class for irregular grid interpolation
Jan 14, 2026
1e8eb11
enable interpolation on gpu-s using pytorch
Jan 14, 2026
726cbd6
adjust netcdf interpolation to use GridData (for copernicus data proc…
Jan 19, 2026
154e6cd
draft of anemoi-based data loader
Jan 21, 2026
3d09f5b
fix anemoi dataset to be compatible with training loop
Jan 26, 2026
f363d0f
extend custom interpolation to be able to interpolate batches of data
Jan 26, 2026
f7fb4ff
enable validation with anemoi dataset
Jan 27, 2026
93c5715
Take static channels and date embedding appending logic out of datase…
Feb 6, 2026
e1867fb
Optimize wind and precip stats calculations
Feb 19, 2026
e570193
Fix a bug in diurnal stats grouping
Feb 19, 2026
cf16740
Update to first interpolating and then transforming and normalizing data
Feb 19, 2026
12e6060
clean up
Mar 12, 2026
6aaef65
Add anemoi dataset without copernicus tp data
Mar 12, 2026
af0c9ca
fix bug in validation low res image flipping
Mar 12, 2026
267f027
clean up models
Mar 17, 2026
d9ae1dc
update environment
Mar 17, 2026
7b79229
update sbatch scripts and configs for training, generation and valida…
Mar 17, 2026
430d8e3
old dataset path fix
Mar 17, 2026
c380b90
add tests for utils, losses and part of models
Mar 18, 2026
52ab20a
readme update
Mar 18, 2026
af8725c
add example showcase to README
Mar 18, 2026
3676f14
add tests to CI check
Mar 18, 2026
85fd379
build CI docker image from cscs base image
Mar 18, 2026
3f8a327
dir location attempt
Mar 18, 2026
40e184d
another attempt to find code in dockerfile
Mar 18, 2026
4c88aff
code found
Mar 18, 2026
c192576
small bug
Mar 18, 2026
1b7f347
pls work
Mar 18, 2026
c7ab492
workdir issue
Mar 18, 2026
c097327
increase memory of container build
Mar 19, 2026
0f276a1
remove editable hirad installation
Mar 19, 2026
6abb782
copy sourcecode to tmp
Mar 19, 2026
575ef23
activate unit tests
Mar 19, 2026
b322a54
fix path
Mar 19, 2026
9d06b8a
check current dir
Mar 19, 2026
f353ada
try to move to /opt
Mar 19, 2026
d1da8b1
workdir attempt
Mar 19, 2026
ac07a8b
add tests to CI check
Mar 18, 2026
466faa8
Merge branch 'streaming-interpolation' of https://github.com/MeteoSwi…
Mar 19, 2026
f199fba
Merge pull request #27 from MeteoSwiss/streaming-interpolation
PetarStam Mar 19, 2026
cd9a13c
update .gitignore
Mar 19, 2026
0af1ea6
Merge branch 'corr-diff' of https://github.com/MeteoSwiss/HiRAD-Gen i…
Mar 19, 2026
0b552fd
update unet and tests
Mar 19, 2026
65572b6
add preconditioning model tests
Mar 20, 2026
be56284
add Song UNet tests
Mar 23, 2026
f6a6d90
fix errors in tests
Mar 23, 2026
88855e9
change var calculation in group norm eval to unbiased to match pytorc…
Mar 24, 2026
83fc415
add tests for nn layers
Mar 24, 2026
41876cd
refactor training script to be more readable and extendable
Mar 31, 2026
50545b5
tests for training and small fixes
Apr 2, 2026
23a7add
switch skip_scale from numpy to float
Apr 2, 2026
ead9539
add torch inference mode context
Apr 2, 2026
1ebe566
change CI config to santis
Apr 2, 2026
0400085
add anemoi-datasets to container
Apr 2, 2026
db4f274
use torch compile by default in inference
Apr 2, 2026
88967cd
Merge pull request #28 from MeteoSwiss/unit-tests
PetarStam Apr 7, 2026
9 changes: 6 additions & 3 deletions .gitignore
@@ -175,6 +175,8 @@ pyrightconfig.json
*.torch
plots/*
*.npz
outputs/*
logs/*

# conda
.conda/*
@@ -183,7 +185,6 @@ plots/*
temp.*

# local script
interpolate.sh
cosmo-grid.npz
*.out
.conda/*
@@ -194,7 +195,9 @@ temp.zarr.sync*
src/hirad/eval/__pycache__/*
interpolate_basic.log
interpolated.torch
mlruns/
.secrets.env
out
core
*.png
*.nc
*.nc
*.err
261 changes: 171 additions & 90 deletions README.md
@@ -2,121 +2,211 @@

HiRAD-Gen is short for high-resolution atmospheric downscaling using generative models. This repository contains the code and configuration required to train and use the model.

## Installation (Alps)
[Showcase](#showcase)
[Setup - clariden/santis](#setup-claridensantis-container-environment)
[Inference - clariden/santis](#running-inference-on-alps)
[Regression training - clariden/santis](#run-regression-model-training-alps)
[Diffusion training - clariden/santis](#run-diffusion-model-training-alps)

## Showcase

<table>
<tr>
<th width="5%"></th>
<th width="31%">Input ERA5</th>
<th width="31%">Prediction</th>
<th width="31%">Target REAL-CH1</th>
</tr>
<tr>
<td><b>2t</b></td>
<td><img src="docs/images/showcase/2t-input.png" width="100%"></td>
<td><img src="docs/images/showcase/2t-pred.png" width="100%"></td>
<td><img src="docs/images/showcase/2t-target.png" width="100%"></td>
</tr>
<tr>
<td><b>10u</b></td>
<td><img src="docs/images/showcase/10-input.png" width="100%"></td>
<td><img src="docs/images/showcase/10u-pred.png" width="100%"></td>
<td><img src="docs/images/showcase/10-target.png" width="100%"></td>
</tr>
<tr>
<td><b>10v</b></td>
<td><img src="docs/images/showcase/10v-input.png" width="100%"></td>
<td><img src="docs/images/showcase/10v-pred.png" width="100%"></td>
<td><img src="docs/images/showcase/10v-target.png" width="100%"></td>
</tr>
<tr>
<td><b>tp</b></td>
<td><img src="docs/images/showcase/tp-input.png" width="100%"></td>
<td><img src="docs/images/showcase/tp-pred1.png" width="100%"></td>
<td><img src="docs/images/showcase/tp-target.png" width="100%"></td>
</tr>
</table>

### Ensemble Total Precipitation 1h

<table>
<tr>
<td width="50%"><img src="docs/images/showcase/tp-pred1.png" width="100%"></td>
<td width="50%"><img src="docs/images/showcase/tp-pred2.png" width="100%"></td>
</tr>
<tr>
<td><img src="docs/images/showcase/tp-pred3.png" width="100%"></td>
<td><img src="docs/images/showcase/tp-pred4.png" width="100%"></td>
</tr>
</table>

## Setup clariden/santis container environment
The container environment needed to run training and inference experiments on clariden/santis is defined in this repository under `ci/edf/modulus_env.toml`. The squashed image is on clariden/alps under `/capstor/scratch/cscs/pstamenk/corr_diff.sqsh`. All jobs can be run in this environment without additional installation or setup.

To set up the environment for **HiRAD-Gen** on Alps supercomputer, follow these steps:

1. **Start the PyTorch user environment**:
```bash
uenv start pytorch/v2.6.0:v1 --view=default
```

2. **Create a Python virtual environment** (replace `{env_name}` with your desired environment name):
```bash
python -m venv ./{env_name}
```

3. **Activate the virtual environment**:
```bash
source ./{env_name}/bin/activate
```

4. **Install project dependencies**:
```bash
pip install -e .
```

This will set up the necessary environment to run HiRAD-Gen within the Alps infrastructure.

## Training
## Inference

### Run regression model training (Alps)
### Running inference on Alps

1. Script for running the training of regression model is in `src/hirad/train_regression.sh`.
1. Script for running the inference is in `src/hirad/generate.sh`.
Inside this script set the following:
```bash
### OUTPUT ###
#SBATCH --output=your_path_to_output_log
#SBATCH --error=your_path_to_output_error
#SBATCH --output=path_to_output_log
#SBATCH --error=path_to_output_error
```
```bash
#SBATCH -A your_compute_group
#SBATCH -A compute_group
```
```bash
srun bash -c "
. ./{your_env_name}/bin/activate
python src/hirad/training/train.py --config-name=training_era_cosmo_regression.yaml
srun --mpi=pmix --network=disable_rdzv_get --environment=./ci/edf/modulus_env.toml bash -c "
pip install -e .
python src/hirad/inference/generate.py --config-name=main-config-file-in-src/hirad/conf.yaml
"
```

2. Set up the following config files in `src/hirad/conf`:

- In `training_era_cosmo_regression.yaml` set:
- In main config file (by default `generate_era_real.yaml`) set:
```
hydra:
run:
dir: your_path_to_save_training_output
dir: your_path_to_save_inference_output
```
- In `training/era_cosmo_regression.yaml` set:
- In generation config file (by default `generation/era_real.yaml`):
Choose the inference mode:
```
hp:
training_duration: number of samples to train for (set to 4 for debugging, 512 fits into 30 minutes on 1 gpu with total_batch_size: 4)
inference_mode: all/regression/diffusion
```
By default, `all` runs both regression and diffusion. Depending on the mode, pretrained weights for the regression and/or diffusion model must be provided:
```
- In `dataset/era_cosmo.yaml` set the `dataset_path` if different from default.
io:
res_ckpt_path: path_to_directory_containing_diffusion_training_model_checkpoints
reg_ckpt_path: path_to_directory_containing_regression_training_model_checkpoints
```
Finally, a subset of the dataset's time steps can be chosen for inference.

One way is to list the time steps under `times:` in the format `%Y%m%d-%H%M` (for the era5_cosmo dataset).

The other way is to specify `times_range:` with three items: first time step (`%Y%m%d-%H%M`), last time step (`%Y%m%d-%H%M`), and hour shift (int). The hour shift is the interval in hours between consecutive time steps of the dataset.

3. Submit the job with:
```bash
sbatch src/hirad/train_regression.sh
sbatch src/hirad/generate.sh
```
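
The `times_range` setting described in step 2 can be illustrated with a small Python sketch. It is not taken from `generate.py`: the function name `expand_times_range` is hypothetical, and treating the last time step as inclusive is an assumption.

```python
from datetime import datetime, timedelta

# Time step format quoted in the README.
TIME_FORMAT = "%Y%m%d-%H%M"


def expand_times_range(first, last, hour_shift):
    """List the time steps from first to last, hour_shift hours apart.

    Hypothetical helper; the last time step is assumed to be inclusive.
    """
    if hour_shift <= 0:
        raise ValueError("hour_shift must be a positive integer")
    current = datetime.strptime(first, TIME_FORMAT)
    end = datetime.strptime(last, TIME_FORMAT)
    times = []
    while current <= end:
        times.append(current.strftime(TIME_FORMAT))
        current += timedelta(hours=hour_shift)
    return times
```

Under these assumptions, `expand_times_range("20160101-0000", "20160101-1800", 6)` yields the four 6-hourly steps of that day, which could equally be written out one by one under `times:`.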

### Run diffusion model training (Alps)
Before training diffusion model, checkpoint for regression model has to exist.
### Visualizing results

1. Script for running the training of diffusion model is in `src/hirad/train_diffusion.sh`.
After generation has finished, the results can be visualized using `src/hirad/snapshots.sh`. Set:
```bash
### OUTPUT ###
#SBATCH --output=path_to_output_log
```
```bash
### ENVIRONMENT ####
#SBATCH -A compute_group
```
```bash
srun --mpi=pmix --network=disable_rdzv_get --environment=./ci/edf/modulus_env.toml bash -c "
pip install -e .
python src/hirad/eval/snapshots.py --config-name=src/hirad/conf/config-file-in-src/hirad/conf.yaml
"
```
In config file (by default `eval_real.yaml`) set:
```bash
# Path to the inference output directory
inference_output_dir: '/path/to/generated/results/directory'
results_dir_name: 'name_of_directory_to_save_output_plots'
```
To generate plots for a subset of the inference times, set one of the following (same convention as in the generation config):
```
times: list of times to visualize
times_range: [start time, end time, time step] to visualize
```

Other settings can be changed according to the output grid.

Submit the job with:
```bash
sbatch src/hirad/snapshots.sh
```

### Evaluation of generated data

Evaluation of generated samples can be done using `src/hirad/eval_precip.sh` and `src/hirad/eval_wind.sh`. Set:
```bash
### OUTPUT ###
#SBATCH --output=path_to_output_log

### ENVIRONMENT ####
#SBATCH -A compute_group

### CONFIG ###
CONFIG_NAME="src/hirad/conf/config_file.yaml"
```
The default config file is the same as for visualization, `eval_real.yaml`, and requires the same fields to be set. Both `eval_precip.sh` and `eval_wind.sh` call several Python scripts, all of which are commented out by default. Uncomment the ones you want to run.

Submit jobs with:
```bash
sbatch src/hirad/eval_precip.sh
sbatch src/hirad/eval_wind.sh
```

## Training

### Run regression model training (Alps)

1. Script for running the training of regression model is in `src/hirad/train_regression.sh`. Here, you can change the sbatch settings.
Inside this script set the following:
```bash
### OUTPUT ###
#SBATCH --output=your_path_to_output_log
#SBATCH --error=your_path_to_output_error
#SBATCH --output=path_to_output_log
#SBATCH --error=path_to_output_error
```
```bash
#SBATCH -A your_compute_group
#SBATCH -A compute_group
```
```bash
srun bash -c "
. ./{your_env_name}/bin/activate
python src/hirad/training/train.py --config-name=training_era_cosmo_diffusion.yaml
srun --mpi=pmix --network=disable_rdzv_get --environment=./ci/edf/modulus_env.toml bash -c "
pip install -e .
python src/hirad/training/train.py --config-name=main-config-file-in-src/hirad/conf.yaml
"
```

2. Set up the following config files in `src/hirad/conf`:

- In `training_era_cosmo_diffusion.yaml` set:
- In main config file (by default `training_era_real_regression.yaml`) set:
```
hydra:
run:
dir: your_path_to_save_training_output
```
- In `training/era_cosmo_regression.yaml` set:
```
hp:
training_duration: number of samples to train for (set to 4 for debugging, 512 fits into 30 minutes on 1 gpu with total_batch_size: 4)
io:
regression_checkpoint_path: path_to_directory_containing_regression_training_model_checkpoints
dir: your_path_to_save_training_outputs
```
- In `dataset/era_cosmo.yaml` set the `dataset_path` if different from default.
- All other parameters for regression training can be changed in the main config file and the config files it references (the default values work for debugging purposes).

3. Submit the job with:
```bash
sbatch src/hirad/train_diffusion.sh
sbatch src/hirad/train_regression.sh
```

## Inference

### Running inference on Alps
### Run diffusion model training (Alps)
Before training diffusion model, checkpoint for regression model has to exist.
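
A pre-flight check for this requirement could look like the following Python sketch. It is only an illustration: the helper name `regression_checkpoint_ready` is hypothetical, and because the checkpoint file naming is not specified here, the check only assumes that the directory exists and contains at least one file.

```python
from pathlib import Path


def regression_checkpoint_ready(checkpoint_dir):
    """Return True if checkpoint_dir exists and holds at least one file.

    Hypothetical helper: it does not inspect file names or contents,
    since the checkpoint naming scheme is not assumed here.
    """
    path = Path(checkpoint_dir)
    return path.is_dir() and any(item.is_file() for item in path.iterdir())
```

Calling it with the directory you intend to set as `regression_checkpoint_path` before submitting the diffusion job avoids queuing a run that cannot start.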

1. Script for running the inference is in `src/hirad/generate.sh`.
Inside this script set the following:
1. Script for running the training of diffusion model is in `src/hirad/train_diffusion.sh`. Here, you can change the sbatch settings. Inside this script set the following:
```bash
### OUTPUT ###
#SBATCH --output=your_path_to_output_log
@@ -126,42 +216,33 @@ Inside this script set the following:
#SBATCH -A your_compute_group
```
```bash
srun bash -c "
. ./{your_env_name}/bin/activate
python src/hirad/inference/generate.py --config-name=generate_era_cosmo.yaml
srun --mpi=pmix --network=disable_rdzv_get --environment=./ci/edf/modulus_env.toml bash -c "
pip install -e .
python src/hirad/training/train.py --config-name=main-config-file-in-src/hirad/conf.yaml
"
```

2. Set up the following config files in `src/hirad/conf`:

- In `generate_era_cosmo.yaml` set:
- In main config file (by default `training_era_real_diffusion_patched.yaml`) set:
```
hydra:
run:
dir: your_path_to_save_inference_output
```
- In `generation/era_cosmo.yaml`:
Choose the inference mode:
```
inference_mode: all/regression/diffusion
dir: your_path_to_save_training_output
```
by default `all` does both regression and diffusion. Depending on mode, regression and/or diffusion model pretrained weights should be provided:
- In training config file (by default `training/era_real_diffusion_patched.yaml`) set:
```
io:
res_ckpt_path: path_to_directory_containing_diffusion_training_model_checkpoints
reg_ckpt_path: path_to_directory_containing_regression_training_model_checkpoints
regression_checkpoint_path: path_to_directory_containing_regression_training_model_checkpoints
```
Finally, from the dataset, subset of time steps can be chosen to do inference for.

One way is to list steps under `times:` in format `%Y%m%d-%H%M` for era5_cosmo dataset.

The other way is to specify `times_range:` with three items: first time step (`%Y%m%d-%H%M`), last time step (`%Y%m%d-%H%M`), hour shift (int). Hour shift specifies distance in hours between closest time steps for specific dataset (6 for era_cosmo).

By default, inference is done for one time step `20160101-0000`

- In `dataset/era_cosmo.yaml` set the `dataset_path` if different from default.
- All other training parameters can be changed in the main config file and the config files it references (the default values work for debugging purposes).

3. Submit the job with:
```bash
sbatch src/hirad/generate.sh
sbatch src/hirad/train_diffusion.sh
```

## MLflow logging

During training, MLflow can be used to log metrics.
Logging config files for regression and diffusion are located in `src/hirad/conf/logging/`. Set `method` to `mlflow` and specify `uri` to log to a remote server; otherwise the run is logged locally in the output directory. Other options can also be modified here.
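
The choice between remote and local logging can be sketched in Python as below. This is not the repository's logging code: the function name `resolve_tracking_uri` and the `mlruns` subdirectory used for the local case are assumptions made for illustration.

```python
from pathlib import Path


def resolve_tracking_uri(method, uri, output_dir):
    """Pick an MLflow tracking URI, or None when MLflow logging is off.

    Hypothetical helper; the local "mlruns" subdirectory is an assumption.
    """
    if method != "mlflow":
        return None
    if uri:
        # A remote tracking server was configured.
        return uri
    # Local fallback: a file store inside the run's output directory.
    return (Path(output_dir) / "mlruns").resolve().as_uri()
```

The returned value is the kind of string that `mlflow.set_tracking_uri` accepts before a run is started.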
26 changes: 16 additions & 10 deletions ci/cscs.yml
@@ -12,14 +12,20 @@ build_job:
stage: build
extends: .container-builder-cscs-gh200
variables:
DOCKERFILE: ci/docker/Dockerfile
DOCKERFILE: ci/docker/Dockerfile.ci
KUBERNETES_MEMORY_REQUEST: '64Gi'
KUBERNETES_MEMORY_LIMIT: '64Gi'

#test_job:
# stage: test
# extends: .container-runner-clariden-gh200
# image: $PERSIST_IMAGE_NAME
# script:
# - /opt/helloworld/bin/hello
# variables:
# SLURM_JOB_NUM_NODES: 2
# SLURM_NTASKS: 2
test_job:
stage: test
extends: .container-runner-santis-gh200
image: $PERSIST_IMAGE_NAME
script:
- pytest /opt/hirad-gen/tests -v
variables:
USE_MPI: NO
SLURM_MPI_TYPE: pmix
SLURM_NETWORK: disable_rdzv_get
PMIX_MCA_psec: native
SLURM_JOB_NUM_NODES: 1
SLURM_NTASKS: 1