Hello Maintainers,
I am trying to train Spender with DESI DR1 data. However, when I run the following command, I get the errors below:
python train/train_DESI.py ./DATA/ outfile -b 64
[...]
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torchinterp1d/interp1d.py", line 155, in backward
gradients = torch.autograd.grad(
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 394, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torchinterp1d/interp1d.py", line 155, in backward
gradients = torch.autograd.grad(
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 394, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torchinterp1d/interp1d.py", line 155, in backward
gradients = torch.autograd.grad(
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 394, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torchinterp1d/interp1d.py", line 155, in backward
gradients = torch.autograd.grad(
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 394, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torchinterp1d/interp1d.py", line 155, in backward
gradients = torch.autograd.grad(
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 394, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: thread constructor failed: Resource temporarily unavailable
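The final message, "Resource temporarily unavailable", is the standard text for the EAGAIN error, which a thread constructor typically reports when the operating system refuses to create another thread. As a rough diagnostic sketch (an assumption about where to look, not a confirmed cause), the per-user process limit and the CPU count can be inspected with the Python standard library on macOS or Linux. The helper name `describe_limits` is hypothetical, and on macOS the per-process thread ceiling may be governed by separate kernel settings that this does not show:

```python
import os
import resource


def describe_limits():
    """Return the soft/hard per-user process limits and the CPU count."""
    # RLIMIT_NPROC caps the number of processes for the current user;
    # on Linux, threads count against this limit as well.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {"nproc_soft": soft, "nproc_hard": hard, "cpu_count": os.cpu_count()}


if __name__ == "__main__":
    for name, value in describe_limits().items():
        # A value equal to resource.RLIM_INFINITY means "no limit".
        print(f"{name}: {value}")
```

Comparing these numbers with the output of `ulimit -u` in the same shell that launches the training script may help tell whether a limit is being hit.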
After adding torch.autograd.set_detect_anomaly(True) to train_DESI.py, I got an additional traceback showing where the issue originates:
/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.13/site-packages/torch/autograd/graph.py:841: UserWarning: Error detected in Interp1dBackward.
Traceback of forward call that caused the error:
File "/Users/ng27753/Astronomy_Research/spender/train/train_DESI.py", line 463, in <module>
train(models, instruments, trainloaders, validloaders, n_epoch=n_epoch,
File "/Users/ng27753/Astronomy_Research/spender/train/train_DESI.py", line 309, in train
losses = get_losses(
File "/Users/ng27753/Astronomy_Research/spender/train/train_DESI.py", line 176, in get_losses
loss, sim_loss, s = _losses(model, instrument, batch, similarity=similarity, slope=slope)
File "/Users/ng27753/Astronomy_Research/spender/train/train_DESI.py", line 159, in _losses
loss = model.loss(spec, w, instrument, z=z, s=s)
File "/Users/ng27753/Astronomy_Research/spender/spender/model.py", line 557, in loss
s, x, y_, valid = self._forward(y, instrument=instrument, z=z, s=s, normalize=normalize)
File "/Users/ng27753/Astronomy_Research/spender/spender/model.py", line 489, in _forward
reconstruction, valid = self.decoder.transform(restframe, instrument=instrument, z=z, return_valid=True)
File "/Users/ng27753/Astronomy_Research/spender/spender/model.py", line 344, in transform
spectrum = interp1d(wave_redshifted, x, wave_obs)
File "/Users/ng27753/Astronomy_Research/spender/.venv/lib/python3.13/site-packages/torch/autograd/function.py", line 581, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
(Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:127.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
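For reference, anomaly detection can also be scoped to a single block with the `torch.autograd.detect_anomaly()` context manager instead of being enabled globally. A minimal sketch on a toy tensor (the tensor and its values are illustrative and not taken from train_DESI.py):

```python
import torch

# Toy input; in the training script this would be the model's forward pass.
x = torch.tensor([1.0, 2.0], requires_grad=True)

# Within this block, autograd records forward tracebacks and checks for NaNs
# during the backward pass, so a failing backward points at its forward call.
with torch.autograd.detect_anomaly():
    y = (x * 2).sum()
    y.backward()

print(x.grad)  # tensor([2., 2.])
```

Anomaly mode slows training noticeably, so it is usually worth limiting it to the step being debugged.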
Here is my pip freeze as well:
❯ pip freeze
absl-py==2.3.1
accelerate==1.10.1
astropy==6.0.1
astropy-iers-data==0.2025.10.20.0.39.8
certifi==2025.10.5
charset-normalizer==3.4.4
contourpy==1.3.0
cycler==0.12.1
filelock==3.19.1
fonttools==4.60.1
fsspec==2025.9.0
GPUtil==1.4.0
grpcio==1.76.0
h5py==3.14.0
hf-xet==1.2.0
huggingface-hub==0.36.0
humanize==4.13.0
idna==3.11
importlib_metadata==8.7.0
importlib_resources==6.5.2
Jinja2==3.1.6
kiwisolver==1.4.7
Markdown==3.9
MarkupSafe==3.0.3
matplotlib==3.9.4
mpmath==1.3.0
networkx==3.2.1
nflows==0.14
numpy==1.26.4
packaging==25.0
pillow==11.3.0
protobuf==6.33.0
psutil==7.1.1
pyerfa==2.0.1.5
pyparsing==3.2.5
python-dateutil==2.9.0.post0
PyYAML==6.0.3
requests==2.32.5
safetensors==0.6.2
six==1.17.0
spender==0.2.8
sympy==1.14.0
tensorboard==2.20.0
tensorboard-data-server==0.7.2
torch==2.1.0
torchinterp1d==1.1
tqdm==4.67.1
typing_extensions==4.15.0
urllib3==2.5.0
Werkzeug==3.1.3
zipp==3.23.0
The problem seems to occur here, but I have been unable to find a fix:
spectrum = interp1d(wave_redshifted, x, wave_obs)

# need to zero out parts of the spectrum that our outside of the restframe range (see #34)
valid = wave_obs[None,:] > self.wave_rest[0] * (1 + z[:,None])
valid &= wave_obs[None,:] < self.wave_rest[-1] * (1 + z[:,None])
spectrum[~valid] = 0

# convolve with LSF
if instrument.lsf is not None:
    spectrum = instrument.lsf(spectrum.unsqueeze(1)).squeeze(1)

# apply calibration function to observed spectrum
if instrument is not None and instrument.calibration is not None:
    spectrum = instrument.calibration(wave_obs, spectrum)

if return_valid:
    return spectrum, valid
return spectrum
spender/spender/model.py, lines 344 to 361 in 21df63e
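To make the masking step in that excerpt concrete, here is a small standalone sketch of the same comparison logic on made-up numbers. The wavelength grids, redshifts, and the all-ones spectrum are illustrative assumptions, not values from Spender or DESI:

```python
import torch

# Hypothetical grids: 4 observed wavelengths, rest-frame range [3000, 4000].
wave_obs = torch.tensor([4000.0, 5000.0, 6000.0, 7000.0])
wave_rest = torch.tensor([3000.0, 4000.0])
z = torch.tensor([0.5, 1.0])  # one redshift per spectrum in the batch

# Broadcasting gives a (batch, n_wave) mask: an observed pixel is valid only
# if it lies strictly inside the redshifted rest-frame range of its spectrum.
valid = wave_obs[None, :] > wave_rest[0] * (1 + z[:, None])
valid &= wave_obs[None, :] < wave_rest[-1] * (1 + z[:, None])

# Stand-in for the interpolated spectrum; pixels outside the range are zeroed.
spectrum = torch.ones(2, 4)
spectrum[~valid] = 0

print(valid)
print(spectrum)
```

With these numbers, the first spectrum (z = 0.5) covers 4500 to 6000 and keeps only the 5000 pixel, while the second (z = 1.0) covers 6000 to 8000 and keeps only the 7000 pixel.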