
OOM issues with 4608 and 5120 buckets #576

@gdlampe

Description

Hi!

I'm folding a variety of homomeric proteins and am running into OOM errors on the larger folds. I originally tried to fold everything in one session, but the larger folds triggered OOM, which made me think the larger buckets couldn't be compiled alongside the smaller ones. I am now running each bucket separately so that only one bucket is folded at a time, but for the 4608 bucket size I still get OOM. I'm running on an H100 with 80 GB of GPU memory, so I'm a little confused why.
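For context on the padding involved: the pipeline log below is consistent with a simple "smallest configured bucket that fits" rule. A minimal sketch of that rule (assumption on my part; `pick_bucket` is a hypothetical helper, not AlphaFold 3 API):

```python
# Bucket list passed via --buckets in the commands below.
BUCKETS = [256, 512, 768, 1024, 1280, 1536, 2048,
           2560, 3072, 3584, 4096, 4608, 5120]

def pick_bucket(num_tokens, buckets=BUCKETS):
    """Return the smallest bucket that fits, plus the padding added.

    Assumes the pipeline pads every input up to the chosen bucket size
    so that XLA only compiles one shape per bucket.
    """
    for b in sorted(buckets):
        if b >= num_tokens:
            return b, b - num_tokens
    raise ValueError(f"No bucket large enough for {num_tokens} tokens")

# The failing input from the log: 4184 tokens -> bucket 4608, 424 padded tokens.
print(pick_bucket(4184))  # -> (4608, 424)
```

This matches the `pipeline.py` log lines further down ("Got bucket size 4608 for input with 4184 tokens, resulting in 424 padded tokens").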

nvidia-smi output:

Fri Dec 19 20:27:14 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.274.02             Driver Version: 535.274.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          Off | 00000000:33:00.0 Off |                    0 |
| N/A   31C    P0             113W / 700W |   1957MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

docker command:

(
        set -x
        docker run --rm \
            --volume "$(pwd)/$bin_dir":/root/af_input \
            --volume "$(pwd)/$bin_output_dir":/root/af_output \
            --volume "$HOME/AF3_model":/root/models \
            --volume "$HOME/AF3_db/sharded_databases":/root/public_databases \
            --gpus all \
            -e XLA_PYTHON_CLIENT_PREALLOCATE=true \
            -e TF_FORCE_UNIFIED_MEMORY=true \
            $AF_IMAGE python run_alphafold.py \
                --input_dir=/root/af_input \
                --model_dir=/root/models \
                --output_dir=/root/af_output \
                --run_data_pipeline=false \
                --buckets=256,512,768,1024,1280,1536,2048,2560,3072,3584,4096,4608,5120
    ) 2>&1 | tee "${bin_output_dir}/${INPUT_BASENAME}_bin_${padded_num}.log"

log file from docker run:

+ docker run --rm --volume /home/ubuntu/remaining4/bin_02:/root/af_input --volume /home/ubuntu/remaining4_bin_02_output:/root/af_output --volume /home/ubuntu/AF3_model:/root/models --volume /home/ubuntu/AF3_db/sharded_databases:/root/public_databases --gpus all -e XLA_PYTHON_CLIENT_PREALLOCATE=true -e TF_FORCE_UNIFIED_MEMORY=true alphafold3 python run_alphafold.py --input_dir=/root/af_input --model_dir=/root/models --output_dir=/root/af_output --run_data_pipeline=false --buckets=256,512,768,1024,1280,1536,2048,2560,3072,3584,4096,4608,5120
I1219 18:12:08.387334 133477095690560 xla_bridge.py:895] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I1219 18:12:08.388193 133477095690560 xla_bridge.py:895] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
I1219 18:12:18.181271 133477095690560 pipeline.py:173] processing WP_147157570.1_copies_4, random_seed=1
I1219 18:12:18.784216 133477095690560 pipeline.py:266] Calculating bucket size for input with 4184 tokens.
I1219 18:12:18.784405 133477095690560 pipeline.py:272] Got bucket size 4608 for input with 4184 tokens, resulting in 424 padded tokens.
2025-12-19 18:13:21.515373: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 72.33GiB (77668778489 bytes) by rematerialization; only reduced to 82.68GiB (88779527564 bytes), down from 82.68GiB (88779527564 bytes) originally
2025-12-19 18:13:38.860537: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 81.01GiB (rounded to 86984403200)requested by op 
2025-12-19 18:13:38.861254: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ****________________________________________________________________________________________________
E1219 18:13:38.861577       1 pjrt_stream_executor_client.cc:3084] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 86984403160 bytes.
Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 981, in <module>

Running AlphaFold 3. Please note that standard AlphaFold 3 model parameters are
only available under terms of use provided at
https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md.
If you do not agree to these terms and are using AlphaFold 3 derived model
parameters, cancel execution of AlphaFold 3 inference with CTRL-C, and do not
use the model parameters.

Found local devices: [CudaDevice(id=0)], using device 0: cuda:0
Building model from scratch...
Checking that model parameters can be loaded...

Running fold job WP_147157570.1_copies_4...
Output will be written in /root/af_output/WP_147157570.1_copies_4
Skipping data pipeline...
Writing model input JSON to /root/af_output/WP_147157570.1_copies_4/WP_147157570.1_copies_4_data.json
Predicting 3D structure for WP_147157570.1_copies_4 with 1 seed(s)...
Featurising data with 1 seed(s)...
Featurising data with seed 1.
Featurising data with seed 1 took 48.11 seconds.
Featurising data with 1 seed(s) took 53.23 seconds.
Running model inference and extracting output structure samples with 1 seed(s)...
Running model inference with seed 1...
    app.run(main)
  File "/alphafold3_venv/lib/python3.12/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/alphafold3_venv/lib/python3.12/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/app/alphafold/run_alphafold.py", line 963, in main
    process_fold_input(
  File "/app/alphafold/run_alphafold.py", line 797, in process_fold_input
    all_inference_results = predict_structure(
                            ^^^^^^^^^^^^^^^^^^
  File "/app/alphafold/run_alphafold.py", line 543, in predict_structure
    result = model_runner.run_inference(example, rng_key)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/alphafold/run_alphafold.py", line 438, in run_inference
    result = self._model(rng_key, featurised_example)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 86984403160 bytes.
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
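For reference, the raw byte counts in the XLA log lines above can be sanity-checked against the card's capacity. A quick conversion (pure arithmetic, no AlphaFold code):

```python
# Convert the byte counts reported by XLA to GiB.
def to_gib(n_bytes):
    return n_bytes / 2**30

for label, n in [
    ("rematerialization floor", 77668778489),
    ("post-rematerialization use", 88779527564),
    ("failed allocation", 86984403160),
]:
    print(f"{label}: {to_gib(n):.2f} GiB")
# -> rematerialization floor: 72.33 GiB
# -> post-rematerialization use: 82.68 GiB
# -> failed allocation: 81.01 GiB
```

The single failed allocation (81.01 GiB) already exceeds the card's ~79.6 GiB (81559 MiB), and even the 72.33 GiB rematerialization floor leaves little headroom once model parameters and runtime overhead are resident, which is consistent with the OOM on a purely device-memory run.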

When I update the command to enable unified-memory spillover, as below:

(
        set -x
        docker run --rm \
            --volume "$(pwd)/$bin_dir":/root/af_input \
            --volume "$(pwd)/$bin_output_dir":/root/af_output \
            --volume "$HOME/AF3_model":/root/models \
            --volume "$HOME/AF3_db/sharded_databases":/root/public_databases \
            --gpus all \
            -e XLA_PYTHON_CLIENT_PREALLOCATE=true \
            -e TF_FORCE_UNIFIED_MEMORY=true \
            -e XLA_CLIENT_MEM_FRACTION=3.2 \
            $AF_IMAGE python run_alphafold.py \
                --input_dir=/root/af_input \
                --model_dir=/root/models \
                --output_dir=/root/af_output \
                --run_data_pipeline=false \
                --buckets=256,512,768,1024,1280,1536,2048,2560,3072,3584,4096,4608,5120
    ) 2>&1 | tee "${bin_output_dir}/${INPUT_BASENAME}_bin_${padded_num}.log"

it is able to run, but the unified-memory spillover penalty is substantial: one 4608-bucket fold takes ~35 minutes, whereas a 5120-bucket fold should take ~24 minutes according to the performance docs.

I am not entirely sure what may be wrong; any help would be much appreciated!

Labels: question (Further information is requested)