Add architecture-aware NIM bootstrap #2

Draft
nil16 wants to merge 28 commits into main from bionemo-singularity-neel

nil16 (Collaborator) commented May 18, 2026

Summary

  • adds a turnkey bootstrap flow for architecture-aware NIM startup
  • uses local MolMIM on amd64 with hosted MolMIM fallback, and hosted MolMIM on ARM
  • updates challenge notebooks and docs for the local Boltz2 plus hosted MolMIM path

Validation

  • bash -n scripts/openhackathon_services.sh scripts/bootstrap_bootcamp.sh scripts/check_nim_health.sh
  • python3 -m py_compile cdk_oracle/config.py cdk_oracle/nim_client.py cdk_oracle/pipeline.py
  • python3 -m json.tool challenge/03_Hands-On_CDK_Inhibitor_Design.ipynb
  • python3 -m json.tool mini-hands-on/03_Hands-On_CDK_Inhibitor_Design.ipynb
  • GB200 remote smoke test with hosted MolMIM and local Boltz2 health checks

nil16 changed the title from "[codex] Add architecture-aware NIM bootstrap" to "Add architecture-aware NIM bootstrap" May 18, 2026
nil16 and others added 9 commits May 23, 2026 04:30
MolMIM has no aarch64 NIM image, and BioNeMo Framework v1 is amd64 and
pre-Blackwell, so GB200/GB300 could not run local MolMIM (no /hidden or /decode,
which blocked CMA-ES-guided optimization). Add molmim_arm/: a pure-PyTorch MolMIM
(loads the official molmim_70m_24_3 checkpoint, no NeMo/Megatron/apex/TE) wrapped
in a FastAPI service that mirrors the MolMIM NIM REST surface, plus
run_molmim_arm.sh to build and run it. Wire --molmim local-arm through the
service scripts and fix a run_nim_docker.sh CUDA_VISIBLE_DEVICES remap bug for
--gpus device=N (N>0).

Boltz-2 now requires sampling_steps >= 10 (a value of 5 is rejected with
"Sampling steps must be at least 10, got 5"), so bump the readiness smoke test
from 5 to 10. Add nest_asyncio, which the notebooks need to run the async Boltz-2
affinity client inside a Jupyter kernel (otherwise the affinity cells raise
"This event loop is already running").

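A minimal sketch of why nest_asyncio is needed, assuming the notebooks call the async client from synchronous cell code. `run_coro` and `fake_affinity` are hypothetical names that stand in for the real client and are not taken from the repository:

```python
import asyncio


def run_coro(coro):
    """Run a coroutine to completion from synchronous code."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): asyncio.run works as usual.
        return asyncio.run(coro)
    # A loop is already running (for example in a Jupyter kernel). Patch it
    # with nest_asyncio so run_until_complete can be re-entered.
    import nest_asyncio

    nest_asyncio.apply(loop)
    return loop.run_until_complete(coro)


async def fake_affinity(smiles):
    """Hypothetical stand-in for the async Boltz-2 affinity call."""
    await asyncio.sleep(0)
    return {"smiles": smiles}


result = run_coro(fake_affinity("CCO"))
```

In a plain script the first branch runs and nest_asyncio is never imported; inside a kernel the second branch patches the running loop before re-entering it.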
Fix two pre-existing bugs in the CMA-ES tutorials (tutorials/04 and
mini-hands-on/04): an unterminated f-string from a split print(), and a fragile
optimization loop that dropped invalid/duplicate decodes and could pass fewer
than mu solutions to optimizer.tell() (a stochastic crash). The loop now asks for
the full population, penalizes invalid/duplicate molecules, and tells exactly the
solutions that were asked. Update notebook text across the setup and CDK
notebooks so ARM (GB200/GB300) users know to run local MolMIM via
--molmim local-arm for /hidden, /decode, and CMA-ES.

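The repaired loop can be sketched as follows. This is an illustration under assumptions, not the tutorial code: `run_generation`, `PENALTY`, and `RandomSearchStub` are hypothetical, the stub only imitates an ask()/tell() interface and is not a real CMA-ES, and the objective is assumed to be minimized.

```python
import random

PENALTY = 1e6  # assumed large cost for invalid or duplicate molecules


def run_generation(optimizer, decode, score, seen):
    """One ask/evaluate/tell step that tells every asked solution."""
    solutions = optimizer.ask()  # full population, never filtered
    costs = []
    for z in solutions:
        smiles = decode(z)  # returns None when decoding fails
        if smiles is None or smiles in seen:
            costs.append(PENALTY)  # penalize instead of dropping
            continue
        seen.add(smiles)
        costs.append(score(smiles))
    optimizer.tell(solutions, costs)  # same solutions that were asked
    return solutions, costs


class RandomSearchStub:
    """Hypothetical stand-in with ask()/tell(); not a real CMA-ES."""

    def __init__(self, dim, popsize, seed=0):
        self.dim = dim
        self.popsize = popsize
        self.rng = random.Random(seed)
        self.best = None

    def ask(self):
        return [[self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
                for _ in range(self.popsize)]

    def tell(self, solutions, costs):
        # Mirrors the failure mode: a short population is an error.
        if len(solutions) != self.popsize or len(costs) != self.popsize:
            raise ValueError("tell() expects one cost per asked solution")
        i = min(range(len(costs)), key=costs.__getitem__)
        if self.best is None or costs[i] < self.best[1]:
            self.best = (solutions[i], costs[i])
```

Because invalid and duplicate decodes receive a cost instead of being removed, the lists passed to tell() always have the population's length, whatever the decoder returns.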
Update README, HOW_TO_GET_STARTED, deployment, singularity, and the mini and
cdk_oracle READMEs to describe the pure-PyTorch ARM MolMIM NIM as the way to get
/hidden, /decode, and CMA-ES on GB200/GB300 (in addition to hosted MolMIM for
generation only). Note the nest_asyncio dependency and refresh stale
"ARM must use hosted MolMIM" guidance.

ARM-native MolMIM NIM (local-arm) + Boltz-2/notebook fixes + docs

The HPC path is Apptainer/Singularity (Boltz-2 already runs there), but the ARM
MolMIM local-arm service was Docker-only, so it failed on Apptainer-only clusters.
Add molmim_arm/molmim_arm.def and an Apptainer build/run path in
run_molmim_arm.sh (apptainer build from the PyTorch base + apptainer run --nv
with weights mounted at /models), selected via OPENHACKATHON_CONTAINER_RUNTIME.
Cap mksquashfs processors (--mksquashfs-args) to avoid segfaults on many-core
ARM (Grace). Make local-arm the auto-mode default on ARM whenever a container
runtime (Docker or Apptainer/Singularity) is available, with hosted fallback.

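The auto-mode rule described above can be sketched as below. The function names are hypothetical, the mode strings `local` and `hosted` are assumed spellings (only `local-arm` appears in the commit messages), and the values accepted by OPENHACKATHON_CONTAINER_RUNTIME are assumed to be executable names. The real logic lives in the Bash service scripts.

```python
import os
import shutil


def detect_runtime(env=None, which=shutil.which):
    """Return the first usable container runtime name, or None."""
    env = os.environ if env is None else env
    forced = env.get("OPENHACKATHON_CONTAINER_RUNTIME")
    if forced:
        # Assumption: the variable holds an executable name.
        return forced if which(forced) else None
    for name in ("docker", "apptainer", "singularity"):
        if which(name):
            return name
    return None


def choose_molmim_mode(machine, runtime):
    """Pick a MolMIM mode from the CPU architecture and runtime."""
    if runtime is None:
        # No container runtime available: fall back to hosted MolMIM.
        return "hosted"
    is_arm = machine.lower() in ("aarch64", "arm64")
    return "local-arm" if is_arm else "local"
```

A caller would typically pass `platform.machine()` and the result of `detect_runtime()`. Under these assumptions, an aarch64 node with Apptainer resolves to `local-arm`, and any node without a runtime resolves to `hosted`.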
Update README, HOW_TO_GET_STARTED, deployment, singularity, and mini README to
state that local-arm builds/runs via Docker or Apptainer/Singularity and is the
auto-mode default on ARM (hosted MolMIM is the fallback), replacing the earlier
"Apptainer-only ARM falls back to hosted" guidance.

…ainer

ARM MolMIM NIM on Apptainer/Singularity + default to local-arm on ARM