Add architecture-aware NIM bootstrap #2
Draft
nil16 wants to merge 28 commits into
MolMIM has no aarch64 NIM image and BioNeMo Framework v1 is amd64-only and pre-Blackwell, so GB200/GB300 could not run MolMIM locally (no /hidden or /decode, which blocks CMA-ES guided optimization). Add molmim_arm/: a pure-PyTorch MolMIM (loads the official molmim_70m_24_3 checkpoint; no NeMo/Megatron/apex/TE) wrapped in a FastAPI service that mirrors the MolMIM NIM REST surface, plus run_molmim_arm.sh to build and run it. Wire --molmim local-arm through the service scripts and fix a CUDA_VISIBLE_DEVICES remap bug in run_nim_docker.sh for --gpus device=N (N>0).
Boltz-2 now requires sampling_steps >= 10 ("Sampling steps must be at least 10,
got 5"); bump the readiness smoke test from 5 to 10. Add nest_asyncio, which the
notebooks need to run the async Boltz-2 affinity client inside a Jupyter kernel
(otherwise the affinity cells raise "This event loop is already running").
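The event-loop error can be reproduced with the standard library alone. The sketch below is illustrative and not taken from the notebooks: `affinity_call` is a hypothetical stand-in for the async Boltz-2 affinity client, and `notebook_cell` simulates a Jupyter cell, where a loop is already running when user code executes.

```python
import asyncio


async def affinity_call():
    """Hypothetical stand-in for the async Boltz-2 affinity client call."""
    await asyncio.sleep(0)
    return "ok"


async def notebook_cell():
    """Simulate a Jupyter cell: this body runs while a loop is already running."""
    loop = asyncio.get_running_loop()
    coro = affinity_call()
    try:
        # Re-entering the running loop is what a synchronous wrapper around
        # an async client ends up doing inside a notebook kernel.
        return loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)


result = asyncio.run(notebook_cell())
print(result)
```

In the notebooks, calling `nest_asyncio.apply()` once before the affinity cells patches the loop so that this re-entrant call is allowed.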
Fix two pre-existing bugs in the CMA-ES tutorials (tutorials/04 and mini-hands-on/04): an unterminated f-string from a split print(), and a fragile optimization loop that dropped invalid/duplicate decodes and could pass fewer than mu solutions to optimizer.tell() (stochastic crash). The loop now asks the full population, penalizes invalid/duplicate molecules, and tells the actual asked solutions. Update notebook text across the setup and CDK notebooks so ARM (GB200/GB300) users know to run local MolMIM via --molmim local-arm for /hidden, /decode, and CMA-ES.
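The repaired loop shape can be sketched as follows. This is a minimal illustration under stated assumptions, not the tutorial code: `ToyOptimizer` is a hypothetical stand-in with an ask/tell interface (it is not a CMA-ES implementation), `decode` and `score` are placeholder functions, and `PENALTY` is an assumed value for a minimization objective. The point is only that every asked solution is told back, with invalid or duplicate decodes penalized instead of dropped.

```python
import random

PENALTY = 1e6  # assumed penalty for invalid or duplicate decodes (minimization)


class ToyOptimizer:
    """Hypothetical ask/tell optimizer; stands in for the real CMA-ES object."""

    def __init__(self, dim, population_size, seed=0):
        self.dim = dim
        self.population_size = population_size
        self._rng = random.Random(seed)
        self._mean = [0.0] * dim

    def ask(self):
        # Sample one candidate around the current mean.
        return [m + self._rng.gauss(0.0, 1.0) for m in self._mean]

    def tell(self, solutions):
        # Like a strict ask/tell optimizer, refuse a short population.
        if len(solutions) != self.population_size:
            raise ValueError("tell() needs one (x, value) pair per asked solution")
        best_x, _ = min(solutions, key=lambda pair: pair[1])
        self._mean = list(best_x)


def decode(latent):
    """Placeholder decoder: returns a molecule string, or None if invalid."""
    if latent[0] < -1.0:
        return None
    return "MOL%d" % round(latent[0])  # coarse rounding produces duplicates


def score(molecule):
    """Placeholder objective (lower is better)."""
    return float(len(molecule))


def run_generation(optimizer, seen):
    # Ask the full population up front.
    asked = [optimizer.ask() for _ in range(optimizer.population_size)]
    solutions = []
    for x in asked:
        molecule = decode(x)
        if molecule is None or molecule in seen:
            value = PENALTY  # penalize instead of dropping the candidate
        else:
            seen.add(molecule)
            value = score(molecule)
        solutions.append((x, value))
    # Tell exactly the solutions that were asked, so the count always matches.
    optimizer.tell(solutions)
    return solutions
```

Because penalized candidates stay in the list, the number of pairs passed to `tell()` no longer depends on how many decodes happen to be valid in a given generation.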
Update README, HOW_TO_GET_STARTED, deployment, singularity, and the mini and cdk_oracle READMEs to describe the pure-PyTorch ARM MolMIM NIM as the way to get /hidden, /decode, and CMA-ES on GB200/GB300 (in addition to hosted MolMIM for generation only). Note the nest_asyncio dependency and refresh stale "ARM must use hosted MolMIM" guidance.
ARM-native MolMIM NIM (local-arm) + Boltz-2/notebook fixes + docs
The HPC path is Apptainer/Singularity (Boltz-2 already runs there), but the ARM MolMIM local-arm service was Docker-only, so it failed on Apptainer-only clusters. Add molmim_arm/molmim_arm.def and an Apptainer build/run path in run_molmim_arm.sh (apptainer build from the PyTorch base + apptainer run --nv with weights mounted at /models), selected via OPENHACKATHON_CONTAINER_RUNTIME. Cap mksquashfs processors (--mksquashfs-args) to avoid segfaults on many-core ARM (Grace). Make local-arm the auto-mode default on ARM whenever a container runtime (Docker or Apptainer/Singularity) is available, with hosted fallback.
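The selection logic described above lives in the shell scripts; the Python sketch below only illustrates the decision order. The accepted runtime names, the probe order, and the function names are assumptions, not taken from run_molmim_arm.sh.

```python
import os
import shutil

# Assumed runtime names and probe order; the scripts may differ.
KNOWN_RUNTIMES = ("docker", "apptainer", "singularity")


def detect_runtime(env=None, which=shutil.which):
    """Return the container runtime to use, or None if none is available."""
    env = os.environ if env is None else env
    forced = env.get("OPENHACKATHON_CONTAINER_RUNTIME", "").strip().lower()
    if forced:
        # An explicit override wins, but only if the binary is on PATH.
        return forced if which(forced) else None
    for name in KNOWN_RUNTIMES:
        if which(name):
            return name
    return None


def is_arm(machine):
    """True for the machine strings commonly reported on 64-bit ARM hosts."""
    return machine.lower() in ("aarch64", "arm64")


def resolve_arm_molmim_mode(runtime):
    """Auto mode on ARM: local-arm when a runtime exists, else hosted."""
    return "local-arm" if runtime else "hosted"
```

For example, on an Apptainer-only Grace node `detect_runtime()` would return `"apptainer"` under these assumptions, and `resolve_arm_molmim_mode("apptainer")` gives `"local-arm"`; with no runtime on PATH the result is `"hosted"`.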
Update README, HOW_TO_GET_STARTED, deployment, singularity, and mini README to state that local-arm builds/runs via Docker or Apptainer/Singularity and is the auto-mode default on ARM (hosted MolMIM is the fallback), replacing the earlier "Apptainer-only ARM falls back to hosted" guidance.
…ainer ARM MolMIM NIM on Apptainer/Singularity + default to local-arm on ARM
Summary
Validation