Setup Guide

System Requirements

  • NVIDIA GPUs with Ampere architecture (RTX 30 Series, A100) or newer
  • NVIDIA driver >=570.124.06 compatible with CUDA 12.8.1
  • Linux x86-64
  • glibc>=2.35 (e.g. Ubuntu >=22.04)
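
The requirements above can be checked before installing anything. The following is a minimal sketch, not part of the repository: the helper names `version_ge` and `check` are hypothetical, it relies on GNU `sort -V`, and it assumes that "Ampere or newer" corresponds to a compute capability of 8.0 or higher as reported by `nvidia-smi`.

```shell
# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check LABEL FOUND REQUIRED: print one ok/FAIL line per requirement.
check() {
    if version_ge "$2" "$3"; then
        echo "ok:   $1 $2 (>= $3)"
    else
        echo "FAIL: $1 ${2:-not found} (need >= $3)"
    fi
}

# Architecture must be x86-64.
if [ "$(uname -m)" = "x86_64" ]; then
    echo "ok:   arch x86_64"
else
    echo "FAIL: arch $(uname -m) (need x86_64)"
fi

# getconf prints e.g. "glibc 2.35"; keep only the version number.
check "glibc" "$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')" "2.35"

# Driver version and compute capability of the first GPU only.
if command -v nvidia-smi >/dev/null 2>&1; then
    check "driver" "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" "570.124.06"
    # Assumption: Ampere or newer means compute capability >= 8.0.
    check "compute capability" "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)" "8.0"
else
    echo "FAIL: nvidia-smi not found"
fi
```

Each line of output starts with `ok:` or `FAIL:`, so a failed requirement is easy to spot before continuing with the installation.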

Installation

Install git lfs:

sudo apt install git-lfs
git lfs install

Clone the repository:

git clone git@github.com:NVIDIA-Medtech/Cosmos-H-Surgical.git
cd Cosmos-H-Surgical
git lfs pull

Install one of the following environments:

Virtual Environment

Install system dependencies:

sudo apt update && sudo apt -y install curl ffmpeg libx11-dev tree wget
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Install the package into a new environment:

cd transfer
uv python install
uv sync --extra=cu128
source .venv/bin/activate

Or, install the package into the active environment (e.g. conda):

uv sync --extra=cu128 --active --inexact

CUDA Variants:

| CUDA Version | Arguments | Notes |
| --- | --- | --- |
| CUDA 12.8 | --extra cu128 | NVIDIA Driver |
| CUDA 13.0 | --extra cu130 | NVIDIA Driver |

For DGX Spark and Jetson AGX, you must use CUDA 13.0.
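
As an illustration of how the CUDA variants map onto the install command, here is a small sketch. The `cuda_extra` helper and the `CUDA_VERSION` variable are hypothetical names introduced for this example. The script only prints the `uv sync` command so it can be reviewed before running.

```shell
# cuda_extra VERSION: print the uv extra for a supported CUDA version.
cuda_extra() {
    case "$1" in
        12.8) echo "cu128" ;;
        13.0) echo "cu130" ;;
        *) echo "unsupported CUDA version: $1" >&2; return 1 ;;
    esac
}

# Hypothetical variable; set CUDA_VERSION=13.0 on DGX Spark and Jetson AGX.
CUDA_VERSION="${CUDA_VERSION:-12.8}"

# Print the command instead of running it.
extra="$(cuda_extra "$CUDA_VERSION")" && echo "uv sync --extra=$extra"
```

For example, running the script with `CUDA_VERSION=13.0` prints `uv sync --extra=cu130`.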

Docker Container

Make sure Docker is available on your machine and the NVIDIA Container Toolkit is installed.

Build the container, running only the command that matches your GPU architecture:

# Ampere - Hopper
image_tag=$(docker build -f Dockerfile -q .)
# Blackwell
image_tag=$(docker build -f docker/nightly.Dockerfile -q .)

Run the container:

docker run -it --runtime=nvidia --ipc=host --rm -v .:/workspace -v /workspace/.venv -v /root/.cache:/root/.cache -e HF_TOKEN="$HF_TOKEN" $image_tag

Optional arguments:

  • --ipc=host: Use host system's shared memory, since parallel torchrun consumes a large amount of shared memory. If not allowed by security policy, increase --shm-size (documentation).
  • -v /root/.cache:/root/.cache: Mount host cache to avoid re-downloading cache entries.
  • -e HF_TOKEN="$HF_TOKEN": Set Hugging Face token to avoid re-authenticating.
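
If `--ipc=host` is not permitted, the run command can use `--shm-size` instead. The sketch below prints such a command for review. The `docker_run_cmd` helper and the `SHM_SIZE` variable are hypothetical, and the `16g` default is an assumed placeholder, not a value recommended by the project. Size it to your workload.

```shell
# Hypothetical default; the right size depends on how many torchrun workers you launch.
SHM_SIZE="${SHM_SIZE:-16g}"

# docker_run_cmd IMAGE: print the run command with --shm-size in place of --ipc=host.
docker_run_cmd() {
    printf '%s\n' "docker run -it --runtime=nvidia --shm-size=$SHM_SIZE --rm -v .:/workspace -v /workspace/.venv -v /root/.cache:/root/.cache -e HF_TOKEN=\"\$HF_TOKEN\" $1"
}

# Pass the literal text $image_tag so the printed command can be pasted after the build step.
docker_run_cmd '$image_tag'
```

The printed command is identical to the one above except for the shared-memory flag, so the other optional arguments keep their meaning.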

If you get the error "docker: Error response from daemon: unknown or invalid runtime name: nvidia", configure Docker to use the NVIDIA runtime:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
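
To confirm that the runtime was registered after the restart, one option is to inspect the runtimes reported by `docker info`. This is a sketch under the assumption that the JSON output lists each runtime as a quoted key. The `has_nvidia_runtime` helper is a hypothetical name.

```shell
# has_nvidia_runtime: succeeds when the JSON on stdin has an "nvidia" runtime key.
has_nvidia_runtime() {
    grep -q '"nvidia":'
}

if command -v docker >/dev/null 2>&1; then
    if docker info --format '{{json .Runtimes}}' 2>/dev/null | has_nvidia_runtime; then
        echo "nvidia runtime is registered"
    else
        echo "nvidia runtime not found; re-run the nvidia-ctk command"
    fi
else
    echo "docker not found"
fi
```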

Downloading Checkpoints

  1. Get a Hugging Face Access Token with Read permission
  2. Install Hugging Face CLI: uv tool install -U "huggingface_hub[cli]"
  3. Log in: hf auth login
  4. Accept the model license agreements on Hugging Face:

Checkpoints are automatically downloaded during inference and post-training. To modify the checkpoint cache location, set the HF_HOME environment variable.
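
For example, the cache can be moved to a different directory before running inference. The path below is a hypothetical placeholder. Any writable directory with enough free space for the checkpoints should work.

```shell
# Hypothetical location; keeps an existing HF_HOME if one is already set.
export HF_HOME="${HF_HOME:-$HOME/hf-cache}"
mkdir -p "$HF_HOME"
echo "Hugging Face cache: $HF_HOME"
```

The variable only applies to the current shell session. Add the `export` line to your shell profile to make the change persistent.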