title: Model Caching
subtitle: Download models once and share across all pods in a Kubernetes cluster

Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports two approaches to ensure models are downloaded once and shared across the cluster.

Option 1: PVC + Download Job (Recommended)

The simplest approach: create a shared PVC, run a one-time Job to download the model, then mount the PVC in your DynamoGraphDeployment.

This is the pattern used by all Dynamo recipes today.

Step 1: Create a Shared PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi

The `ReadWriteMany` access mode is required so that multiple pods can mount the PVC simultaneously. Ensure your storage class supports RWX (e.g., NFS, CephFS, or a cloud provider's shared file system).
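
Before running the download Job, it can help to confirm that the claim is bound and offers the access mode you requested. The sketch below assumes the PVC from Step 1 lives in the current namespace. The function name `check_model_cache_pvc` is illustrative and not part of Dynamo.

```shell
#!/bin/sh
# Illustrative helper (not part of Dynamo): succeed only if the PVC
# is Bound and its volume offers ReadWriteMany.
check_model_cache_pvc() {
  pvc="${1:-model-cache}"
  phase=$(kubectl get pvc "$pvc" -o jsonpath='{.status.phase}') || return 1
  modes=$(kubectl get pvc "$pvc" -o jsonpath='{.status.accessModes}') || return 1
  echo "pvc=$pvc phase=$phase accessModes=$modes"
  [ "$phase" = "Bound" ] || return 1
  case "$modes" in
    *ReadWriteMany*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Call it as `check_model_cache_pvc model-cache`. Note that a claim whose storage class uses `WaitForFirstConsumer` volume binding stays `Pending` until the first pod mounts it, so a `Pending` result at this stage is not necessarily an error.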

Step 2: Download the Model

apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install huggingface_hub hf_transfer
              # HF_HOME (set below) points the cache at the mounted PVC.
              HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
                "$MODEL_NAME" --revision "$MODEL_REVISION"
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-0.6B"
            - name: MODEL_REVISION
              value: "main"
            - name: HF_HOME
              value: /cache/huggingface
          envFrom:
            - secretRef:
                name: hf-token-secret
          volumeMounts:
            - name: model-cache
              mountPath: /cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
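
One way to submit the Job above and wait for it to finish is sketched here. It assumes the manifest was saved as `model-download.yaml` (a hypothetical file name) and that a 30-minute timeout is long enough for your model; both are parameters you can change. The function name is illustrative.

```shell
#!/bin/sh
# Illustrative helper: submit the download Job and block until it completes.
# "model-download.yaml" is an assumed file name for the manifest above.
run_model_download() {
  manifest="${1:-model-download.yaml}"
  timeout="${2:-30m}"
  kubectl apply -f "$manifest" || return 1
  if ! kubectl wait --for=condition=complete job/model-download --timeout="$timeout"; then
    # Show recent downloader output to help diagnose auth or network failures.
    kubectl logs job/model-download --tail=50
    return 1
  fi
}
```

Because the wait condition is `complete`, a Job that fails outright is only reported once the timeout expires, so choose a timeout that reflects the expected download time.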

Find the Snapshot Path

After the Job completes, the model is stored in HuggingFace's cache layout:

hub/models--<org>--<model>/snapshots/<commit-hash>/

For example, meta-llama/Llama-3.1-70B-Instruct becomes:

hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/9d3b8e0f71f8c1e0f9b7c2a3d4e5f6a7b8c9d0e1/
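
The mapping from model ID to directory name is mechanical: each `/` in the ID becomes `--`, prefixed with `models--`. A small sketch of that rule follows. The function name `hf_snapshot_dir` is hypothetical, and the commit hash must still be looked up as described in this section.

```shell
#!/bin/sh
# Illustrative helper: build the snapshot path, relative to HF_HOME,
# from a Hub model ID and a commit hash.
hf_snapshot_dir() {
  model_id="$1"   # e.g. Qwen/Qwen3-0.6B
  commit="$2"     # full commit hash of the downloaded revision
  # Replace every "/" in the model ID with "--".
  folder=$(printf '%s' "$model_id" | sed 's#/#--#g')
  printf 'hub/models--%s/snapshots/%s/\n' "$folder" "$commit"
}
```

For example, `hf_snapshot_dir Qwen/Qwen3-0.6B "<commit-hash>"` prints `hub/models--Qwen--Qwen3-0.6B/snapshots/<commit-hash>/`.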

To find the exact commit hash after the download Job completes:

kubectl run find-snapshot --rm -it --image=busybox --restart=Never \
  --overrides='{
    "spec": {
      "volumes": [{"name": "c", "persistentVolumeClaim": {"claimName": "model-cache"}}],
      "containers": [{
        "name": "f", "image": "busybox",
        "command": ["find", "/c/hub", "-mindepth", "3", "-maxdepth", "3", "-type", "d"],
        "volumeMounts": [{"name": "c", "mountPath": "/c"}]
      }]
    }
  }'

Alternatively, look up the commit hash on the HuggingFace Hub model page under Files and versions.

You need this path for the pvcModelPath field in a DynamoGraphDeploymentRequest (DGDR) spec (see the Model Caching section of the Model Deployment Guide).

Step 3: Mount in DynamoGraphDeployment

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  pvcs:
    - create: false
      name: model-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface

All VllmWorker pods that mount model-cache now read from the shared cache instead of each downloading the model. If you also want the frontend to reuse the tokenizer and config files, mount the same PVC there too.

Compilation Cache

For vLLM, you can also cache compiled artifacts (such as torch.compile output) with a second PVC:

spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
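
Both entries use `create: false`, which is read here as meaning that the PVC must already exist before the deployment is applied (an assumption based on Step 1, where the claim is created by hand). Under that assumption, the second claim could be created as sketched below. The 50Gi default is an arbitrary illustrative size, not a documented requirement, and the function name is hypothetical.

```shell
#!/bin/sh
# Illustrative helper: create the compilation-cache PVC ahead of time.
# The 50Gi default is an assumed size; adjust it for your workload.
create_compilation_cache_pvc() {
  size="${1:-50Gi}"
  kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: ${size}
EOF
}
```

The claim mirrors the `model-cache` PVC from Step 1, including `ReadWriteMany`, so that every worker pod can mount it.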

Option 2: Model Express (P2P Distribution)

Model Express is a P2P model distribution server that downloads a model once and serves it to all pods over the network. It integrates directly with vLLM's weight loading pipeline via custom load formats.

How It Works

  1. A Model Express server runs in the cluster and caches model weights
  2. Workers use --load-format=mx-source or --load-format=mx-target to load from the server
  3. The K8s operator injects MODEL_EXPRESS_URL into all pods automatically

Setup

Install with Dynamo Platform:

helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"

Configure workers to use Model Express:

services:
  VllmWorker:
    envs:
      - name: VLLM_LOAD_FORMAT
        value: mx-target

When MODEL_EXPRESS_URL is configured in the operator, it is automatically injected as an environment variable into all component pods. Workers using mx-source or mx-target load formats will connect to the server for model weight distribution.
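
To confirm that the injection took effect, you can print the variable from inside a running pod. This is a minimal sketch: the pod name is supplied by the caller, the pod is assumed to be in the current namespace, and the function name `check_model_express_url` is illustrative.

```shell
#!/bin/sh
# Illustrative helper: print MODEL_EXPRESS_URL as seen inside a pod.
# The pod name is supplied by the caller; no naming scheme is assumed.
check_model_express_url() {
  pod="$1"
  url=$(kubectl exec "$pod" -- printenv MODEL_EXPRESS_URL) || {
    echo "MODEL_EXPRESS_URL is not set in $pod" >&2
    return 1
  }
  echo "$url"
}
```

With the Helm values shown above, the expected output is the URL passed to `dynamo-operator.modelExpressURL`.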

When to Use Model Express

| Scenario | Recommended Approach |
|---|---|
| Small cluster, simple setup | PVC + Download Job |
| Large cluster, many nodes | Model Express |
| Models already on shared storage (NFS) | PVC |
| Frequent model updates across fleet | Model Express |

See Also