| title | Model Caching |
|---|---|
| subtitle | Download models once and share across all pods in a Kubernetes cluster |
Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports two approaches to ensure models are downloaded once and shared across the cluster.
PVC + Download Job is the simplest approach: create a shared PersistentVolumeClaim (PVC), run a one-time Job to download the model into it, then mount the PVC in your DynamoGraphDeployment.
This is the pattern used by all Dynamo recipes today.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install huggingface_hub hf_transfer
              HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
                $MODEL_NAME --revision $MODEL_REVISION
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-0.6B"
            - name: MODEL_REVISION
              value: "main"
            - name: HF_HOME
              value: /cache/huggingface
          envFrom:
            - secretRef:
                name: hf-token-secret
          volumeMounts:
            - name: model-cache
              mountPath: /cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```

After the Job completes, the model is stored in HuggingFace's cache layout:
`hub/models--<org>--<model>/snapshots/<commit-hash>/`

For example, `meta-llama/Llama-3.1-70B-Instruct` becomes:

`hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/9d3b8e0f71f8c1e0f9b7c2a3d4e5f6a7b8c9d0e1/`
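The mapping from a model ID to its snapshot directory is mechanical, so it can be scripted once the commit hash is known. The following is a minimal sketch of that mapping as described above; `snapshot_path` is a hypothetical helper name, not part of Dynamo or `huggingface_hub`.

```python
from pathlib import PurePosixPath


def snapshot_path(model_id: str, commit_hash: str) -> str:
    """Return the cache-relative snapshot directory for a Hub model."""
    # The cache layout replaces "/" in the model ID with "--"
    # and prefixes the directory name with "models--".
    repo_dir = "models--" + model_id.replace("/", "--")
    return str(PurePosixPath("hub") / repo_dir / "snapshots" / commit_hash) + "/"


if __name__ == "__main__":
    print(snapshot_path(
        "meta-llama/Llama-3.1-70B-Instruct",
        "9d3b8e0f71f8c1e0f9b7c2a3d4e5f6a7b8c9d0e1",
    ))
```

The result is relative to the cache root, so prepend the path at which the PVC is mounted in the pod that reads it.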
To find the exact commit hash after the download Job completes:
```bash
kubectl run find-snapshot --rm -it --image=busybox --restart=Never \
  --overrides='{
    "spec": {
      "volumes": [{"name": "c", "persistentVolumeClaim": {"claimName": "model-cache"}}],
      "containers": [{
        "name": "f", "image": "busybox",
        "command": ["find", "/c/hub", "-mindepth", "3", "-maxdepth", "3", "-type", "d"],
        "volumeMounts": [{"name": "c", "mountPath": "/c"}]
      }]
    }
  }'
```

Alternatively, look up the commit hash on the HuggingFace Hub model page under Files and versions.
You need this path for the `pvcModelPath` field in a DGDR spec (see the Model Caching section of the Model Deployment Guide).
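If Python is available in a pod that already mounts the PVC, the same lookup can be done without launching a separate busybox pod. This is a sketch under assumptions: `find_snapshots` is a hypothetical helper, and the mount point in the usage section is only an example that should match your own volume mount.

```python
from pathlib import Path


def find_snapshots(cache_root: str) -> list[str]:
    """List snapshot directories under <cache_root>/hub, sorted by path."""
    hub = Path(cache_root) / "hub"
    # Match hub/models--<org>--<model>/snapshots/<commit-hash> and keep
    # only directories; a missing hub directory yields an empty list.
    matches = hub.glob("models--*/snapshots/*")
    return sorted(str(p) for p in matches if p.is_dir())


if __name__ == "__main__":
    # Assumed mount point; adjust to where the PVC is mounted in your pod.
    for path in find_snapshots("/home/dynamo/.cache/huggingface"):
        print(path)
```

Unlike the `find` command, this sketch only matches directories under `snapshots/`, so other directories at the same depth are ignored.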
Next, reference the PVC in your DynamoGraphDeployment and mount it in the worker service:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  pvcs:
    - create: false
      name: model-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
```

All `VllmWorker` pods that mount `model-cache` now read from the shared cache, avoiding per-pod worker downloads. If you also want the frontend to reuse tokenizer and config files, mount the same PVC there too.
For vLLM, you can also cache compiled artifacts (CUDA graphs, etc.) with a second PVC:
```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
```

Model Express is a P2P model distribution server that downloads a model once and serves it to all pods over the network. It integrates directly with vLLM's weight loading pipeline via custom load formats.
- A Model Express server runs in the cluster and caches model weights
- Workers use `--load-format=mx-source` or `--load-format=mx-target` to load from the server
- The K8s operator injects `MODEL_EXPRESS_URL` into all pods automatically
Install with Dynamo Platform:
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```

Configure workers to use Model Express:
```yaml
services:
  VllmWorker:
    envs:
      - name: VLLM_LOAD_FORMAT
        value: mx-target
```

When `modelExpressURL` is set on the operator, its value is automatically injected into all component pods as the `MODEL_EXPRESS_URL` environment variable. Workers using the `mx-source` or `mx-target` load formats connect to that server for model weight distribution.
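When debugging a worker that fails to reach the server, it can help to confirm that the variable was injected and points where you expect. The following is only a diagnostic sketch: `model_express_endpoint` is a hypothetical helper, and the only thing it assumes about Dynamo is the `MODEL_EXPRESS_URL` variable described above.

```python
import os
from urllib.parse import urlparse


def model_express_endpoint(env=None):
    """Return (host, port) parsed from MODEL_EXPRESS_URL, or None if unset."""
    env = os.environ if env is None else env
    url = env.get("MODEL_EXPRESS_URL")
    if not url:
        return None
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"unexpected MODEL_EXPRESS_URL: {url!r}")
    # Fall back to the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


if __name__ == "__main__":
    endpoint = model_express_endpoint()
    print("MODEL_EXPRESS_URL not set" if endpoint is None else endpoint)
```

A result of `None` inside a component pod suggests the operator was installed without the `modelExpressURL` setting.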
| Scenario | Recommended Approach |
|---|---|
| Small cluster, simple setup | PVC + Download Job |
| Large cluster, many nodes | Model Express |
| Models already on shared storage (NFS) | PVC |
| Frequent model updates across fleet | Model Express |
- Managing Models with DynamoModel — declarative model management CRD
- Detailed Installation Guide — Helm chart configuration including Model Express
- LoRA Adapters — dynamic adapter loading (separate from base model caching)