# Decouple storageClass from recipe overlays — make it a deployer concern #503
## Motivation
Recipes define what to deploy (components, versions, constraints). Bundles define how to deploy it (node selectors, tolerations, storageClass). StorageClass is a cluster infrastructure detail — it depends on what the target cluster has provisioned, not on the recipe's component logic. This is the same reasoning behind --accelerated-node-selector and --system-node-toleration, which are also cluster-specific deployment details that live at bundle time, not in recipes.
Benefits of decoupling:
- Same recipe, multiple clusters — One EKS recipe works across clusters with different storage setups (gp2 vs gp3 vs io2) without maintaining separate overlays for each
- Unblocks dynamo mixin — The only thing preventing a shared `platform-dynamo` mixin is the per-service storageClass. Moving it to bundle time makes the component definition service-agnostic
- Fewer files to maintain — StorageClass is currently duplicated across 8+ overlay files. At bundle time, it's one flag
- User knows their cluster best — The deployer knows which storageClasses exist and which tier fits their workload. The recipe author shouldn't have to guess
Important caveat: StorageClass is not purely deployer trivia. Different components can legitimately need different storage tiers (e.g., etcd wants low-latency IO, NATS wants throughput, Prometheus wants cost-efficient capacity). A --storage-class flag should be a convenience default, not the only abstraction. Per-component control via --set remains essential for production-tuned deployments.
## Problem
StorageClass names are hardcoded in recipe overlay files, scattered across two layers:
- Service-level overlays (`eks.yaml#L44-L58`, `aks.yaml`, `gke-cos.yaml`) — for the Prometheus PVC
- Dynamo leaf overlays (`h100-eks-ubuntu-inference-dynamo.yaml#L64-L73` and 4 other files) — for etcd + NATS persistent storage
Current mapping:
| CSP | StorageClass | Components Using It |
|---|---|---|
| EKS | `gp2` | Prometheus, Dynamo etcd, Dynamo NATS |
| AKS | `managed-csi` | Prometheus, Dynamo etcd, Dynamo NATS |
| GKE | `standard-rwo` | Prometheus, Dynamo etcd, Dynamo NATS |
| Kind | `standard` | Dynamo etcd, Dynamo NATS |
This has several issues:
- Scattered configuration — Changing storageClass requires editing multiple overlay files
- One-size-fits-all — Each CSP supports multiple storage tiers (e.g., EKS has `gp2`, `gp3`, `io1`, `io2`, `st1`, `sc1`), but AICR hardcodes one per CSP. Different components may benefit from different tiers (e.g., etcd wants low-latency `io2`, NATS wants throughput-optimized `gp3`, Prometheus wants cost-efficient `st1`)
- Blocks further mixin extraction — Dynamo overlays can't be extracted into a shared mixin because storageClass varies per service
- Recipe authors must know cluster storage setup — This knowledge belongs to the deployer, not the recipe author
## Proposal
Make storageClass a CLI override at bundle time. The --storage-class flag provides a convenience default; --set retains per-component control for production-tuned deployments that need different tiers per component.
### Short-term (no code changes needed)
`--set` already supports per-component storageClass overrides today via the existing override machinery (`pkg/component/overrides.go#L74-L98`, `pkg/bundler/bundler.go#L454-L473`):

```shell
aicr bundle -r recipe.yaml \
  --set dynamoplatform:etcd.persistence.storageClass=gp3 \
  --set dynamoplatform:nats.config.jetstream.fileStore.pvc.storageClassName=gp3 \
  --set kubeprometheustack:prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3
```

This works but requires knowing deeply nested Helm value paths.
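Each dotted `--set` path corresponds to a nested key in the component's Helm values. A minimal sketch of that path-setting mechanic (`set_path` is a hypothetical name for illustration, not AICR's actual implementation):

```python
def set_path(values: dict, dotted: str, value) -> dict:
    """Set a dotted Helm-style path (e.g. "etcd.persistence.storageClass")
    inside a nested dict, creating intermediate maps as needed."""
    keys = dotted.split(".")
    node = values
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return values

# The etcd override from the example above becomes:
vals = set_path({}, "etcd.persistence.storageClass", "gp3")
# → {"etcd": {"persistence": {"storageClass": "gp3"}}}
```

The deep nesting is exactly why the raw `--set` form is usable but unfriendly: the user must reproduce the full path from each chart's values schema.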
### Medium-term (new convenience flag)
Add a `--storage-class` flag that sets a default for all components needing persistent storage:

```shell
# Simple: one class for all components (covers 90% of deployments)
aicr bundle -r recipe.yaml --storage-class gp3

# Override specific component if needed
aicr bundle -r recipe.yaml --storage-class gp3 \
  --set dynamoplatform:etcd.persistence.storageClass=io2
```

Implementation approach: The `--storage-class` flag needs to know which Helm paths to mutate for each component. The cleanest model is the registry-driven approach already used for node scheduling (`registry.yaml#L20-L36`), where each component declares its storage paths declaratively in the registry rather than bespoke if/else logic in the bundler:
```yaml
# Example registry.yaml extension
- name: dynamo-platform
  storagePaths:
    - etcd.persistence.storageClass
    - nats.config.jetstream.fileStore.pvc.storageClassName
```

This keeps the bundler generic and lets new components declare their storage paths without code changes.
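The intended precedence (flag as convenience default, explicit `--set` wins) can be sketched as follows; `apply_storage_default`, the registry dict shape, and the `component:path` key format are assumptions for illustration, not AICR's actual code:

```python
def apply_storage_default(default_class, registry, user_sets):
    """Fill each component's declared storagePaths with the --storage-class
    default, unless the user already overrode that exact path via --set."""
    merged = dict(user_sets)  # keys look like "component:helm.value.path"
    for component in registry:
        for path in component.get("storagePaths", []):
            merged.setdefault(f"{component['name']}:{path}", default_class)
    return merged

registry = [{
    "name": "dynamo-platform",
    "storagePaths": [
        "etcd.persistence.storageClass",
        "nats.config.jetstream.fileStore.pvc.storageClassName",
    ],
}]

# --storage-class gp3, with an explicit --set for etcd:
result = apply_storage_default(
    "gp3", registry,
    {"dynamo-platform:etcd.persistence.storageClass": "io2"},
)
# etcd keeps io2; NATS falls back to the gp3 default
```

Because defaults are filled via `setdefault`, an explicit `--set` for the same path always takes priority, which preserves the per-component control the caveat above calls essential.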
### Future direction (not in scope)
Snapshot-driven storage selection — AICR does not currently model "storage intent" or choose among discovered storageClasses semantically. Detecting available classes from a cluster and mapping component requirements to the right tier is a separate design problem, not a direct alternative to the CLI-flag proposal. Worth exploring later if storage requirements grow more complex.
## Alternatives considered
| Option | Pros | Cons |
|---|---|---|
| Registry-driven `--storage-class` flag (proposed) | Declarative, follows existing node-scheduling pattern | Requires registry schema extension |
| Registry defaults per service — registry.yaml maps service→storageClass | Single source of truth | Still hardcoded, just in a different file |
| Keep as-is — hardcoded per overlay | Works today | Scattered, rigid, blocks mixin extraction |