Decouple storageClass from recipe overlays — make it a deployer concern #503

@yuanchen8911

Description


Motivation

Recipes define what to deploy (components, versions, constraints). Bundles define how to deploy it (node selectors, tolerations, storageClass). StorageClass is a cluster infrastructure detail — it depends on what the target cluster has provisioned, not on the recipe's component logic. This is the same reasoning behind --accelerated-node-selector and --system-node-toleration, which are also cluster-specific deployment details that live at bundle time, not in recipes.

Benefits of decoupling:

  1. Same recipe, multiple clusters — One EKS recipe works across clusters with different storage setups (gp2 vs gp3 vs io2) without maintaining separate overlays for each
  2. Unblocks dynamo mixin — The only thing preventing a shared platform-dynamo mixin is the per-service storageClass. Moving it to bundle time makes the component definition service-agnostic
  3. Fewer files to maintain — StorageClass is currently duplicated across 8+ overlay files. At bundle time, it's one flag
  4. User knows their cluster best — The deployer knows which storageClasses exist and which tier fits their workload. The recipe author shouldn't have to guess

Important caveat: StorageClass is not purely deployer trivia. Different components can legitimately need different storage tiers (e.g., etcd wants low-latency IO, NATS wants throughput, Prometheus wants cost-efficient capacity). A --storage-class flag should be a convenience default, not the only abstraction. Per-component control via --set remains essential for production-tuned deployments.

Problem

StorageClass names are hardcoded in recipe overlay files, scattered across two layers:

  1. Service-level overlays (eks.yaml#L44-L58, aks.yaml, gke-cos.yaml) — for Prometheus PVC
  2. Dynamo leaf overlays (h100-eks-ubuntu-inference-dynamo.yaml#L64-L73 and 4 other files) — for etcd + NATS persistent storage

Current mapping:

| CSP | StorageClass | Components using it |
|------|----------------|--------------------------------------|
| EKS | `gp2` | Prometheus, Dynamo etcd, Dynamo NATS |
| AKS | `managed-csi` | Prometheus, Dynamo etcd, Dynamo NATS |
| GKE | `standard-rwo` | Prometheus, Dynamo etcd, Dynamo NATS |
| Kind | `standard` | Dynamo etcd, Dynamo NATS |

This has several issues:

  • Scattered configuration — Changing storageClass requires editing multiple overlay files
  • One-size-fits-all — Each CSP supports multiple storage tiers (e.g., EKS has gp2, gp3, io1, io2, st1, sc1), but AICR hardcodes one per CSP. Different components may benefit from different tiers (e.g., etcd wants low-latency io2, NATS wants throughput-optimized gp3, Prometheus wants cost-efficient st1)
  • Blocks further mixin extraction — Dynamo overlays can't be extracted into a shared mixin because storageClass varies per service
  • Recipe authors must know cluster storage setup — This knowledge belongs to the deployer, not the recipe author

Proposal

Make storageClass a CLI override at bundle time. The --storage-class flag provides a convenience default; --set retains per-component control for production-tuned deployments that need different tiers per component.

Short-term (no code changes needed)

--set already supports per-component storageClass overrides today via the existing override machinery (pkg/component/overrides.go#L74-L98, pkg/bundler/bundler.go#L454-L473):

```shell
aicr bundle -r recipe.yaml \
  --set dynamoplatform:etcd.persistence.storageClass=gp3 \
  --set dynamoplatform:nats.config.jetstream.fileStore.pvc.storageClassName=gp3 \
  --set kubeprometheustack:prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3
```

This works but requires knowing deeply nested Helm value paths.
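Under the hood, each `--set` path has to be threaded into the component's nested Helm values. A minimal Go sketch of that mechanic, with hypothetical helper names (AICR's actual machinery lives in `pkg/component/overrides.go`):

```go
package main

import (
	"fmt"
	"strings"
)

// setNested walks a dotted path like "etcd.persistence.storageClass"
// through a nested map of Helm values, creating intermediate maps as
// needed, and sets the final key. Illustrative only, not AICR's code.
func setNested(values map[string]any, path, value string) {
	keys := strings.Split(path, ".")
	cur := values
	for _, k := range keys[:len(keys)-1] {
		next, ok := cur[k].(map[string]any)
		if !ok {
			next = map[string]any{}
			cur[k] = next
		}
		cur = next
	}
	cur[keys[len(keys)-1]] = value
}

func main() {
	vals := map[string]any{}
	setNested(vals, "etcd.persistence.storageClass", "gp3")
	// The value lands three maps deep, mirroring the Helm chart layout.
	fmt.Println(vals["etcd"].(map[string]any)["persistence"].(map[string]any)["storageClass"]) // prints "gp3"
}
```

The sketch shows why raw `--set` is workable but unfriendly: the deployer carries the burden of knowing each chart's exact nesting.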

Medium-term (new convenience flag)

Add a --storage-class flag that sets a default for all components needing persistent storage:

```shell
# Simple: one class for all components (covers 90% of deployments)
aicr bundle -r recipe.yaml --storage-class gp3

# Override a specific component if needed
aicr bundle -r recipe.yaml --storage-class gp3 \
  --set dynamoplatform:etcd.persistence.storageClass=io2
```

Implementation approach: the --storage-class flag needs to know which Helm paths to mutate for each component. The cleanest model is the registry-driven approach already used for node scheduling (registry.yaml#L20-L36): each component declares its storage paths in the registry, instead of the bundler carrying bespoke per-component if/else logic:

```yaml
# Example registry.yaml extension
- name: dynamo-platform
  storagePaths:
    - etcd.persistence.storageClass
    - nats.config.jetstream.fileStore.pvc.storageClassName
```

This keeps the bundler generic and lets new components declare their storage paths without code changes.
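To make the precedence contract concrete, here is a hedged Go sketch of how the bundler might merge registry-declared storage paths with explicit `--set` overrides. Types and function names are illustrative, not AICR's real API:

```go
package main

import "fmt"

// componentEntry models one registry.yaml entry with the proposed
// storagePaths extension. Hypothetical type, for illustration only.
type componentEntry struct {
	Name         string
	StoragePaths []string
}

// applyStorageClass computes the final path→value overrides for one
// component: an explicit --set always wins; the --storage-class flag
// fills in every remaining declared storage path as a default.
func applyStorageClass(c componentEntry, storageClass string, explicit map[string]string) map[string]string {
	out := map[string]string{}
	for _, p := range c.StoragePaths {
		if v, ok := explicit[p]; ok {
			out[p] = v // per-component --set takes precedence
			continue
		}
		if storageClass != "" {
			out[p] = storageClass // convenience default from --storage-class
		}
	}
	return out
}

func main() {
	dynamo := componentEntry{
		Name: "dynamo-platform",
		StoragePaths: []string{
			"etcd.persistence.storageClass",
			"nats.config.jetstream.fileStore.pvc.storageClassName",
		},
	}
	// Simulates: --storage-class gp3 --set dynamoplatform:etcd.persistence.storageClass=io2
	got := applyStorageClass(dynamo, "gp3", map[string]string{
		"etcd.persistence.storageClass": "io2",
	})
	fmt.Println(got["etcd.persistence.storageClass"])                               // prints "io2"
	fmt.Println(got["nats.config.jetstream.fileStore.pvc.storageClassName"])        // prints "gp3"
}
```

Keeping the precedence in one pure function makes the "flag as default, --set as override" contract trivial to unit-test, and new components only touch registry.yaml.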

Future direction (not in scope)

Snapshot-driven storage selection — AICR does not currently model "storage intent" or choose among discovered storageClasses semantically. Detecting available classes from a cluster and mapping component requirements to the right tier is a separate design problem, not a direct alternative to the CLI-flag proposal. Worth exploring later if storage requirements grow more complex.

Alternatives considered

| Option | Pros | Cons |
|--------|------|------|
| Registry-driven `--storage-class` flag (proposed) | Declarative; follows existing node-scheduling pattern | Requires registry schema extension |
| Registry defaults per service (`registry.yaml` maps service → storageClass) | Single source of truth | Still hardcoded, just in a different file |
| Keep as-is (hardcoded per overlay) | Works today | Scattered, rigid, blocks mixin extraction |
