
CLI Architecture

The aicr CLI provides command-line access to AICR configuration management capabilities.

Overview

The CLI provides a four-step workflow for optimizing GPU infrastructure, plus a query command for inspecting hydrated recipe values:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Snapshot   │─────▶│    Recipe    │─────▶│   Validate   │─────▶│    Bundle    │
└──────────────┘      └──────────────┘      └──────────────┘      └──────────────┘
   Capture system      Generate optimized    Check cluster         Create deployment
   configuration        recommendations       compatibility         artifacts
                              │
                        ┌─────┴──────┐
                        │   Query    │
                        └────────────┘
                        Extract hydrated
                        config values

Step 1: Snapshot Command

Captures system configuration:

  • Operating system: grub, kmod, sysctl, /etc/os-release
  • SystemD services: containerd, docker, kubelet (service state and configuration)
  • Kubernetes: API server version, container images, ClusterPolicy custom resource
  • GPU hardware: driver version, CUDA libraries, MIG configuration, device properties
  • Node topology: cluster-wide taints and labels

Output destinations:

  • File: --output system.yaml (local filesystem)
  • Stdout: Default (can be piped to other commands)
  • ConfigMap: --output cm://namespace/name (Kubernetes ConfigMap using Kubernetes API)

Agent deployment:

A Kubernetes Job runs on GPU nodes and writes the snapshot to a ConfigMap via the Kubernetes API. It requires a ServiceAccount with ConfigMap create/update permissions (a Role in the target namespace) and does not require a PersistentVolume.

Step 2: Recipe Command

Generates optimized configuration recipes with two modes:

  • Query Mode: Direct recipe generation from system parameters (OS, GPU, K8s, etc.)
  • Snapshot Mode: Analyzes captured snapshots and generates tailored recipes based on workload intent (training/inference)

Input Options:

  • Query parameters: --os ubuntu --gpu gb200 --service eks (direct recipe generation)
  • Snapshot file: --snapshot system.yaml (analyze captured snapshot)
  • ConfigMap: --snapshot cm://namespace/name (read from Kubernetes)

Output Options:

  • File: --output recipe.yaml (write to file)
  • Stdout: Default behavior (pipe to bundle command)
  • ConfigMap: --output cm://namespace/name (store in Kubernetes)

Step 3: Validate Command

Validates recipe constraints against actual system measurements from a snapshot.

Input sources:

  • Recipe file: --recipe recipe.yaml (local filesystem)
  • Recipe URL: --recipe https://example.com/recipe.yaml (HTTP/HTTPS)
  • Recipe ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)
  • Snapshot file: --snapshot snapshot.yaml (local filesystem)
  • Snapshot ConfigMap: --snapshot cm://namespace/name (Kubernetes ConfigMap)

Constraint format:

Constraints use fully qualified measurement paths: {Type}.{Subtype}.{Key}

  • K8s.server.version - Kubernetes server version
  • OS.release.ID - Operating system identifier
  • OS.release.VERSION_ID - OS version
  • OS.sysctl./proc/sys/kernel/osrelease - Kernel version

Supported operators:

  • >= 1.30 - Greater than or equal (version comparison)
  • <= 1.33 - Less than or equal (version comparison)
  • > 1.30, < 2.0 - Strict greater/less than (version comparison)
  • == ubuntu, != rhel - Equality operators
  • ubuntu - Exact string match (no operator)
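The operator semantics above can be sketched in Go. This is a minimal illustration, not the actual aicr validator; the function names (`evalConstraint`, `compareVersions`) are assumptions, and real version parsing in aicr may be precision-aware.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares dotted numeric versions segment by segment.
// Missing segments are treated as zero (so "1.2" == "1.2.0").
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var ai, bi int
		if i < len(as) {
			ai, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bi, _ = strconv.Atoi(bs[i])
		}
		if ai != bi {
			if ai < bi {
				return -1
			}
			return 1
		}
	}
	return 0
}

// evalConstraint applies an operator-prefixed expected value (">= 1.30",
// "== ubuntu", or a bare string for exact match) to an actual value.
func evalConstraint(expected, actual string) bool {
	expected = strings.TrimSpace(expected)
	// Two-character operators must be checked before ">" and "<".
	for _, op := range []string{">=", "<=", "==", "!=", ">", "<"} {
		if strings.HasPrefix(expected, op) {
			want := strings.TrimSpace(strings.TrimPrefix(expected, op))
			switch op {
			case "==":
				return actual == want
			case "!=":
				return actual != want
			}
			c := compareVersions(actual, want)
			switch op {
			case ">=":
				return c >= 0
			case "<=":
				return c <= 0
			case ">":
				return c > 0
			}
			return c < 0 // "<"
		}
	}
	return actual == expected // no operator: exact string match
}

func main() {
	fmt.Println(evalConstraint(">= 1.30", "1.31")) // true
	fmt.Println(evalConstraint("ubuntu", "rhel"))  // false
}
```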

Output:

  • Validation result with summary (passed/failed/skipped counts)
  • Individual constraint results with expected vs actual values
  • Status: pass, fail, or partial (some skipped)

CI/CD integration:

By default, the command exits with non-zero status when constraints fail (ideal for CI/CD). To run in informational mode without failing:

aicr validate -r recipe.yaml -s cm://gpu-operator/aicr-snapshot --fail-on-error=false

Step 4: Bundle Command

Generates deployment artifacts from recipes:

  • Helm values files (values.yaml)
  • Kubernetes manifests (ClusterPolicy, NICClusterPolicy, etc.)
  • SHA256 checksum file
  • README documentation (generated at deployer level, not by component bundlers)

Input sources:

  • Recipe file: --recipe recipe.yaml (local filesystem)
  • ConfigMap: --recipe cm://namespace/name (Kubernetes ConfigMap)

Output: Local directory only. ConfigMap output is not supported for bundles.

Current bundlers:

  • GPU Operator: Generates GPU Operator Helm values and ClusterPolicy manifest
  • Network Operator: Generates Network Operator Helm values and NICClusterPolicy manifest
  • Cert-Manager: Generates cert-manager Helm values for certificate management
  • NVSentinel: Generates NVSentinel Helm values
  • Skyhook: Generates Skyhook Operator Helm values and Skyhook CR manifest for node optimization

Value overrides:

The --set flag allows runtime customization of generated bundle values:

aicr bundle -r recipe.yaml \
  --set gpuoperator:gds.enabled=true \
  --set gpuoperator:driver.version=570.86.16
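A `component:dot.path=value` override like the ones above can be folded into a nested values map roughly as follows. This is an illustrative sketch; `applySet` is not the actual aicr flag handler, and the real implementation may coerce value types rather than store strings.

```go
package main

import (
	"fmt"
	"strings"
)

// applySet parses one --set argument of the form "component:dot.path=value"
// and writes the value into a nested per-component values map.
func applySet(overrides map[string]map[string]any, arg string) error {
	comp, rest, ok := strings.Cut(arg, ":")
	if !ok {
		return fmt.Errorf("missing component prefix: %q", arg)
	}
	path, value, ok := strings.Cut(rest, "=")
	if !ok {
		return fmt.Errorf("missing '=' in %q", arg)
	}
	if overrides[comp] == nil {
		overrides[comp] = map[string]any{}
	}
	node := overrides[comp]
	keys := strings.Split(path, ".")
	// Walk/create intermediate maps, then set the leaf key.
	for _, k := range keys[:len(keys)-1] {
		child, _ := node[k].(map[string]any)
		if child == nil {
			child = map[string]any{}
			node[k] = child
		}
		node = child
	}
	node[keys[len(keys)-1]] = value
	return nil
}

func main() {
	o := map[string]map[string]any{}
	_ = applySet(o, "gpuoperator:gds.enabled=true")
	_ = applySet(o, "gpuoperator:driver.version=570.86.16")
	fmt.Println(o["gpuoperator"])
}
```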

Node scheduling options:

The bundle command supports node selector and toleration flags for controlling workload placement:

# Schedule system components (operators, controllers) on specific nodes
aicr bundle -r recipe.yaml \
  --system-node-selector nodeGroup=system-pool \
  --system-node-toleration dedicated=system:NoSchedule

# Schedule GPU workloads (drivers, device plugins) on GPU nodes
aicr bundle -r recipe.yaml \
  --accelerated-node-selector nvidia.com/gpu.present=true \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule

Flags:

  • --system-node-selector key=value – Node selector for system components (repeatable)
  • --system-node-toleration key=value:effect – Toleration for system components (repeatable)
  • --accelerated-node-selector key=value – Node selector for GPU nodes (repeatable)
  • --accelerated-node-toleration key=value:effect – Toleration for GPU nodes (repeatable)
  • --nodes N – Estimated number of GPU nodes (bundle-time only; written to paths in registry under nodeScheduling.nodeCountPaths)

These flags apply selectors/tolerations to bundler-specific paths (e.g., the GPU Operator bundler uses operator.nodeSelector and daemonsets.nodeSelector).

Execution model:

  • Bundlers run concurrently (parallel execution)
  • All components from the recipe are bundled automatically
  • Errors from any bundler cause immediate cancellation via context propagation

Testing: End-to-end workflow validated by Chainsaw tests in tests/chainsaw/cli/

Architecture Diagram

flowchart TD
    A["aicr CLI<br/>cmd/aicr/main.go"] --> B["Root Command<br/>pkg/cli/root.go"]
    
    B --> B1["Version info (ldflags)<br/>Debug flag → Logging<br/>Shell completion"]
    
    B --> C["Step 1: snapshot CMD<br/>pkg/cli/snapshot.go<br/>Capture system state"]
    B --> D["Step 2: recipe CMD<br/>pkg/cli/recipe.go<br/>Query & Snapshot modes"]
    B --> E["Step 3: bundle CMD<br/>pkg/cli/bundle.go<br/>Parallel generation"]
    
    C --> F[Shared Packages]
    D --> F
    E --> F
    
    F --> F1["Collector Factory"]
    F --> F2["Recipe Builder"]
    F --> F3["Snapshotter Service"]
    F --> F4["Serializer<br/>(JSON/YAML/Table)"]
    F --> F5["Bundler Registry<br/>(Parallel execution)"]

ConfigMap Integration

The CLI supports Kubernetes-native ConfigMap storage using the cm://namespace/name URI scheme:

flowchart LR
    A["aicr snapshot<br/>-o cm://ns/snap"] -->|"Write"| CM1["ConfigMap<br/>aicr-snapshot"]
    
    CM1 -->|"Read"| B["aicr recipe<br/>-s cm://ns/snap<br/>-o cm://ns/recipe"]
    
    B -->|"Write"| CM2["ConfigMap<br/>aicr-recipe"]
    
    CM2 -->|"Read"| C["aicr bundle<br/>-r cm://ns/recipe<br/>-o ./bundles"]
    
    C --> D["Local Bundle<br/>Directory"]
    
    style CM1 fill:#e1f5ff
    style CM2 fill:#e1f5ff

Benefits:

  • No file dependencies - Direct Kubernetes API integration
  • Agent-friendly - Jobs can write snapshots without volumes
  • Pipeline integration - CI/CD can read/write ConfigMaps
  • Multi-cluster - Share snapshots/recipes across clusters

RBAC Requirements:

  • ConfigMap read/write permissions in target namespace
  • ServiceAccount with appropriate Role/RoleBinding
  • See Agent Deployment for details
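The cm://namespace/name scheme used throughout can be parsed with a small helper like the one below. This is a sketch under stated assumptions: `parseConfigMapURI` is an illustrative name, not the CLI's internal parser.

```go
package main

import (
	"fmt"
	"strings"
)

// parseConfigMapURI splits a cm://namespace/name reference into its parts,
// rejecting anything that doesn't carry both a namespace and a name.
func parseConfigMapURI(uri string) (namespace, name string, err error) {
	rest, ok := strings.CutPrefix(uri, "cm://")
	if !ok {
		return "", "", fmt.Errorf("not a cm:// URI: %q", uri)
	}
	namespace, name, ok = strings.Cut(rest, "/")
	if !ok || namespace == "" || name == "" {
		return "", "", fmt.Errorf("expected cm://namespace/name, got %q", uri)
	}
	return namespace, name, nil
}

func main() {
	ns, name, _ := parseConfigMapURI("cm://gpu-operator/aicr-snapshot")
	fmt.Println(ns, name) // gpu-operator aicr-snapshot
}
```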

Component Details

Entry Point: cmd/aicr/main.go

Minimal entry point that delegates to the CLI package:

package main

import "github.com/NVIDIA/aicr/pkg/cli"

func main() {
    cli.Execute()
}

Root Command: pkg/cli/root.go

Responsibilities:

  • Command registration and routing
  • Version information injection (via ldflags)
  • Global flag handling (debug mode, log formatting)
  • Logging mode selection and initialization

Key Features:

  • Version info: version, commit, date (overridden at build time)
  • Three logging modes:
    • CLI Mode (default): Minimal output for users (SetDefaultCLILogger)
    • Text Mode (--debug): Full metadata for debugging (SetDefaultLoggerWithLevel)
    • JSON Mode (--log-json): Structured logs for automation (SetDefaultStructuredLoggerWithLevel)
  • Logger selection logic:
    switch {
    case c.Bool("log-json"):
        logging.SetDefaultStructuredLoggerWithLevel(name, version, logLevel)
    case isDebug:
        logging.SetDefaultLoggerWithLevel(name, version, logLevel)
    default:
        logging.SetDefaultCLILogger(logLevel)
    }
  • Shell completion support
  • Command listing for auto-completion

Snapshot Command: pkg/cli/snapshot.go

Captures comprehensive system configuration snapshots.

Command Flow

flowchart TD
    A[User Invocation] --> B[Parse Flags<br/>format, output]
    B --> C[Create Collector Factory]
    C --> D[Initialize NodeSnapshotter]
    D --> E[Parallel Collection<br/>errgroup]
    E --> F[Aggregate Measurements]
    F --> G[Serialize Output]
    G --> H[Write to stdout/file]

Detailed Data Flow

flowchart TD
    A[Snapshot Command] --> B[collector.NewDefaultFactory]
    
    B --> B1["OSCollector<br/>(grub, kmod, sysctl)"]
    B --> B2["SystemDCollector<br/>(containerd, docker, kubelet)"]
    B --> B3["KubernetesCollector<br/>(server, images, policies)"]
    B --> B4["GPUCollector<br/>(nvidia-smi data)"]
    
    B1 & B2 & B3 & B4 --> C[NodeSnapshotter.Measure]
    
    C --> D["Parallel Collection<br/>(errgroup)"]
    
    D --> D1["Go Routine 1: Metadata<br/>• version<br/>• source<br/>• timestamp"]
    D --> D2["Go Routine 2: Kubernetes<br/>• Server Version<br/>• Container Images<br/>• ClusterPolicies"]
    D --> D3["Go Routine 3: SystemD<br/>• containerd.service<br/>• docker.service<br/>• kubelet.service"]
    D --> D4["Go Routine 4: OS Config<br/>• GRUB parameters<br/>• Kernel modules<br/>• Sysctl parameters"]
    D --> D5["Go Routine 5: GPU<br/>• nvidia-smi properties<br/>• driver, CUDA, etc."]
    
    D1 & D2 & D3 & D4 & D5 --> E["All goroutines complete<br/>or first error returns"]
    
    E --> F["Snapshot Structure<br/>kind: Snapshot<br/>apiVersion: aicr.nvidia.com/v1alpha1<br/>measurements: [k8s, systemd, os, gpu]"]
    
    F --> G[serializer.NewFileWriterOrStdout]
    
    G --> G1["Format: JSON/YAML/Table"]
    G --> G2["Output: stdout or file"]

Usage Examples

# Output to stdout in JSON format
aicr snapshot

# Save to file in YAML format
aicr snapshot --output system.yaml --format yaml

# Human-readable table format
aicr snapshot --format table

# ConfigMap output (Kubernetes-native)
aicr snapshot --output cm://gpu-operator/aicr-snapshot

Agent Deployment Pattern

The snapshot command can be deployed as a Kubernetes Job for automated cluster auditing:

flowchart TD
    A["Kubernetes Job<br/>aicr snapshot"] --> B{"Has RBAC?"}
    B -->|Yes| C["Write to ConfigMap<br/>aicr-snapshot"]
    B -->|No| D["Error: Forbidden"]
    
    C --> E["External CLI<br/>aicr recipe<br/>-s cm://ns/snap"]
    
    E --> F["Generate Recipe<br/>from ConfigMap"]
    
    F --> G["Bundle Generation"]
    
    style C fill:#90EE90
    style D fill:#FFB6C1

Deployment:

apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
  namespace: gpu-operator
spec:
  template:
    spec:
      serviceAccountName: aicr
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:latest
        command:
        - aicr
        - snapshot
        - --output
        - cm://gpu-operator/aicr-snapshot
      restartPolicy: Never

RBAC Requirements:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aicr
  namespace: gpu-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aicr
  namespace: gpu-operator
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aicr
  namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: aicr
subjects:
- kind: ServiceAccount
  name: aicr
  namespace: gpu-operator  # Must match ServiceAccount namespace

Key Points:

  • No volumes needed - writes directly via Kubernetes API
  • RBAC RoleBinding must reference correct namespace
  • ConfigMap is created automatically if it doesn't exist
  • Supports update pattern (overwrite existing snapshots)
  • RBAC and Job resources are created programmatically by pkg/k8s/agent

Recipe Command: pkg/cli/recipe.go

Generates optimized configuration recipes based on environment parameters.

Command Flow

flowchart TD
    A[User Flags] --> B[Build Query from Flags]
    B --> C[Parse & Validate Versions]
    C --> D[recipe.BuildRecipe]
    D --> E["Load Recipe Store<br/>(embedded YAML)"]
    E --> F[Match Overlays]
    F --> G[Merge Measurements]
    G --> H[Serialize Output]
    H --> I[Write to stdout/file]

Detailed Data Flow

flowchart TD
    A[Recipe Command] --> B[buildQueryFromCmd]
    
    B --> B1["Parse CLI Flags:<br/>--service, --accelerator/--gpu<br/>--intent, --os, --nodes"]
    B1 --> B2["Version Parsing:<br/>• ParseVersion for osv, kernel, k8s<br/>• Reject negative components<br/>• Support precision (1.2.3, 1.2, 1)"]
    
    B2 --> C[recipe.BuildRecipe]
    
    C --> C1["Step 1: Load Recipe Store<br/>(embedded YAML, cached)"]
    C1 --> C2["Step 2: Clone Base Measurements<br/>(deep copy: os, systemd, k8s, gpu)"]
    C2 --> C3["Step 3: Match Overlays<br/>• For each overlay: IsMatch(query)<br/>• Asymmetric: recipe any=wildcard<br/>• Query any ≠ specific recipe"]
    C3 --> C4["Step 4: Merge Overlay Measurements<br/>• Index by measurement.Type<br/>• Merge subtypes by name<br/>• Overlay data takes precedence"]
    C4 --> C5["Step 5: Strip Context<br/>(if not requested)"]
    C5 --> C6["Recipe Structure:<br/>request,<br/>measurements"]
    
    C6 --> D["serializer.NewFileWriterOrStdout<br/>(JSON/YAML/Table)"]

Recipe Matching Algorithm

The recipe matching uses an asymmetric rule-based query system where overlay criteria (rules) match against user queries (candidates):

# Overlay file (eks.yaml)
spec:
  criteria:
    service: eks          # Rule: query must have service=eks
                         # Other fields empty = wildcards (match any query value)

Asymmetric Matching Rules:

  1. All non-empty fields in the overlay criteria must be satisfied by the query
  2. Empty overlay field → Wildcard (matches any query value)
  3. Query "any" field → Only matches overlay "any" (does NOT match specific overlays)
  4. Version fields use semantic version equality with precision awareness

This asymmetric behavior ensures generic queries (e.g., --service eks --intent training) don't match overly specific recipes (e.g., recipes requiring accelerator: gb200).
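The four rules above can be sketched as a small predicate. This is an illustration of the asymmetric semantics only; the `Criteria` struct and `matches` function are assumptions, not the actual aicr types, and version-precision matching is omitted.

```go
package main

import "fmt"

// Criteria holds overlay (rule) or query (candidate) fields; empty means unset.
type Criteria struct {
	Service, Accelerator, Intent, OS string
}

// matches applies the asymmetric rules: empty or "any" overlay fields are
// wildcards, while a query value of "any" does NOT satisfy a specific rule.
func matches(overlay, query Criteria) bool {
	pairs := [][2]string{
		{overlay.Service, query.Service},
		{overlay.Accelerator, query.Accelerator},
		{overlay.Intent, query.Intent},
		{overlay.OS, query.OS},
	}
	for _, p := range pairs {
		rule, candidate := p[0], p[1]
		if rule == "" || rule == "any" {
			continue // wildcard rule matches any query value
		}
		if candidate != rule {
			return false // also rejects candidate=="any" against a specific rule
		}
	}
	return true
}

func main() {
	eks := Criteria{Service: "eks"} // other fields empty = wildcards
	fmt.Println(matches(eks, Criteria{Service: "eks", Intent: "training"})) // true
	specific := Criteria{Service: "eks", Accelerator: "gb200"}
	fmt.Println(matches(specific, Criteria{Service: "eks", Accelerator: "any"})) // false
}
```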

Usage Examples

# Basic recipe for Ubuntu with gb200 GPU
aicr recipe --os ubuntu --gpu gb200

# Full specification with all parameters
aicr recipe \
  --service eks \
  --accelerator gb200 \
  --intent training \
  --os ubuntu \
  --nodes 8 \
  --format yaml \
  --output recipe.yaml

# Inference workload on GKE  
aicr recipe --service gke --gpu gb200 --intent inference

# Snapshot mode - analyze captured snapshot for training
aicr recipe --snapshot system.yaml --intent training

# Snapshot mode - analyze for inference optimization
aicr recipe \
  --snapshot cluster-snapshot.yaml \
  --intent inference \
  --format yaml \
  --output recipe.yaml

Recipe Command Modes

The recipe command supports two modes of operation:

Query Mode (Default)

Direct recipe generation from environment parameters:

flowchart TD
    A[User Invocation] --> B[Parse Flags<br/>service, accelerator, intent, os, nodes]
    B --> C[Build Criteria Object]
    C --> D[recipe.BuildFromCriteria]
    D --> E[Match Overlays in Data Store]
    E --> F[Apply Overlays]
    F --> G[Generate Recipe]
    G --> H[Serialize Output]
    H --> I[Write to stdout/file]

Snapshot Mode

Analyze captured snapshots and generate tailored recipes:

flowchart TD
    A[User Invocation] --> B[Parse Flags<br/>snapshot, intent, format, output]
    B --> C[Load Snapshot from File]
    C --> D[recipe.BuildFromSnapshot]
    D --> E[Extract Query from Snapshot]
    E --> F[Match Rules in Data Store]
    F --> G[Apply Overlays]
    G --> H[Generate Recipe]
    H --> I[Serialize Output]
    H --> J[Write to stdout/file]

Query Extraction from Snapshot

When using snapshot mode, the recipe builder extracts environment parameters from the snapshot:

From OS Measurements:

  • release subtype → OS family (ubuntu, rhel, cos, amazonlinux)

From Kubernetes Measurements:

  • server subtype → K8s service provider (eks, gke, aks) inferred from images

From GPU Measurements:

  • Product Name → GPU type detection (H100, GB200, A100, L40)
  • Maps product names to normalized accelerator types for recipe matching

Intent Types:

  • training – Optimize for high throughput, batch processing, multi-GPU orchestration
  • inference – Optimize for low latency, single-request performance, efficient batching
  • any – Provides general-purpose recommendations applicable to both workloads

External Data Directory

The --data flag enables extending embedded recipe data with external files:

flowchart TD
    A[Embedded Data<br/>recipes/] --> C[Layered Data Provider]
    B[External Directory<br/>--data ./my-data/] --> C
    C --> D[Recipe Generation]

    subgraph Merge Behavior
        E[registry.yaml] -->|Merged| F[Combined Registry]
        G[Other Files] -->|Replaced| H[External Takes Precedence]
    end

Requirements:

  • External directory must contain registry.yaml
  • No symlinks allowed (security)
  • Max file size: 10MB per file

Merge Rules:

  • registry.yaml: Components merged by name (external overrides embedded)
  • All other files: External replaces embedded if path matches
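The registry.yaml merge rule (components keyed by name, external overriding embedded) amounts to something like the sketch below. Type and function names are illustrative, not the actual aicr data-provider API.

```go
package main

import "fmt"

// component is a simplified registry entry for illustration.
type component struct {
	Name, Version string
}

// mergeRegistries merges components by name: external entries replace
// embedded entries with the same name; new external entries are appended.
func mergeRegistries(embedded, external []component) []component {
	byName := map[string]int{}
	out := make([]component, 0, len(embedded)+len(external))
	for _, c := range embedded {
		byName[c.Name] = len(out)
		out = append(out, c)
	}
	for _, c := range external {
		if i, ok := byName[c.Name]; ok {
			out[i] = c // external overrides embedded
		} else {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	merged := mergeRegistries(
		[]component{{"gpu-operator", "v25.3.3"}, {"network-operator", "v25.4.0"}},
		[]component{{"gpu-operator", "v25.3.4"}},
	)
	fmt.Println(merged) // [{gpu-operator v25.3.4} {network-operator v25.4.0}]
}
```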

Usage Examples

# Query mode - generate recipe from parameters
aicr recipe --os ubuntu --service eks --accelerator h100 --intent training

# Snapshot mode - analyze snapshot for training workloads
aicr recipe --snapshot system.yaml --intent training

# Snapshot mode with output file
aicr recipe -s system.yaml -i inference -o recipe.yaml

# Query mode with full specification
aicr recipe \
  --service eks \
  --accelerator gb200 \
  --intent training \
  --os ubuntu \
  --platform kubeflow \
  --nodes 8 \
  --format yaml

# Use external data directory
aicr recipe --service eks --accelerator h100 --data ./my-custom-data

# Bundle with external data
aicr bundle --recipe recipe.yaml --data ./my-custom-data --output ./bundles

Recipe Output Structure

apiVersion: aicr.nvidia.com/v1alpha1
kind: Recipe
metadata:
  version: v1.0.0
  created: "2025-01-15T10:30:00Z"
  appliedOverlays:
    - base
    - eks
    - eks-training
    - gb200-eks-training
    - gb200-eks-ubuntu-training
criteria:
  service: eks
  accelerator: gb200
  intent: training
  os: ubuntu
  nodes: 8
componentRefs:
  - name: gpu-operator
    version: v25.3.3
    order: 1
    repository: https://helm.ngc.nvidia.com/nvidia
  - name: network-operator
    version: v25.4.0
    order: 2
    repository: https://helm.ngc.nvidia.com/nvidia
constraints:
  driver:
    version: "580.82.07"
    cudaVersion: "13.1"

Error Handling

  • Query Mode:

    • Invalid parameter values: Returns error with supported options
    • Missing required parameters: Allows "any" as default fallback
    • No matching overlays: Returns recipe with base configuration
  • Snapshot Mode:

    • Missing snapshot file: File not found error with path
    • Invalid snapshot format: Parse error with details
    • Invalid intent: Returns error with supported intent types (training, inference, any)
    • Extraction failures: Best-effort extraction with partial criteria

Common Errors:

  • Unknown output format: Error with supported formats list (json, yaml)

Query Command: pkg/cli/query.go

Extracts specific values from the fully hydrated recipe configuration using dot-path selectors.

Command Flow

flowchart TD
    A[User Flags + --selector] --> B[Build Recipe from Criteria]
    B --> C[recipe.HydrateResult]
    C --> D["Inline GetValuesForComponent<br/>for each ComponentRef"]
    D --> E["recipe.Select(hydrated, selector)"]
    E --> F{Scalar?}
    F -->|Yes| G[Print plain text]
    F -->|No| H[Print YAML/JSON]

Hydration Process

The query command builds a fully hydrated map[string]any from the RecipeResult:

  1. Recipe-level fields (criteria, metadata, deploymentOrder, constraints) are mapped directly
  2. Each ComponentRef is expanded into a component map with metadata fields (name, chart, source, version, etc.)
  3. GetValuesForComponent is called per component to merge base values, overlay values, and inline overrides
  4. The merged values are inlined under each component's values key

Selector Resolution

The selector uses dot-delimited path walking. Leading dots are stripped (yq-style), so .components.X and components.X are equivalent. An empty selector or . returns the entire hydrated map.
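The dot-path walk described above can be sketched as follows. This illustrates the documented behavior (leading dot stripped, empty selector returns the whole map); `selectPath` is not the actual recipe.Select implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// selectPath walks a dot-delimited selector through nested maps.
// A leading dot is stripped (yq-style); "" or "." returns the whole document.
func selectPath(doc map[string]any, selector string) (any, bool) {
	selector = strings.TrimPrefix(selector, ".")
	if selector == "" {
		return doc, true
	}
	var cur any = doc
	for _, key := range strings.Split(selector, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false // tried to descend into a scalar
		}
		cur, ok = m[key]
		if !ok {
			return nil, false // key not present
		}
	}
	return cur, true
}

func main() {
	hydrated := map[string]any{
		"components": map[string]any{
			"gpu-operator": map[string]any{
				"values": map[string]any{"driver": map[string]any{"version": "580.82.07"}},
			},
		},
	}
	v, _ := selectPath(hydrated, ".components.gpu-operator.values.driver.version")
	fmt.Println(v) // 580.82.07
}
```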

Usage Examples

# Scalar value — plain text output
aicr query --service eks --accelerator h100 --intent training \
  --selector components.gpu-operator.values.driver.version

# Subtree — YAML output
aicr query --service eks --accelerator h100 --intent training \
  --selector components.gpu-operator.values.driver

# Shell-friendly for scripting
VERSION=$(aicr query --service eks --accelerator h100 --intent training \
  --selector components.gpu-operator.values.driver.version)

Implementation: pkg/recipe/query.go (HydrateResult, Select)

Bundle Command: pkg/cli/bundle.go

Generates deployment-ready bundles (Helm values, Kubernetes manifests, installation scripts) from recipes.

Command Flow

flowchart TD
    A[User Invocation] --> B[Parse Flags<br/>recipe, bundlers, output]
    B --> C[Parse Bundler Types]
    C --> D[Load Recipe from File]
    D --> E[Create DefaultBundler]
    E --> F[Execute Bundlers<br/>Parallel by Default]
    F --> G[Collect Results]
    G --> H[Check for Errors]
    H --> I[Log Summary]
    I --> J[Return Status]

Detailed Data Flow

flowchart TD
    A[Bundle Command] --> B[Parse CLI Flags]
    
    B --> B1["--recipe (required)<br/>--output (default: .)"]
    
    B1 --> C[serializer.FromFile Recipe]
    
    C --> D[bundler.New]
    
    D --> D1["Build DefaultBundler:<br/>• All recipe components bundled<br/>• Parallel execution"]
    
    D1 --> E[DefaultBundler.Make]
    
    E --> E1["Select Bundlers:<br/>• All recipe components"]
    
    E1 --> E2["Parallel Execution:<br/>• errgroup.WithContext<br/>• One goroutine per bundler<br/>• Concurrent file generation"]
    
    E2 --> E3["Each Bundler:<br/>1. Validate recipe (optional interface)<br/>2. Configure (optional interface)<br/>3. Generate bundle files<br/>4. Compute checksums<br/>5. Return Result"]
    
    E3 --> F[Aggregate BundleOutput]
    
    F --> F1["Results:<br/>• Files generated<br/>• Total size<br/>• Duration<br/>• Success/error counts"]
    
    F1 --> G[Check HasErrors]
    
    G -->|No Errors| H[Return Success]
    G -->|Has Errors| I[Return Error]
    
    style E2 fill:#ffeb3b
    style E3 fill:#c8e6c9

Bundler Data Flow

Simplified Architecture (RecipeResult-to-Template):

flowchart TD
    A[RecipeResult] --> B[GetComponentRef]
    A --> C[GetValuesForComponent]
    B --> D[ComponentRef]
    C --> E[Values Map]
    D --> F[generateScriptData]
    E --> F
    F --> G[ScriptData struct]
    E --> H1[Template: values.yaml]
    G --> H2[Template: install.sh]
    G --> H3[Template: README.md]
    H1 & H2 & H3 --> I[Generated Files]

Key Simplification: Single RecipeResult path (no dual Recipe/RecipeResult routing)
Data Flow: RecipeResult → Values Map + ScriptData → Templates
Templates: Use index .Values "key" for config, .Script.* for metadata

Bundler Architecture

BaseBundler Helper Pattern:

// Bundlers embed BaseBundler and override Make()
type Bundler struct {
    *bundler.BaseBundler  // Provides common functionality
}

func NewBundler() *Bundler {
    return &Bundler{
        BaseBundler: bundler.NewBaseBundler(bundlerType, templatesFS),
    }
}

// Self-register at init time using MustRegister
func init() {
    bundler.MustRegister("gpu-operator", NewBundler())
}

RecipeResult-Based Data Access:

// Get component reference from RecipeResult
component := input.GetComponentRef(Name)
values := input.GetValuesForComponent(Name)

// Generate script metadata
scriptData := generateScriptData(component, values)

// Pass values map to templates (config values)
b.GenerateFileFromTemplate(ctx, GetTemplate, "values.yaml", path, values, 0644)

// Pass ScriptData to scripts (metadata)
b.GenerateFileFromTemplate(ctx, GetTemplate, "install.sh", path, scriptData, 0755)

// Pass combined data to README
readmeData := map[string]interface{}{"Values": values, "Script": scriptData}
b.GenerateFileFromTemplate(ctx, GetTemplate, "README.md", path, readmeData, 0644)

Data Flow: RecipeResult → Values/ScriptData → Template

RecipeResult → GetComponentRef(Name) → ComponentRef
             → GetValuesForComponent(Name) → values map
             → generateScriptData() → ScriptData struct
             → Template ({{ index .Values "key" }} or {{ .Script.Namespace }})

Registry Pattern:

// Dynamic bundler discovery
bundlers := defaultRegistry.GetAll()  // Returns all registered bundlers
bundlers := defaultRegistry.Get(type) // Returns specific bundler

// MustRegister panics on duplicate types (fail-fast)
bundler.MustRegister("gpu-operator", NewBundler())

DefaultBundler Options:

  • WithBundlerTypes([]BundleType) – Specify bundler types (empty = all registered)
  • WithFailFast(bool) – Stop on first error (default: false/collect all)
  • WithConfig(*Config) – Provide bundler configuration
  • WithRegistry(*Registry) – Use custom bundler registry

Execution:

  • Parallel execution by default: Uses errgroup.WithContext for concurrent execution
    • All bundlers run concurrently when no types specified
    • Faster for multiple bundlers
    • Context cancellation propagates to all bundlers
    • Bundlers are stateless (thread-safe by design)
    • BaseBundler provides thread-safe operations

Architecture Benefits:

  • 75% less code per bundler (BaseBundler eliminates boilerplate)
  • 34% less test code (TestHarness standardizes testing)
  • 15+ internal helpers for recipe parsing
  • Automatic registration via init() functions
  • Fail-fast on duplicate bundler types

Usage Examples

# Generate all recipe components (parallel by default)
aicr bundle --recipe recipe.yaml --output ./bundles

# Use short flags
aicr bundle -r recipe.yaml -o ./bundles

# Override values at generation time
aicr bundle -r recipe.yaml \
  --set gpuoperator:gds.enabled=true \
  --set gpuoperator:driver.version=570.86.16 \
  -o ./bundles

# Override values for multiple components
aicr bundle -r recipe.yaml \
  --set gpuoperator:mig.strategy=mixed \
  --set networkoperator:rdma.enabled=true \
  -o ./bundles

# Schedule system components on system node pool
aicr bundle -r recipe.yaml \
  --system-node-selector nodeGroup=system-pool \
  --system-node-toleration dedicated=system:NoSchedule \
  -o ./bundles

# Schedule GPU workloads on labeled GPU nodes
aicr bundle -r recipe.yaml \
  --accelerated-node-selector nvidia.com/gpu.present=true \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
  -o ./bundles

Bundle Output Structure

./bundles/
├── gpu-operator/
│   ├── values.yaml              # Helm chart values
│   ├── manifests/
│   │   └── clusterpolicy.yaml  # ClusterPolicy CR
│   ├── scripts/
│   │   ├── install.sh          # Installation script
│   │   └── uninstall.sh        # Cleanup script
│   ├── README.md                # Deployment instructions
│   └── checksums.txt            # SHA256 verification
├── network-operator/
│   ├── values.yaml
│   ├── manifests/
│   │   └── nicclusterpolicy.yaml
│   ├── scripts/
│   ├── README.md
│   └── checksums.txt
├── cert-manager/
│   ├── values.yaml
│   ├── README.md
│   └── checksums.txt
├── nvsentinel/
│   ├── values.yaml
│   ├── README.md
│   └── checksums.txt
└── skyhook/
    ├── values.yaml
    ├── manifests/
    │   └── skyhook.yaml
    ├── README.md
    └── checksums.txt

Error Handling

Validation Errors:

  • Missing recipe file: File not found error with path
  • Invalid recipe format: Parse error with details
  • Invalid bundler type: Error with list of supported types
  • Empty measurements: Recipe validation failure

Execution Errors:

  • FailFast=false (default): Collects all errors, continues execution
    • Returns partial results with error list
    • Exit code indicates failure count
  • FailFast=true: Stops on first bundler error
    • Returns immediately with error
    • Subsequent bundlers not executed

Common Error Scenarios:

# Missing recipe file
$ aicr bundle --output ./bundles
Error: required flag "recipe" not set

# Bundler failures (FailFast=false)
$ aicr bundle -r recipe.yaml
Error: bundle generation completed with errors: 1/2 bundlers failed

CLI Integration

The bundle command integrates with the CLI through:

  1. Shared Serializer: Uses same serializer.FromFile for recipe loading
  2. Structured Logging: Consistent slog structured logging
  3. Context Propagation: Respects context cancellation
  4. Error Patterns: Uses same error handling conventions

Log Output Example:

INFO  generating bundle recipeFilePath=recipe.yaml outputDir=./bundles bundlerTypes=[gpu-operator]
INFO  starting bundle generation bundler_count=1 output_dir=./bundles
INFO  bundler completed bundler_type=gpu-operator files=5 size_bytes=12458 duration=45ms
INFO  bundle generation complete summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."
INFO  bundle generation completed success=1 errors=0 duration_sec=0.045 summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."


Shared Infrastructure

Collector Factory Pattern

The CLI uses the Factory Pattern for collector instantiation, enabling:

  • Testability: Inject mock collectors for unit tests
  • Flexibility: Easy to add new collector types
  • Encapsulation: Hide collector creation complexity

type Factory interface {
    CreateSystemDCollector() Collector
    CreateOSCollector() Collector
    CreateKubernetesCollector() Collector
    CreateGPUCollector() Collector
}
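The testability benefit looks like this in practice: a mock factory satisfies the interface with canned collectors. The Collector method set and mock types below are assumptions for illustration; only the Factory method set comes from the definition above.

```go
package main

import "fmt"

// Collector is assumed here to expose at least a name; the real interface
// in aicr is richer.
type Collector interface {
	Name() string
}

// Factory matches the interface shown above.
type Factory interface {
	CreateSystemDCollector() Collector
	CreateOSCollector() Collector
	CreateKubernetesCollector() Collector
	CreateGPUCollector() Collector
}

type stubCollector struct{ name string }

func (s stubCollector) Name() string { return s.name }

// mockFactory returns canned collectors so snapshot logic can be unit-tested
// without touching the OS, systemd, the Kubernetes API, or GPUs.
type mockFactory struct{}

func (mockFactory) CreateSystemDCollector() Collector    { return stubCollector{"systemd"} }
func (mockFactory) CreateOSCollector() Collector         { return stubCollector{"os"} }
func (mockFactory) CreateKubernetesCollector() Collector { return stubCollector{"k8s"} }
func (mockFactory) CreateGPUCollector() Collector        { return stubCollector{"gpu"} }

func main() {
	var f Factory = mockFactory{}
	fmt.Println(f.CreateGPUCollector().Name()) // gpu
}
```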

Serializer Abstraction

Output formatting is abstracted through the serializer.Serializer interface:

type Serializer interface {
    Serialize(data interface{}) error
}

Implementations:

  • JSON: encoding/json with 2-space indent
  • YAML: gopkg.in/yaml.v3
  • Table: text/tabwriter for columnar display
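A minimal sketch of the JSON implementation, assuming only the interface shown above: encoding/json with 2-space indentation writing to an io.Writer. The concrete type and helper names are illustrative, not the actual aicr serializer.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Serializer matches the interface shown above.
type Serializer interface {
	Serialize(data interface{}) error
}

// jsonSerializer writes indented JSON to any io.Writer.
type jsonSerializer struct{ w io.Writer }

func (s jsonSerializer) Serialize(data interface{}) error {
	enc := json.NewEncoder(s.w)
	enc.SetIndent("", "  ") // 2-space indent, as documented for the JSON output
	return enc.Encode(data)
}

// serializeToString is a small convenience wrapper for demonstration.
func serializeToString(data interface{}) (string, error) {
	var buf bytes.Buffer
	err := jsonSerializer{w: &buf}.Serialize(data)
	return buf.String(), err
}

func main() {
	out, _ := serializeToString(map[string]string{"kind": "Snapshot"})
	fmt.Print(out)
}
```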

Measurement Data Model

All collected data uses a unified measurement.Measurement structure:

type Measurement struct {
    Type     Type      // os, k8s, systemd, gpu
    Subtypes []Subtype // Named collections of readings
}

type Subtype struct {
    Name    string                // grub, kmod, sysctl, server, image, etc.
    Data    map[string]Reading    // Key-value readings
    Context map[string]string     // Human-readable descriptions
}

type Reading struct {
    Value interface{}  // Actual value (int, string, bool, float64)
}
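Building a measurement by hand shows how the pieces fit together (`Type` is simplified to a plain string here; the real package uses named types):

```go
package main

import "fmt"

// Simplified mirrors of the data model above.
type Reading struct{ Value interface{} }

type Subtype struct {
	Name    string
	Data    map[string]Reading
	Context map[string]string
}

type Measurement struct {
	Type     string
	Subtypes []Subtype
}

func main() {
	m := Measurement{
		Type: "gpu",
		Subtypes: []Subtype{{
			Name: "smi",
			Data: map[string]Reading{
				"driver-version": {Value: "570.158.01"},
				"gpu-count":      {Value: 8},
			},
			Context: map[string]string{
				"driver-version": "NVIDIA kernel driver version",
			},
		}},
	}
	fmt.Println(m.Type, m.Subtypes[0].Data["driver-version"].Value)
	// gpu 570.158.01
}
```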

Error Handling

CLI Error Strategy

  1. Flag Validation: User-friendly error messages for invalid flags
  2. Version Parsing: Specific error types (ErrNegativeComponent, etc.)
  3. Collector Failures: Log errors, continue with partial data where possible
  4. Serialization Errors: Fatal - abort and report
  5. Exit Codes: Non-zero exit code on any failure

Example Error Messages

# Invalid accelerator type
$ aicr recipe --accelerator invalid-gpu
Error: invalid accelerator type: must be one of h100, gb200, a100, l40, any

# Unknown output format
$ aicr snapshot --format xml
Error: unknown output format: "xml"

# Missing required parameters
$ aicr recipe
# Still succeeds - generates base recipe with no overlays

Performance Characteristics

Snapshot Command

  • Parallel Collection: All collectors run concurrently via errgroup
  • Typical Duration: 100-500ms depending on cluster size
  • Memory Usage: ~10-50MB for typical workloads
  • Scalability: O(n) with number of pods/nodes for K8s collector

Recipe Command

  • Store Loading: Once per process (cached via sync.Once)
  • Typical Duration: <10ms after initial load
  • Memory Usage: ~5-10MB (embedded YAML + parsed structure)
  • Scalability: O(m) with number of overlays (typically <100)
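The once-per-process store load can be sketched with sync.Once; the store contents below are placeholders:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	storeOnce sync.Once
	store     map[string]string
	loadCount int // counts how many times the expensive load actually ran
)

// getStore parses the embedded recipe store at most once per process;
// subsequent calls return the cached result immediately.
func getStore() map[string]string {
	storeOnce.Do(func() {
		loadCount++
		store = map[string]string{"ubuntu/h100": "placeholder-recipe"}
	})
	return store
}

func main() {
	for i := 0; i < 3; i++ {
		_ = getStore()
	}
	fmt.Println("loads:", loadCount) // loads: 1
}
```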

Build Configuration

Version Injection via ldflags

Build-time version information injection:

VERSION ?= $(shell git describe --tags --always --dirty)
COMMIT ?= $(shell git rev-parse --short HEAD)
DATE ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)

LDFLAGS := -X github.com/NVIDIA/aicr/pkg/cli.version=$(VERSION)
LDFLAGS += -X github.com/NVIDIA/aicr/pkg/cli.commit=$(COMMIT)
LDFLAGS += -X github.com/NVIDIA/aicr/pkg/cli.date=$(DATE)

go build -ldflags="$(LDFLAGS)" -o bin/aicr ./cmd/aicr

Testing Strategy

Unit Tests

  • Flag parsing and validation
  • Version parsing and error handling
  • Query building from command flags
  • Serializer format selection

Integration Tests

  • Mock collectors for deterministic output
  • Full command execution with fake factory
  • Output format validation

Example Test Structure

func TestSnapshotCommand(t *testing.T) {
    // Create mock factory
    mockFactory := &MockFactory{
        k8s:     mockK8sCollector,
        systemd: mockSystemDCollector,
        os:      mockOSCollector,
        gpu:     mockGPUCollector,
    }
    
    // Execute snapshot with the mock factory and an in-memory
    // serializer (a test double implementing serializer.Serializer)
    snapshotter := NodeSnapshotter{
        Factory:    mockFactory,
        Serializer: mockSerializer,
    }
    
    err := snapshotter.Measure(ctx)
    assert.NoError(t, err)
}

Dependencies

External Libraries

  • github.com/urfave/cli/v3 - CLI framework
  • golang.org/x/sync/errgroup - Concurrent error handling
  • gopkg.in/yaml.v3 - YAML parsing
  • log/slog - Structured logging (Go standard library)

Internal Packages

  • pkg/collector - System data collection
  • pkg/measurement - Data model
  • pkg/recipe - Recipe building
  • pkg/version - Semantic versioning
  • pkg/serializer - Output formatting
  • pkg/logging - Logging configuration
  • pkg/snapshotter - Snapshot orchestration

Future Enhancements

Short-Term (< 3 months)

  1. Caching Layer
    Rationale: Reduce latency for repeated aicr snapshot calls in scripts
    Implementation: sync.Map with TTL-based eviction using time.AfterFunc
    Trade-off: Stale data risk vs 5-10x performance improvement
    Reference: sync.Map

  2. Differential Snapshots
    Use Case: CI/CD pipelines detecting configuration drift
    Implementation: github.com/google/go-cmp/cmp for deep comparison
    Output: JSON Patch (RFC 6902) format for machine consumption
    CLI: aicr diff baseline.yaml current.yaml --format patch

  3. Measurement Filtering
    Use Case: Extract only GPU data without K8s overhead
    CLI: aicr snapshot --filter gpu,os --exclude k8s
    Implementation: Post-collection filtering before serialization
    Performance: Saves 60-70% execution time when K8s excluded

  4. Batch Mode
    Use Case: Fleet-wide configuration auditing (100s of nodes)
    Implementation: Worker pool with errgroup.SetLimit()
    CLI: aicr snapshot --nodes nodes.txt --workers 10 --output results/
    Reference: errgroup Limits

Mid-Term (3-6 months)

  1. Plugin System
    Rationale: Custom collectors without forking codebase
    Interface: type Collector interface { Collect(context.Context) (Measurement, error) }
    Options: Go plugins (unstable across versions) or WASM (safe, portable)
    Security: Sandboxed execution with restricted syscalls
    Reference: WebAssembly System Interface

  2. Configuration Files
    Use Case: Avoid repeating --os, --gpu flags
    Format: YAML following XDG Base Directory spec
    Location: ~/.config/aicr/config.yaml (Linux/macOS), %APPDATA%\aicr\config.yaml (Windows)
    Example:

    defaults:
      os: ubuntu
      gpu: h100
      format: yaml
    server:
      url: https://recipe-api.example.com
  3. Watch Mode
    Implementation: Hybrid of fsnotify + periodic polling
    CLI: aicr snapshot --watch --interval 30s --on-change ./alert.sh
    Output: Stream of JSON diffs to stdout
    Use Case: Real-time monitoring with alerting

  4. Schema Validation
    Use Case: Ensure snapshots conform to API version spec
    Implementation: Embed JSON Schema in binary with go:embed
    Library: github.com/santhosh-tekuri/jsonschema/v5 (fastest Go validator)
    CLI: aicr validate --schema v1 snapshot.json
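For item 2, the precedence rule (explicit flags override config-file defaults) reduces to a small merge; the field set here is illustrative, not the actual config schema:

```go
package main

import "fmt"

// config holds a few illustrative defaults from ~/.config/aicr/config.yaml.
type config struct {
	OS, GPU, Format string
}

// merge applies flag values over file defaults; empty string means "flag unset".
func merge(defaults, flags config) config {
	out := defaults
	if flags.OS != "" {
		out.OS = flags.OS
	}
	if flags.GPU != "" {
		out.GPU = flags.GPU
	}
	if flags.Format != "" {
		out.Format = flags.Format
	}
	return out
}

func main() {
	defaults := config{OS: "ubuntu", GPU: "h100", Format: "yaml"}
	// User passed only --gpu a100; everything else comes from the file.
	fmt.Println(merge(defaults, config{GPU: "a100"})) // {ubuntu a100 yaml}
}
```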

Long-Term (6-12 months)

  1. gRPC Mode
    Rationale: Better streaming, 3-5x smaller payloads than JSON
    Implementation: Bi-directional streaming with protobuf
    Trade-off: Added complexity (proto definitions) vs performance gains
    Reference: gRPC Go

  2. Distributed Tracing
    Use Case: Debug performance issues across collectors
    Implementation: OpenTelemetry SDK with span per collector
    Exporter: OTLP to Jaeger/Tempo
    CLI: aicr snapshot --trace --trace-endpoint localhost:4317
    Reference: OpenTelemetry Go

  3. Policy Enforcement
    Use Case: Block non-compliant configs in CI/CD
    Implementation: Embed OPA (github.com/open-policy-agent/opa)
    CLI: aicr validate --policy policy.rego snapshot.yaml
    Exit Code: 0 = pass, 1 = policy violations
    Reference: OPA Go Integration

  4. Cloud Storage Integration
    Use Case: Centralized storage for fleet management
    CLI: aicr snapshot --upload s3://bucket/snapshots/$(hostname).yaml
    Implementation: AWS SDK v2 with resumable uploads
    Authentication: IAM roles, service accounts, credential chain
    Reference: AWS SDK for Go V2

Production Deployment Patterns

Pattern 1: CI/CD Integration

Use Case: Automated configuration validation in build pipelines

GitLab CI Example:

validate_gpu_config:
  stage: test
  image: ghcr.io/nvidia/aicr:latest
  script:
    - aicr snapshot --format json > snapshot.json
    # Validate against known-good baseline
    - diff -u expected_snapshot.json snapshot.json
    # Or use OPA policy (future enhancement)
    # - aicr validate --policy policies/gpu_baseline.rego snapshot.json
  only:
    - merge_requests
  artifacts:
    when: on_failure
    paths:
      - snapshot.json

GitHub Actions Example:

name: Validate GPU Configuration
on:
  pull_request:
    paths:
      - 'ansible/**'
      - 'terraform/**'

jobs:
  validate:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      
      - name: Install aicr
        run: |
          curl -sfL https://raw.githubusercontent.com/.../installer | bash -s --
          echo "$HOME/.local/bin" >> $GITHUB_PATH
      
      - name: Capture snapshot
        run: aicr snapshot --format yaml --output snapshot.yaml
      
      - name: Generate recipe
        run: aicr recipe --os ubuntu --gpu h100 > recipe.yaml
      
      - name: Compare configurations
        run: |
          yq eval '.measurements[] | select(.type=="GPU")' snapshot.yaml > actual_gpu.yaml
          yq eval '.measurements[] | select(.type=="GPU")' recipe.yaml > expected_gpu.yaml
          diff -u expected_gpu.yaml actual_gpu.yaml || \
            (echo "::error::GPU configuration drift detected" && exit 1)
      
      - name: Upload artifact
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: configuration-drift
          path: |
            snapshot.yaml
            recipe.yaml

Jenkins Pipeline:

pipeline {
    agent { label 'gpu-node' }
    
    stages {
        stage('Snapshot') {
            steps {
                sh 'aicr snapshot --format json > snapshot.json'
            }
        }
        
        stage('Validate') {
            steps {
                script {
                    def snapshot = readJSON file: 'snapshot.json'
                    def gpuDriver = snapshot.measurements
                        .find { it.type == 'GPU' }
                        .subtypes.find { it.subtype == 'smi' }
                        .data.'driver-version'
                    
                    if (gpuDriver != '570.158.01') {
                        error("Incorrect GPU driver: ${gpuDriver}")
                    }
                }
            }
        }
    }
    
    post {
        always {
            archiveArtifacts artifacts: 'snapshot.json', fingerprint: true
        }
    }
}

Pattern 2: Scheduled Auditing

Use Case: Nightly configuration drift detection across fleet

Kubernetes CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: aicr-audit
  namespace: monitoring
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  concurrencyPolicy: Forbid  # Prevent overlapping runs
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: aicr-audit
        spec:
          serviceAccountName: aicr
          nodeSelector:
            node-role.kubernetes.io/gpu: "true"
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          containers:
          - name: aicr
            image: ghcr.io/nvidia/aicr:v0.6.4
            command:
              - /bin/sh
              - -c
              - |
                set -e
                TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                HOSTNAME=$(hostname)
                
                # Capture snapshot
                aicr snapshot --format yaml > /tmp/snapshot.yaml
                
                # Store as ConfigMap with retention
                kubectl create configmap \
                  "aicr-snapshot-${HOSTNAME}-${TIMESTAMP}" \
                  --from-file=snapshot=/tmp/snapshot.yaml \
                  --dry-run=client -o yaml | \
                kubectl apply -f -
                
                # Cleanup old snapshots (keep last 30 days)
                kubectl get configmaps -l aicr-snapshot=true \
                  --sort-by=.metadata.creationTimestamp | \
                head -n -30 | \
                xargs -r kubectl delete configmap
            resources:
              limits:
                memory: 256Mi
              requests:
                cpu: 100m
                memory: 128Mi
          restartPolicy: OnFailure

Systemd Timer (Bare Metal):

# /etc/systemd/system/aicr-audit.service
[Unit]
Description=AICR Configuration Audit
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/usr/local/bin/aicr snapshot --format json --output /var/log/aicr/snapshot-$$(date +%%Y%%m%%d).json'
User=aicr
Group=aicr

# Hardening
PrivateTmp=true
NoNewPrivileges=true
ReadOnlyPaths=/usr /etc
ReadWritePaths=/var/log/aicr

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/aicr-audit.timer
[Unit]
Description=AICR Audit Timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Enable with:

sudo systemctl enable --now aicr-audit.timer
sudo systemctl list-timers aicr-audit.timer

Pattern 3: Fleet Management

Use Case: Collect snapshots from 100s of GPU nodes in parallel

Ansible Playbook:

---
- name: Collect AICR Snapshots from GPU Fleet
  hosts: gpu_nodes
  gather_facts: yes
  serial: 10  # Process 10 nodes at a time
  tasks:
    - name: Ensure aicr is installed
      stat:
        path: /usr/local/bin/aicr
      register: aicr_binary
      failed_when: not aicr_binary.stat.exists
    
    - name: Collect snapshot
      shell: aicr snapshot --format json
      register: snapshot
      changed_when: false
      failed_when: snapshot.rc != 0
    
    - name: Upload to S3
      aws_s3:
        bucket: fleet-snapshots
        object: "{{ inventory_hostname }}/{{ ansible_date_time.iso8601 }}.json"
        content: "{{ snapshot.stdout }}"
        mode: put
      delegate_to: localhost
      run_once: false
    
    - name: Validate against baseline
      shell: |
        echo '{{ snapshot.stdout }}' | \
        jq '.measurements[] | select(.type=="GPU") | .subtypes[] | 
            select(.subtype=="smi") | .data."driver-version"'
      register: driver_version
      failed_when: driver_version.stdout != '"570.158.01"'
      changed_when: false

- name: Generate Fleet Report
  hosts: localhost
  tasks:
    - name: Download all snapshots
      aws_s3:
        bucket: fleet-snapshots
        mode: list
      register: s3_objects
    
    - name: Aggregate results
      script: scripts/aggregate_snapshots.py
      args:
        snapshots: "{{ s3_objects.s3_keys }}"

Terraform Provisioning:

resource "null_resource" "aicr_snapshot" {
  count = length(var.gpu_instance_ids)
  
  provisioner "remote-exec" {
    inline = [
      "aicr snapshot --format json > /tmp/snapshot.json",
      "aws s3 cp /tmp/snapshot.json s3://fleet-snapshots/${self.id}/"
    ]
    
    connection {
      type        = "ssh"
      host        = element(var.gpu_instance_ips, count.index)
      user        = "ubuntu"
      private_key = file("~/.ssh/id_rsa")
    }
  }
  
  triggers = {
    instance_id = element(var.gpu_instance_ids, count.index)
    timestamp   = timestamp()
  }
}

data "aws_s3_objects" "snapshots" {
  bucket     = "fleet-snapshots"
  depends_on = [null_resource.aicr_snapshot]
}

output "snapshot_count" {
  value = length(data.aws_s3_objects.snapshots.keys)
}

Pattern 4: Real-Time Monitoring

Use Case: Continuous configuration monitoring with Prometheus alerting

Prometheus Exporter (future enhancement):

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "github.com/NVIDIA/aicr/pkg/snapshotter"
)

var (
    gpuDriverVersion = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "aicr_gpu_driver_version",
            Help: "NVIDIA driver version (encoded as float)",
        },
        []string{"node", "gpu_model"},
    )
    
    k8sVersion = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "aicr_k8s_version",
            Help: "Kubernetes version (encoded)",
        },
        []string{"node"},
    )
)

func init() {
    prometheus.MustRegister(gpuDriverVersion, k8sVersion)
}

func collectMetrics() {
    hostname, _ := os.Hostname()
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    
    for range ticker.C {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        snapshot, err := snapshotter.Measure(ctx)
        cancel()
        
        if err != nil {
            log.Printf("Snapshot failed: %v", err)
            continue
        }
        
        // Extract and export the GPU driver version.
        // encodeVersion is a helper (not shown) that maps a version
        // string such as "570.158.01" to a sortable float.
        for _, m := range snapshot.Measurements {
            if m.Type == "GPU" {
                for _, st := range m.Subtypes {
                    if st.Name == "smi" {
                        version := st.Data["driver-version"].Value
                        gpuModel, _ := st.Data["gpu-name"].Value.(string)
                        gpuDriverVersion.WithLabelValues(hostname, gpuModel).Set(encodeVersion(version))
                    }
                    }
                }
            }
        }
    }
}

func main() {
    go collectMetrics()
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}

Prometheus Alerting Rules:

groups:
- name: aicr_configuration
  interval: 60s
  rules:
  - alert: GPUDriverVersionMismatch
    expr: |
      count(count by (aicr_gpu_driver_version) (aicr_gpu_driver_version)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Multiple GPU driver versions detected in cluster"
      description: "{{ $value }} different driver versions found"
  
  - alert: KubernetesVersionSkew
    expr: |
      abs(aicr_k8s_version - scalar(avg(aicr_k8s_version))) > 0.01
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes version skew detected on {{ $labels.node }}"
      description: "Node version differs from cluster average"

Advanced Usage Patterns

Snapshot Diffing with jq

#!/bin/bash
# Capture baseline before changes
aicr snapshot --format json > baseline.json

# Apply configuration changes (Ansible, Terraform, etc.)
# ...

# Capture new snapshot
aicr snapshot --format json > current.json

# Diff specific sections
echo "=== GPU Configuration Changes ==="
diff -u \
  <(jq -S '.measurements[] | select(.type=="GPU")' baseline.json) \
  <(jq -S '.measurements[] | select(.type=="GPU")' current.json)

echo "=== Kernel Parameter Changes ==="
diff -u \
  <(jq -S '.measurements[] | select(.type=="os") | .subtypes[] | 
           select(.subtype=="sysctl")' baseline.json) \
  <(jq -S '.measurements[] | select(.type=="os") | .subtypes[] | 
           select(.subtype=="sysctl")' current.json)

# Count total changes
changes=$(diff <(jq -S . baseline.json) <(jq -S . current.json) | grep -c '^[<>]')
echo "Total configuration changes: $changes"

Recipe Generation Pipeline

#!/bin/bash
# Generate recipes for all supported configurations

set -euo pipefail

OUTPUT_DIR="recipes"
mkdir -p "$OUTPUT_DIR"

# GPU types from NVIDIA product line
GPU_TYPES=("h100" "gb200" "a100" "l40" "l4")

# Kubernetes services
K8S_SERVICES=("eks" "gke" "aks" "self-managed")

# OS distributions
OS_TYPES=("ubuntu" "rhel" "cos")

total=0
for gpu in "${GPU_TYPES[@]}"; do
  for service in "${K8S_SERVICES[@]}"; do
    for os in "${OS_TYPES[@]}"; do
      output="${OUTPUT_DIR}/${os}-${service}-${gpu}.yaml"
      
      # Generate recipe
      if aicr recipe --os "$os" --service "$service" --gpu "$gpu" \
           --format yaml > "$output" 2>/dev/null; then
        echo "✓ Generated $output"
        total=$((total + 1))  # ((total++)) would trip set -e when total is 0
      else
        echo "✗ Failed: $os $service $gpu"
      fi
    done
  done
done

echo "Generated $total recipes"

# Validate all recipes
echo "Validating recipes..."
find "$OUTPUT_DIR" -name '*.yaml' -exec yq eval '.' {} \; > /dev/null
echo "All recipes valid"

# Create index
cat > "$OUTPUT_DIR/README.md" <<EOF
# Configuration Recipes

Generated on $(date -Iseconds)

Total recipes: $total

## Available Configurations

| OS | Service | GPU | File |
|----|---------|-----|------|
EOF

find "$OUTPUT_DIR" -name '*.yaml' -type f | sort | while read -r file; do
  base=$(basename "$file" .yaml)
  IFS='-' read -ra parts <<< "$base"
  echo "| ${parts[0]} | ${parts[1]} | ${parts[2]} | $file |" >> "$OUTPUT_DIR/README.md"
done

Automated Remediation

#!/bin/bash
# Apply recommended configuration from recipe
# WARNING: Modifies system configuration - use with caution

set -euo pipefail

# Capture current state
current=$(aicr snapshot --format json)

# Generate recommended recipe
recipe=$(aicr recipe --os ubuntu --gpu h100 --format json)

# Extract recommended GRUB parameters
recommended_grub=$(echo "$recipe" | jq -r '
  .measurements[] | 
  select(.type=="os") | 
  .subtypes[] | 
  select(.subtype=="grub") | 
  .data | 
  to_entries[] | 
  "\(.key)=\(.value)"' | tr '\n' ' ')

# Extract current GRUB parameters
current_grub=$(echo "$current" | jq -r '
  .measurements[] | 
  select(.type=="os") | 
  .subtypes[] | 
  select(.subtype=="grub") | 
  .data | 
  to_entries[] | 
  "\(.key)=\(.value)"' | tr '\n' ' ')

# Show diff
echo "Current GRUB parameters:"
echo "$current_grub"
echo ""
echo "Recommended GRUB parameters:"
echo "$recommended_grub"
echo ""

# Prompt for confirmation
read -p "Apply changes? (yes/no): " confirm
if [[ "$confirm" != "yes" ]]; then
  echo "Aborted"
  exit 0
fi

# Apply GRUB changes (requires root)
sudo grubby --update-kernel=ALL --args="$recommended_grub"
echo "GRUB configuration updated. Reboot required."

# Apply sysctl changes
echo "$recipe" | jq -r '
  .measurements[] | 
  select(.type=="os") | 
  .subtypes[] | 
  select(.subtype=="sysctl") | 
  .data | 
  to_entries[] | 
  "\(.key) = \(.value)"' | \
sudo tee /etc/sysctl.d/99-aicr-recommended.conf

sudo sysctl --system
echo "Sysctl parameters applied"

# Log changes
echo "$(date -Iseconds): Applied AICR recommendations" | \
sudo tee -a /var/log/aicr-remediation.log

Troubleshooting Guide

Issue: "nvidia-smi not found"

Symptoms: GPU measurements empty, error in logs
Root Cause: NVIDIA driver not installed or not in PATH

Diagnosis:

# Check if nvidia-smi exists
which nvidia-smi
# Expected: /usr/bin/nvidia-smi

# Verify driver installation
nvidia-smi --version
# Expected: NVIDIA-SMI 570.158.01

# Check kernel modules
lsmod | grep nvidia
# Expected: nvidia, nvidia_uvm, nvidia_modeset

# Verify device nodes
ls -l /dev/nvidia*
# Expected: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm

Resolution:

# Ubuntu: Install NVIDIA driver
sudo apt-get update
sudo apt-get install -y nvidia-driver-570

# RHEL: Install from CUDA repo
sudo dnf config-manager --add-repo \
  https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf module install -y nvidia-driver:570

# Verify installation
sudo nvidia-smi

# If PATH issue, add to shell profile
echo 'export PATH="/usr/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Issue: "Kubernetes API server unreachable"

Symptoms: K8s measurements empty, "connection refused" error
Root Cause: Not running in cluster, or kubeconfig missing/invalid

Diagnosis:

# Verify cluster connectivity
kubectl cluster-info
# Expected: Kubernetes control plane is running at https://...

# Check kubeconfig
echo $KUBECONFIG
cat ~/.kube/config

# Test API access
kubectl get nodes
# Expected: List of nodes

# Check service account (in-cluster)
ls -l /var/run/secrets/kubernetes.io/serviceaccount/
# Expected: token, ca.crt, namespace

Resolution:

# Option 1: Set KUBECONFIG explicitly
export KUBECONFIG=~/.kube/config
aicr snapshot

# Option 2: Copy admin kubeconfig
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config

# Option 3: Use service account token (in-cluster)
kubectl create serviceaccount aicr
kubectl create clusterrolebinding aicr --clusterrole=view --serviceaccount=default:aicr

# Option 4: Debug with kubectl proxy
kubectl proxy &
export KUBERNETES_SERVICE_HOST=localhost
export KUBERNETES_SERVICE_PORT=8001
aicr snapshot

Issue: "Snapshot too slow (> 5s)"

Symptoms: Long execution time, timeouts in CI/CD
Root Cause: Large cluster (1000s of pods), slow API server, many GPUs

Diagnosis:

# Enable debug logging to identify slow collectors
aicr --debug snapshot 2>&1 | grep -E 'collector|duration'
# Expected output shows timing per collector:
# time="..." level=debug msg="k8s collector finished" duration=3.2s
# time="..." level=debug msg="gpu collector finished" duration=0.8s

# Check cluster size
kubectl get pods --all-namespaces --no-headers | wc -l
# Large: > 1000 pods

# Check GPU count
nvidia-smi --list-gpus | wc -l
# Many: > 8 GPUs

# Profile execution
time aicr snapshot > /dev/null

Resolution:

# Option 1: Filter to specific collectors (future enhancement)
aicr snapshot --filter gpu,os  # Skip K8s (saves 60-70% time)

# Option 2: Increase timeout (future enhancement)
aicr snapshot --timeout 30s

# Option 3: Use caching for repeated calls
aicr snapshot > /tmp/snapshot.json
# Reuse /tmp/snapshot.json for subsequent analysis

# Option 4: Optimize K8s collector
# Reduce API calls by using label selectors (code change):
# clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
#     LabelSelector: "app=gpu-operator",
# })

# Option 5: Run in parallel with errgroup limit
# Already implemented in code, but can tune:
# g.SetLimit(runtime.NumCPU())  // Current: 2

Issue: "Out of memory during snapshot"

Symptoms: Process killed, OOMKilled in K8s, segfault
Root Cause: Large measurement data (10k+ pods, many images)

Diagnosis:

# Check memory usage during snapshot
/usr/bin/time -v aicr snapshot > /dev/null 2>&1
# Look for "Maximum resident set size"

# Monitor memory in real-time
# Terminal 1:
watch -n 1 'ps aux | grep aicr'
# Terminal 2:
aicr snapshot

# In Kubernetes, check OOMKilled events
kubectl get events --field-selector reason=OOMKilling

Resolution:

# Option 1: Use streaming serialization (already implemented)
# Data never fully materialized in memory
aicr snapshot --format json > snapshot.json

# Option 2: Increase memory limit in Kubernetes
kubectl set resources deployment aicr-agent \
  --limits=memory=1Gi \
  --requests=memory=512Mi

# Option 3: Filter measurements (future enhancement)
aicr snapshot --filter gpu,os  # Exclude large K8s data

# Option 4: Optimize code to reduce allocations
# Use object pooling for repeated structs:
var measurementPool = sync.Pool{
    New: func() interface{} {
        return &measurement.Measurement{}
    },
}

# Option 5: Process in batches (code change needed)
# For K8s pods, paginate API calls:
pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
    Limit: 100,
    Continue: continueToken,
})

Performance Tuning

CPU Profiling

# Build with profiling enabled
go build -o aicr cmd/aicr/main.go

# Capture CPU profile
./aicr snapshot --cpuprofile=cpu.prof

# Analyze profile
go tool pprof cpu.prof
(pprof) top10
# Shows top 10 functions by CPU time

(pprof) list collectContainerImages
# Shows line-by-line CPU usage in specific function

(pprof) web
# Opens interactive graph in browser (requires graphviz)

# Example output interpretation:
# If collectContainerImages is > 50% CPU:
# - Optimize pod iteration
# - Reduce string allocations
# - Cache image parsing results

Memory Profiling

# Capture memory profile
./aicr snapshot --memprofile=mem.prof

# Analyze allocations
go tool pprof -alloc_space mem.prof
(pprof) top10
# Shows top 10 functions by allocations

(pprof) list BuildRecipe
# Check for unnecessary allocations

# Example fixes:
# Before: strings.Split() allocates slice
# After: strings.Index() + slicing avoids allocation

# Before: fmt.Sprintf("%s:%s", name, tag)
# After:  var b strings.Builder; b.WriteString(name); b.WriteString(":"); b.WriteString(tag)

Benchmarking

# Benchmark snapshot performance (10 iterations)
for i in {1..10}; do
  time aicr snapshot --format json > /dev/null
done 2>&1 | grep real | awk '{print $2}' | \
sed 's/0m//' | sed 's/s//' | \
awk '{sum+=$1; count++} END {printf "Average: %.3fs\n", sum/count}'

# Compare formats
echo "JSON:"
time aicr snapshot --format json > /dev/null
echo "YAML:"
time aicr snapshot --format yaml > /dev/null
echo "Table:"
time aicr snapshot --format table > /dev/null

# Expected results:
# JSON:  ~50ms  (fastest, minimal processing)
# YAML:  ~80ms  (indentation overhead)
# Table: ~100ms (string formatting, column alignment)

# Benchmark with different cluster sizes
for pods in 10 100 1000 5000; do
  # Scale test deployment
  kubectl scale deployment test-app --replicas=$pods
  kubectl wait --for=condition=ready pod -l app=test-app --timeout=5m
  
  echo "Cluster with $pods pods:"
  time aicr snapshot --format json > /dev/null
done

Optimization Recommendations

  1. Reduce String Allocations
    Current: fmt.Sprintf("%s:%s", name, tag) allocates
    Optimized: Use strings.Builder for concatenation
    Savings: 20-30% fewer allocations in image collector

  2. Preallocate Slices
    Current: measurements := []Measurement{}
    Optimized: measurements := make([]Measurement, 0, expectedSize)
    Benefit: Avoids slice growth reallocations
    When: Size predictable (e.g., GPU count known)

  3. Pool Large Objects
    Use Case: Measurement structs allocated repeatedly
    Implementation:

    var measurementPool = sync.Pool{
        New: func() interface{} {
            return &measurement.Measurement{}
        },
    }
    
    m := measurementPool.Get().(*measurement.Measurement)
    defer measurementPool.Put(m)

    Reference: sync.Pool

  4. Avoid Reflection
    Current: encoding/json uses reflection
    Optimized: Code-generated marshaling with easyjson
    Benefit: 2-3x faster JSON serialization
    Trade-off: Build complexity vs performance
    Reference: easyjson

  5. Batch API Operations
    Current: Multiple API calls per collector
    Optimized: Aggregate calls where possible
    Example: List all pods once, filter in memory
    Benefit: Reduces API server load, faster execution

  6. Concurrent Collectors
    Current: errgroup with limit
    Tuning: Adjust limit based on collector type

    g.SetLimit(runtime.NumCPU())  // CPU-bound collectors
    g.SetLimit(runtime.NumCPU() * 2)  // I/O-bound collectors

    Reference: errgroup SetLimit
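Recommendations 1 and 2 combined in a runnable sketch (the imageRef helpers are illustrative, not actual aicr functions); both produce identical output, the Builder variant just allocates less:

```go
package main

import (
	"fmt"
	"strings"
)

// imageRefSprintf is the "before": fmt.Sprintf parses a format string
// and allocates on every call.
func imageRefSprintf(name, tag string) string {
	return fmt.Sprintf("%s:%s", name, tag)
}

// imageRefBuilder is the "after": one pre-sized buffer, no format parsing.
func imageRefBuilder(name, tag string) string {
	var b strings.Builder
	b.Grow(len(name) + 1 + len(tag)) // preallocation (recommendation 2)
	b.WriteString(name)
	b.WriteByte(':')
	b.WriteString(tag)
	return b.String()
}

func main() {
	ref := imageRefBuilder("nvcr.io/nvidia/gpu-operator", "v25.3.0")
	fmt.Println(ref)
	fmt.Println(ref == imageRefSprintf("nvcr.io/nvidia/gpu-operator", "v25.3.0"))
}
```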

Security Best Practices

Running as Non-Root

CLI:

# CLI runs as current user (no special privileges needed)
aicr snapshot  # Works as non-root

# Verify no setuid/setgid
ls -l $(which aicr)
# Expected: -rwxr-xr-x (not -rwsr-xr-x)

# Verify no capabilities
getcap $(which aicr)
# Expected: (no output)

Kubernetes Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:latest
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}

Secrets Management

# Never log sensitive data
# aicr already filters passwords/tokens from output

# Verify no secrets in snapshot
aicr snapshot --format json | \
  jq '.measurements[].subtypes[].data | 
      keys | map(select(test("(?i)(password|token|key|secret)"))) | 
      unique'
# Expected: []

# Use environment variables for API credentials (future feature)
export AICR_API_TOKEN=$(vault kv get -field=token secret/aicr)
aicr recipe --os ubuntu --gpu h100

# Or use Kubernetes secrets
kubectl create secret generic aicr-api-creds \
  --from-literal=token=$(vault kv get -field=token secret/aicr)

# Mount in pod:
volumeMounts:
- name: api-creds
  mountPath: /var/run/secrets/aicr
  readOnly: true
volumes:
- name: api-creds
  secret:
    secretName: aicr-api-creds

Input Validation

CLI validates all inputs before processing:

# Invalid OS type
aicr recipe --os invalid_os
# Error: invalid os type "invalid_os", must be one of: ubuntu, rhel, cos

# Invalid version format
aicr recipe --osv -1.0
# Error: invalid version "-1.0": negative version components not allowed

# Invalid GPU type
aicr recipe --gpu h100@latest
# Error: invalid gpu type "h100@latest": special characters not allowed

# Invalid format
aicr snapshot --format xml
# Error: invalid format "xml", must be one of: json, yaml, table

# Path traversal prevention
aicr snapshot --output ../../etc/passwd
# Error: output path escapes current directory

# Verify validation in code:
# pkg/cli/recipe.go:
if !isValidOS(os) {
    return fmt.Errorf("invalid os type %q", os)
}
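
A minimal sketch of these checks (helper names follow the pkg/cli/recipe.go fragment; the exact rules in aicr may differ):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

var validOS = map[string]bool{"ubuntu": true, "rhel": true, "cos": true}

func isValidOS(os string) bool { return validOS[os] }

// validateOutputPath rejects absolute paths and paths that escape the
// current directory, as in the ../../etc/passwd example above.
func validateOutputPath(p string) error {
	clean := filepath.Clean(p)
	if filepath.IsAbs(clean) || strings.HasPrefix(clean, "..") {
		return fmt.Errorf("output path escapes current directory")
	}
	return nil
}

func main() {
	fmt.Println(isValidOS("ubuntu"))                           // true
	fmt.Println(validateOutputPath("../../etc/passwd") != nil) // true
}
```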

Network Security

# Verify TLS for API calls (future feature)
aicr recipe --os ubuntu --gpu h100 --debug 2>&1 | grep -i tls
# Expected: "Using TLS 1.3"

# Certificate pinning (future enhancement)
export AICR_API_CERT_FINGERPRINT="sha256:abc123..."
aicr recipe --os ubuntu --gpu h100

# Use corporate proxy with authentication
export HTTPS_PROXY=https://proxy.corp.com:8080
export AICR_PROXY_CA_CERT=/etc/ssl/certs/corp-ca.pem
aicr recipe --os ubuntu --gpu h100

Bundle Command: Deployment Artifact Generation

The bundle command generates deployment-ready bundles from configuration recipes. It transforms recipe data into complete deployment artifacts including Helm charts, Kubernetes manifests, installation scripts, and documentation.

Overview

Purpose: Convert AICR recipes into deployment-ready bundles containing:

  • Helm Values: Chart configuration with version management
  • Kubernetes Manifests: ClusterPolicy and custom resources
  • Scripts: Installation and uninstallation automation
  • Documentation: Deployment instructions and verification steps
  • Checksums: SHA256 verification for all generated files

Key Features:

  • ✅ Registry-based bundler framework - pluggable implementations
  • ✅ Parallel generation - fast bundle creation with errgroup
  • ✅ Template system - embedded templates with go:embed
  • ✅ Functional options - flexible configuration
  • ✅ Type safety - compile-time bundler type checking
  • ✅ Metrics - Prometheus observability

Command Flow

flowchart TD
    A[User Invocation] --> B[Parse Flags]
    B --> C{Recipe or Snapshot?}
    C -->|Recipe| D[Load Recipe]
    C -->|Snapshot| E[Generate Recipe<br/>from Snapshot]
    E --> D
    D --> F[Get Bundlers<br/>from Registry]
    F --> G[Validate Recipe]
    G --> H[Parallel Bundle<br/>Generation]
    H --> I[Write Output]
    I --> J[Display Summary]
    
    style H fill:#ffeb3b

Usage Examples

# Generate GPU Operator bundle from recipe
aicr bundle --recipe recipe.yaml --output ./bundles

# Generate from snapshot with workload intent
aicr bundle --snapshot system.yaml --intent training --output ./bundles

# Specify bundler types explicitly
aicr bundle --recipe recipe.yaml --bundler gpu-operator --output ./bundles

Bundler Framework Architecture

Component Diagram

flowchart TD
    subgraph "Bundler Framework"
        REG[Registry] --> FAC[Factory Functions]
        FAC --> B1[GPU Operator Bundler]
        FAC --> B2[Network Operator Bundler]
        FAC --> BN[Custom Bundlers...]
    end
    
    subgraph "Bundle Generation"
        MAKE[Make Function] --> VAL[Validate Recipe]
        VAL --> PAR[Parallel Execution]
        
        PAR --> G1[Generate Helm Values]
        PAR --> G2[Generate Manifests]
        PAR --> G3[Generate Scripts]
        PAR --> G4[Generate README]
        PAR --> G5[Generate Checksums]
        
        G1 --> RES[Bundle Result]
        G2 --> RES
        G3 --> RES
        G4 --> RES
        G5 --> RES
    end
    
    RECIPE[Recipe] --> MAKE
    RES --> OUTPUT[Output Directory]

Data Flow

sequenceDiagram
    participant CLI
    participant Bundle
    participant Bundler
    participant Template
    participant FileSystem
    
    CLI->>Bundle: Execute(recipe, bundlers, outputDir)
    Bundle->>Bundle: Validate recipe structure
    Bundle->>Bundle: Create output directory
    
    par Parallel Bundle Generation
        Bundle->>Bundler: Bundler1.Make()
        Bundle->>Bundler: Bundler2.Make()
    end
    
    Bundler->>Bundler: Extract data from recipe
    Bundler->>Bundler: Build config map
    
    par Parallel File Generation
        Bundler->>Template: Render values.yaml
        Bundler->>Template: Render clusterpolicy.yaml
        Bundler->>Template: Render install.sh
        Bundler->>Template: Render README.md
    end
    
    Template-->>Bundler: Rendered content
    
    Bundler->>FileSystem: Write files
    FileSystem-->>Bundler: File paths
    
    Bundler->>Bundler: Compute checksums
    Bundler-->>Bundle: BundleResult
    Bundle-->>CLI: BundleOutput

Design Patterns

1. Registry Pattern

Problem: How to support multiple bundler implementations without tight coupling?

Solution: Global registry with factory functions for bundler instantiation.

// pkg/bundler/registry.go
var (
    registry = make(map[BundleType]BundlerFactory)
    mu       sync.RWMutex
)

type BundlerFactory func() Bundler

// Register adds a bundler factory to the registry
func Register(bundleType BundleType, factory BundlerFactory) {
    mu.Lock()
    defer mu.Unlock()
    registry[bundleType] = factory
}

// GetBundlers returns bundlers for specified types
func GetBundlers(types ...BundleType) []Bundler {
    mu.RLock()
    defer mu.RUnlock()
    
    var bundlers []Bundler
    for _, t := range types {
        if factory, ok := registry[t]; ok {
            bundlers = append(bundlers, factory())
        }
    }
    return bundlers
}

Benefits:

  • ✅ Decoupled registration - bundlers self-register via init()
  • ✅ Runtime extensibility - add bundlers without modifying core
  • ✅ Declarative configuration - components defined in registry.yaml
  • ✅ Thread-safe - RWMutex protects concurrent access
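
A bundler package makes itself available simply by registering in init(), so importing the package is enough; a self-contained sketch with the interface reduced to one method:

```go
package main

import (
	"fmt"
	"sync"
)

type BundleType string

// Bundler is reduced to a single method for this sketch.
type Bundler interface{ Name() string }

type BundlerFactory func() Bundler

var (
	registry = make(map[BundleType]BundlerFactory)
	mu       sync.RWMutex
)

// Register adds a bundler factory under the given type.
func Register(t BundleType, f BundlerFactory) {
	mu.Lock()
	defer mu.Unlock()
	registry[t] = f
}

// GetBundlers instantiates bundlers for the requested types.
func GetBundlers(types ...BundleType) []Bundler {
	mu.RLock()
	defer mu.RUnlock()
	var out []Bundler
	for _, t := range types {
		if f, ok := registry[t]; ok {
			out = append(out, f())
		}
	}
	return out
}

type gpuOperatorBundler struct{}

func (gpuOperatorBundler) Name() string { return "gpu-operator" }

// init runs at package load time, so the bundler self-registers
// without the core ever referencing it directly.
func init() {
	Register("gpu-operator", func() Bundler { return gpuOperatorBundler{} })
}

func main() {
	for _, b := range GetBundlers("gpu-operator") {
		fmt.Println(b.Name())
	}
}
```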

2. Functional Options Pattern

Problem: How to configure bundlers without breaking API compatibility?

Solution: Variadic option functions for flexible configuration.

// Configuration options
type Option func(*Bundler)

func WithNamespace(ns string) Option {
    return func(b *Bundler) {
        b.config.Namespace = ns
    }
}

func WithFailFast(failFast bool) Option {
    return func(b *Bundler) {
        b.config.FailFast = failFast
    }
}

// Constructor with options
func NewBundler(opts ...Option) *Bundler {
    b := &Bundler{
        config: DefaultBundlerConfig(),
    }
    for _, opt := range opts {
        opt(b)
    }
    return b
}
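
Putting the options together, construction reads declaratively; a runnable sketch (the DefaultBundlerConfig field values here are assumptions for illustration):

```go
package main

import "fmt"

type BundlerConfig struct {
	Namespace string
	FailFast  bool
}

// DefaultBundlerConfig supplies baseline values; unspecified options
// keep these defaults. (Values here are illustrative.)
func DefaultBundlerConfig() BundlerConfig {
	return BundlerConfig{Namespace: "gpu-operator", FailFast: false}
}

type Bundler struct{ config BundlerConfig }

type Option func(*Bundler)

func WithNamespace(ns string) Option { return func(b *Bundler) { b.config.Namespace = ns } }
func WithFailFast(ff bool) Option    { return func(b *Bundler) { b.config.FailFast = ff } }

func NewBundler(opts ...Option) *Bundler {
	b := &Bundler{config: DefaultBundlerConfig()}
	for _, o := range opts {
		o(b)
	}
	return b
}

func main() {
	b := NewBundler(WithNamespace("nvidia-system"), WithFailFast(true))
	fmt.Println(b.config.Namespace, b.config.FailFast) // nvidia-system true
}
```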

3. Template-Based Generation

Problem: How to separate content structure from generation logic?

Solution: Embedded text templates with data-driven rendering.

//go:embed templates/*.tmpl
var templatesFS embed.FS

func (b *Bundler) renderTemplate(name string, 
    data map[string]interface{}) (string, error) {
    
    tmpl, err := template.New(name).
        Funcs(templateFuncs()).
        ParseFS(templatesFS, "templates/"+name+".tmpl")
    if err != nil {
        return "", fmt.Errorf("failed to parse template: %w", err)
    }
    
    var buf bytes.Buffer
    if err := tmpl.Execute(&buf, data); err != nil {
        return "", fmt.Errorf("failed to execute template: %w", err)
    }
    
    return buf.String(), nil
}
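
The same flow can be exercised with an inline template string, runnable without the embedded filesystem (renderInline is a hypothetical helper for illustration):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderInline mirrors renderTemplate above but parses the template from
// a string instead of an embedded filesystem.
func renderInline(name, text string, data map[string]interface{}) (string, error) {
	tmpl, err := template.New(name).Parse(text)
	if err != nil {
		return "", fmt.Errorf("failed to parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("failed to execute template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	out, err := renderInline("values",
		"driver:\n  version: {{ .DriverVersion }}\n",
		map[string]interface{}{"DriverVersion": "580.82.07"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```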

GPU Operator Bundler

The GPU Operator bundler generates a complete deployment bundle for NVIDIA GPU Operator, extracting configuration from recipe measurements.

Recipe Data Extraction

K8s Measurements (measurement.TypeK8s):

  1. Image Subtype - Component versions:

    - subtype: image
      data:
        gpu-operator: v25.3.3
        driver: 580.82.07
        container-toolkit: v1.17.8
        k8s-device-plugin: v0.17.4
        dcgm: 4.3.1-1
        dcgm-exporter: 4.3.1
  2. Config Subtype - Boolean flags:

    - subtype: config
      data:
        cdi: true
        mig: false
        rdma: true
        useOpenKernelModule: true

GPU Measurements (measurement.TypeGPU):

- subtype: smi
  data:
    driver-version: 580.82.07
    cuda-version: "13.1"

Generated Bundle Structure

gpu-operator/
├── values.yaml                    # Helm chart configuration
├── manifests/
│   └── clusterpolicy.yaml        # ClusterPolicy custom resource
├── scripts/
│   ├── install.sh                # Installation automation
│   └── uninstall.sh              # Cleanup automation
├── README.md                      # Deployment instructions
└── checksums.txt                  # SHA256 verification
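
The checksums.txt entries can be produced with a standard SHA-256 pass in sha256sum-compatible format, so the bundle verifies with `sha256sum -c checksums.txt` (a sketch, not the exact aicr code):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// checksumLine formats one checksums.txt entry as "<hex-digest>  <path>",
// the two-space layout sha256sum expects.
func checksumLine(path string, content []byte) string {
	sum := sha256.Sum256(content)
	return fmt.Sprintf("%x  %s", sum, path)
}

func main() {
	fmt.Println(checksumLine("values.yaml", []byte("driver:\n  version: 580.82.07\n")))
}
```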

Template Files

values.yaml.tmpl - Helm chart values:

# Generated: {{ .Timestamp }}
# GPU Operator Helm Values

operator:
  version: {{ .GPUOperatorVersion }}

driver:
  enabled: {{ .EnableDriver }}
  version: {{ .DriverVersion }}
  useOpenKernelModule: {{ .UseOpenKernelModule }}
  repository: {{ .DriverRegistry }}

toolkit:
  version: {{ .NvidiaContainerToolkitVersion }}

devicePlugin:
  version: {{ .DevicePluginVersion }}

dcgm:
  version: {{ .DCGMVersion }}

dcgmExporter:
  version: {{ .DCGMExporterVersion }}

mig:
  strategy: {{ .MIGStrategy }}

gds:
  enabled: {{ .EnableGDS }}

install.sh.tmpl - Installation script:

#!/bin/bash
# Generated: {{ .Timestamp }}
# GPU Operator Installation Script

set -euo pipefail

NAMESPACE="{{ .Namespace }}"
HELM_REPO="{{ .HelmRepository }}"
HELM_CHART="{{ .HelmChart }}"

echo "Adding Helm repository..."
helm repo add nvidia "$HELM_REPO"
helm repo update

echo "Installing GPU Operator..."
helm install gpu-operator nvidia/gpu-operator \
  --namespace "$NAMESPACE" \
  --create-namespace \
  --values values.yaml \
  --wait

echo "Applying ClusterPolicy..."
kubectl apply -f manifests/clusterpolicy.yaml

echo "Installation complete!"

Observability

Metrics

Prometheus metrics exposed by bundler framework:

# Duration histogram
bundler_make_duration_seconds{bundler_type="gpu-operator"} 0.245

# Total operations counter
bundler_make_total{bundler_type="gpu-operator",result="success"} 42
bundler_make_total{bundler_type="gpu-operator",result="error"} 3

# Files generated counter
bundler_files_generated_total{bundler_type="gpu-operator"} 6

# Bytes generated counter
bundler_bytes_generated_total{bundler_type="gpu-operator"} 15360

# Validation failures counter
bundler_validation_failures_total{bundler_type="gpu-operator"} 2

Structured Logging

slog integration for structured log output:

// Bundle generation start
slog.Debug("generating bundle",
    "bundler_type", bundlerType,
    "output_dir", outputDir,
)

// Bundle generation complete
slog.Debug("bundle generated successfully",
    "bundler_type", bundlerType,
    "files", len(result.Files),
    "bytes", result.TotalBytes,
    "duration", result.Duration,
)

Adding New Components

Adding a new component requires no Go code. Components are configured declaratively:

Step-by-Step Guide

  1. Add to Component Registry (recipes/registry.yaml):

    components:
      - name: my-operator
        displayName: My Operator
        valueOverrideKeys:
          - myoperator
        helm:
          defaultRepository: https://charts.example.com
          defaultChart: example/my-operator
          defaultVersion: v1.0.0
        nodeScheduling:
          system:
            nodeSelectorPaths:
              - operator.nodeSelector
            tolerationPaths:
              - operator.tolerations
  2. Create Values File (recipes/components/my-operator/values.yaml):

    # My Operator Helm values
    operator:
      replicas: 1
      image:
        repository: example/my-operator
        tag: v1.0.0
  3. Add to Recipe Overlay (recipes/overlays/<overlay>.yaml):

    componentRefs:
      - name: my-operator
        type: Helm
        version: v1.0.0
        source: https://charts.example.com
        valuesFile: components/my-operator/values.yaml
  4. Test the Component:

    # Generate recipe with new component
    aicr recipe --service eks --accelerator h100 -o recipe.yaml
    
    # Generate bundle
    aicr bundle -r recipe.yaml -o ./bundles
    
    # Verify output
    cat ./bundles/values.yaml

See Bundler Development Guide for detailed documentation.

Best Practices

Template Design:

  • ✅ Keep templates simple and focused
  • ✅ Use descriptive variable names
  • ✅ Add comments for complex logic
  • ✅ Validate template rendering in tests
  • ❌ Don't put business logic in templates

Error Handling:

  • ✅ Use structured errors with context
  • ✅ Wrap errors with meaningful messages
  • ✅ Validate early (before starting generation)
  • ✅ Clean up resources on error
  • ❌ Don't swallow errors silently

Testing:

  • ✅ Test with realistic recipe data
  • ✅ Use table-driven tests for coverage
  • ✅ Test error paths explicitly
  • ✅ Verify generated file content
  • ❌ Don't skip integration tests

Performance:

  • ✅ Use parallel generation for multiple files
  • ✅ Stream large files instead of buffering
  • ✅ Reuse template instances when possible
  • ✅ Profile bundle generation for bottlenecks
  • ❌ Don't generate synchronously without reason

Deployer Framework: GitOps Integration

The bundle command integrates with GitOps tools through the Deployer Framework, which generates deployment-specific artifacts alongside the standard bundle files.

Overview

Purpose: Generate GitOps-ready deployment artifacts that integrate with popular continuous delivery tools.

Supported Deployers:

| Type | Description | Output |
|------|-------------|--------|
| `helm` (Default) | Helm per-component bundle | `deploy.sh`, `<component>/values.yaml`, `<component>/README.md` |
| `argocd` | ArgoCD Application manifests | `app-of-apps.yaml`, `<component>/application.yaml` |

Key Feature: Deployment Order

All deployers respect the deploymentOrder field from the recipe, ensuring components are installed in the correct sequence:

# Recipe excerpt
deploymentOrder:
  - gpu-operator      # First
  - network-operator  # Second
  - nvsentinel        # Third
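
For the ArgoCD deployer described below, this list maps directly to sync-wave indices; a simplified sketch of the mapping (the real deployer also handles components missing from the list):

```go
package main

import "fmt"

// syncWaves assigns each component its position in deploymentOrder, the
// value the ArgoCD deployer writes into the
// argocd.argoproj.io/sync-wave annotation.
func syncWaves(order []string) map[string]string {
	waves := make(map[string]string, len(order))
	for i, name := range order {
		waves[name] = fmt.Sprintf("%d", i)
	}
	return waves
}

func main() {
	w := syncWaves([]string{"gpu-operator", "network-operator", "nvsentinel"})
	fmt.Println(w["gpu-operator"], w["network-operator"], w["nvsentinel"]) // 0 1 2
}
```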

Deployer Architecture

flowchart TD
    A[Bundle Command] --> B[Parse --deployer flag]
    B --> C{Deployer Type}
    C -->|helm| D[Helm Deployer]
    C -->|argocd| E[ArgoCD Deployer]
    
    D --> G[Generate Per-Component Bundle]
    E --> H[Generate Applications]
    
    G --> J[Output: deploy.sh + per-component values.yaml]
    H --> K[Output: sync-wave annotations]
    
    J --> M[Bundle Output]
    K --> M

ArgoCD Deployer

Generates ArgoCD Application manifests with proper sync ordering using multi-source Applications.

Ordering Mechanism: Uses argocd.argoproj.io/sync-wave annotation.

# gpu-operator/argocd/application.yaml (sync-wave: 0 = first)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  sources:
    # Helm chart from upstream
    - repoURL: https://helm.ngc.nvidia.com/nvidia
      chart: gpu-operator
      targetRevision: v25.3.3
      helm:
        valueFiles:
          - $values/gpu-operator/values.yaml
    # Values from GitOps repo
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      ref: values
    # Additional manifests (if present)
    - repoURL: <YOUR_GIT_REPO>
      targetRevision: main
      path: gpu-operator/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Output Structure:

bundles/
├── app-of-apps.yaml               # Parent Application (bundle root)
├── recipe.yaml                    # Recipe used to generate bundle
├── gpu-operator/
│   ├── values.yaml
│   ├── manifests/
│   └── argocd/
│       └── application.yaml       # sync-wave: 0
├── network-operator/
│   ├── values.yaml
│   └── argocd/
│       └── application.yaml       # sync-wave: 1
├── nvsentinel/
│   ├── values.yaml
│   └── argocd/
│       └── application.yaml       # sync-wave: 2
└── README.md                      # ArgoCD deployment guide

Helm Deployer (Default)

Generates a Helm per-component bundle with individual component directories.

Ordering Mechanism: The top-level deploy.sh script installs components sequentially, following the recipe's deploymentOrder.

Output Structure:

bundles/
├── gpu-operator/
│   ├── values.yaml      # Component-specific Helm values
│   ├── scripts/
│   │   └── install.sh   # Installation script
│   ├── README.md        # Deployment instructions
│   └── checksums.txt    # SHA256 checksums
├── recipe.yaml          # Input recipe reference
└── deploy.sh            # Top-level deployment script

Deployer Data Flow

sequenceDiagram
    participant CLI
    participant Bundler
    participant Deployer
    participant Template
    participant FileSystem
    
    CLI->>Bundler: bundle --deployer argocd
    Bundler->>Bundler: Generate component bundles
    Bundler->>Deployer: Generate(recipeResult, bundleDir)
    Deployer->>Deployer: orderComponentsByDeployment()
    
    loop For each component (in order)
        Deployer->>Template: Render with order metadata
        Template-->>Deployer: Rendered manifest
        Deployer->>FileSystem: Write file
    end
    
    Deployer-->>Bundler: Artifacts result
    Bundler-->>CLI: Bundle output

Usage Examples

# Default: Helm per-component bundle
aicr bundle -r recipe.yaml -o ./bundles

# Generate bundle with ArgoCD Applications
aicr bundle -r recipe.yaml --deployer argocd -o ./bundles

# ArgoCD with Git repository URL (sets repoURL in app-of-apps.yaml)
aicr bundle -r recipe.yaml --deployer argocd \
  --repo https://github.com/my-org/my-gitops-repo.git \
  -o ./bundles

Deployment Order Implementation

The orderComponentsByDeployment function ensures components are processed in the correct sequence:

// orderComponentsByDeployment sorts components according to deploymentOrder.
// Components not in deploymentOrder are appended at the end in their original order.
func orderComponentsByDeployment(components []recipe.ComponentRef, 
    order []string) []recipe.ComponentRef {
    
    if len(order) == 0 {
        return components
    }
    
    orderMap := make(map[string]int)
    for i, name := range order {
        orderMap[name] = i
    }
    
    // Separate ordered and unordered components
    ordered := make([]recipe.ComponentRef, 0)
    unordered := make([]recipe.ComponentRef, 0)
    
    for _, c := range components {
        if _, exists := orderMap[c.Name]; exists {
            ordered = append(ordered, c)
        } else {
            unordered = append(unordered, c)
        }
    }
    
    // Sort ordered components by their position in deploymentOrder
    sort.SliceStable(ordered, func(i, j int) bool {
        return orderMap[ordered[i].Name] < orderMap[ordered[j].Name]
    })
    
    return append(ordered, unordered...)
}

Testing Deployers

Each deployer has tests verifying deployment order correctness:

func TestDeployer_Generate_DeploymentOrder(t *testing.T) {
    ctx := context.Background()
    tmpDir := t.TempDir()

    recipeResult := &recipe.RecipeResult{
        DeploymentOrder: []string{"gpu-operator", "network-operator"},
        ComponentRefs: []recipe.ComponentRef{
            {Name: "network-operator", Version: "v25.4.0"},
            {Name: "gpu-operator", Version: "v25.3.3"},
        },
    }
    
    d := NewDeployer()
    artifacts, err := d.Generate(ctx, recipeResult, tmpDir)
    require.NoError(t, err)
    
    // Verify ordering mechanism (sync-wave/dependsOn/README order)
    // ...
}

References

Official Documentation

Kubernetes Integration

NVIDIA Tools

Best Practices

Security