Releases: aws/sagemaker-hyperpod-cli

v3.7.1

08 Apr 23:19
7974959

New Instance Type Support

  • Add g7e instance types to HyperPod helm chart values (nvidia/EFA device plugins) (#380)
  • Add g7e instance types to Python constants and CLI (#385, #390)
  • Add g7e instance types to health-monitoring-agent node affinity (#381)
  • Add B300 MIG profiles to GPU operator ConfigMap (#396)
  • Add MIG profile support for ml.p6-b300.48xlarge (Blackwell Ultra) (#398)

Inference Operator

  • CRD updates: BYO certificate, RequestLimitsConfig, Custom Kubernetes support (#402)
  • Bump hyperpod-inference-operator subchart to v2.1.0 with image tag v3.1 (#402)

Enhancements

  • Support AWS_REGION env var, cluster context fallback, centralize boto3 client creation (#395)
  • Handle pagination in cluster stack listing (#394)
  • Require --instance-type when specifying accelerator resources (#393)
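The region fallback and pagination items above can be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: `resolve_region`, `list_all_stacks`, the `fetch_page` callable, and the precedence order (explicit value, then `AWS_REGION`, then cluster context) are assumptions made for the example. Only the `AWS_REGION` variable name and the `NextToken` convention of paginated AWS list calls are taken as given.

```python
import os
from typing import Callable, Iterator, Mapping, Optional


def resolve_region(explicit: Optional[str] = None,
                   context_region: Optional[str] = None,
                   env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a region: explicit argument, then AWS_REGION, then cluster context.

    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    for candidate in (explicit, env.get("AWS_REGION"), context_region):
        if candidate:
            return candidate
    raise ValueError("No region found: pass one explicitly or set AWS_REGION")


def list_all_stacks(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every stack summary by following NextToken until it is absent.

    fetch_page is a hypothetical callable wrapping one paginated list call:
    it takes the previous NextToken (or None) and returns one response page.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("StackSummaries", [])
        token = page.get("NextToken")
        if not token:
            break
```

Passing the page fetcher in as a callable keeps the loop testable without AWS credentials. With boto3, a client's `get_paginator` method provides the same token-following behavior for paginated operations.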

Bug Fixes

  • Fix EFA field naming in PyTorch job template v1.1: efa_interfaces -> efa, efa_interfaces_limit -> efa_limit (#392)
  • Fix deep health check nodeSelector label to sagemaker.amazonaws.com/deep-health-check-status: Passed (#386)
  • Remove non-EFA instance types from EFA device plugin nodeAffinity to prevent CrashLoopBackOff (#389)
  • Add missing instance types and fix EFA/memory resource specs (#385)
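For anyone holding local job configurations written against the old field names, the EFA rename above could be applied with a small helper like the one below. This is a sketch under assumptions: `migrate_efa_fields` is a hypothetical function, the configuration is assumed to be a flat dictionary, and the release notes do not say whether the CLI performs any such migration itself. Only the two old-to-new name pairs come from the release note.

```python
from typing import Any, Dict

# Old -> new field names, as listed in the release note for template v1.1.
EFA_FIELD_RENAMES = {
    "efa_interfaces": "efa",
    "efa_interfaces_limit": "efa_limit",
}


def migrate_efa_fields(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with the old EFA keys renamed.

    If both the old and the new key are present, the new key's value wins.
    """
    # Copy everything except the old keys, then re-add them under new names.
    migrated = {k: v for k, v in config.items() if k not in EFA_FIELD_RENAMES}
    for old, new in EFA_FIELD_RENAMES.items():
        if old in config:
            migrated.setdefault(new, config[old])
    return migrated
```

The helper returns a new dictionary rather than mutating its input, so the original configuration stays available for comparison.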

Health Monitoring Agent

  • Release Health Monitoring Agent 1.0.1434.0_1.0.388.0 (#388)

v3.7.0

02 Mar 23:49
49baa69

v3.7.0 (2026-03-02)

Space CLI

  • Added list all functionality and documentation updates
  • Disabled traceback for cleaner error output

Inference Operator

  • Inference Operator AddOn with NodeAffinity support and version 3.0 update
  • Updated hyperpod-inference-operator to version 2.0.0 in HyperPodHelmChart
  • Added AddOn migration script and README

Enhancements

Monitoring & Observability

  • Emit metrics for CLI commands

Testing & Validation

  • Added unit tests for inference CRDs
  • Added CRD format check for inference

Dependencies & Versions

  • Updated GPU operator container toolkit version
  • Updated aws-efa-k8s-device-plugin version to 0.5.20

Configuration

  • Instance types CRD changes

Bug Fixes

  • Fixed syntax error in inferenceendpointconfigs by removing tab

v3.6.0

27 Jan 22:36
2a46ebd

Features

  • Add EFA support in manifest for training jobs (#345)
  • Add end-to-end example documentation (#350)
  • Add 4 new HyperPod GA regions (ca-central-1, ap-southeast-3, ap-southeast-4, eu-south-2) (#360)

Enhancements

  • Update documentation for elastic training arguments (#343)
  • Upgrade Inference Operator helm chart (#346)
  • Update MIG config for GPU operator (#358)
  • Release Health Monitoring Agent 1.0.1249.0_1.0.359.0 with enhanced Nvidia timeout analysis and bug fixes (#361)

Bug Fixes

  • Fix canary test failures for GPU quota allocation integration tests (#356)
  • Fix region fallback logic for health-monitoring-agent image URIs (#360)
  • Remove command flag from init pytorch job integration test (#351)
  • Skip expensive integration tests to improve CI performance (#355)

Elastic Training Support for ReInvent Keynote 3

03 Dec 18:07
c64811d

  • Added new command-line arguments to the HyperPodTrainingOperator to support elastic training capabilities
    • --elastic-replica-increment-step, --max-node-count, --elastic-graceful-shutdown-timeout-in-seconds, --elastic-scaling-timeout-in-seconds, --elastic-scale-up-snooze-time-in-seconds, --elastic-replica-discrete-values
  • Enables dynamic scaling of compute resources during training operations

Hyperpod CLI V2 with Nova recipe support

02 Dec 12:36
9774c58

Hyperpod CLI V2 with Nova recipe support

Parker CLI, Fractional GPU Feature

21 Nov 09:11
0eba08f

  • Added hp-devspace command set for ML dev environments
    • New commands for dev space management: create, list, get, update, delete, geturl
    • Support for namespace-based auth and resource isolation
    • Added auth and get-config commands to check permissions and view default settings
  • Users can request partial GPU resources using MIG profiles instead of full GPUs
    • Added --accelerator-partition-type, --accelerator-partition-count, --accelerator-partition-limit
    • New list-accelerator-partition-type command to view available GPU partitions for instance types

v3.3.1

30 Oct 20:47
7233490

Features

  • Describe cluster command
    • Users can run hyp describe cluster to get more information about HyperPod clusters
  • Jinja template handling logic for inference and training
    • Users can modify the Jinja template during the init experience for inference and training to add parameters supported by the CRD, for further CLI customization
  • Cluster creation template versioning
    • Users can choose the CloudFormation template version during the cluster creation experience
  • KVCache and intelligent routing for HyperPod Inference
    • The supported InferenceEndpointConfig CRD is updated to v1
    • KVCache and Intelligent Routing support is added in template version 1.1

init experience Launch

24 Sep 19:51

Features

  • Init Experience
    • Init, Validate, and Create JumpStart endpoint, Custom endpoint, and PyTorch Training Job with local configuration
  • Cluster management
    • Bug fixes for cluster creation

Bug fixes

10 Sep 19:23
162fb79

Features

  • Fix for production canary failures caused by bad training job template.
  • New version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes.

Bug Fixes

28 Aug 00:14
5a346e8

  • Bug Fixes in cluster creation