Feature: Sum all non-root EBS volumes for ephemeral-storage capacity calculation #8946

@harshad3339

Description

What problem are you trying to solve?
We run Bottlerocket nodes with multiple EBS volumes that a bootstrap container aggregates into a RAID0 array. Karpenter's ephemeralStorage() function reports the size of only a single block device on NodeClaims, so the reported capacity does not reflect the storage actually available to pods once our userdata has aggregated the volumes.

Code reference:

func ephemeralStorage(info ec2types.InstanceTypeInfo, amiFamily amifamily.AMIFamily, blockDeviceMappings []*v1.BlockDeviceMapping, instanceStorePolicy *v1.InstanceStorePolicy) *resource.Quantity {
	// If local store disks have been configured for node ephemeral-storage, use the total size of the disks.
	if lo.FromPtr(instanceStorePolicy) == v1.InstanceStorePolicyRAID0 {
		if info.InstanceStorageInfo != nil && info.InstanceStorageInfo.TotalSizeInGB != nil {
			return resources.Quantity(fmt.Sprintf("%dG", *info.InstanceStorageInfo.TotalSizeInGB))
		}
	}
	if len(blockDeviceMappings) != 0 {
		// First check if there's a root volume configured in blockDeviceMappings.
		if blockDeviceMapping, ok := lo.Find(blockDeviceMappings, func(bdm *v1.BlockDeviceMapping) bool {
			return bdm.RootVolume
		}); ok && blockDeviceMapping.EBS.VolumeSize != nil {
			return blockDeviceMapping.EBS.VolumeSize
		}
		switch amiFamily.(type) {
		case *amifamily.Custom:
			// We can't know if a custom AMI is going to have a volume size.
			volumeSize := blockDeviceMappings[len(blockDeviceMappings)-1].EBS.VolumeSize
			return lo.Ternary(volumeSize != nil, volumeSize, amifamily.DefaultEBS.VolumeSize)
		default:
			// If a block device mapping exists in the provider for the root volume, use the volume size specified in the provider. If not, use the default.
			if blockDeviceMapping, ok := lo.Find(blockDeviceMappings, func(bdm *v1.BlockDeviceMapping) bool {
				return *bdm.DeviceName == *amiFamily.EphemeralBlockDevice()
			}); ok && blockDeviceMapping.EBS.VolumeSize != nil {
				return blockDeviceMapping.EBS.VolumeSize
			}
		}
	}
	// Return the ephemeralBlockDevice size if defined in the AMI.
	if ephemeralBlockDevice, ok := lo.Find(amiFamily.DefaultBlockDeviceMappings(), func(item *v1.BlockDeviceMapping) bool {
		return *amiFamily.EphemeralBlockDevice() == *item.DeviceName
	}); ok {
		return ephemeralBlockDevice.EBS.VolumeSize
	}
	return amifamily.DefaultEBS.VolumeSize
}
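
To make the lookup order above easier to follow, here is a minimal, self-contained sketch that models it with plain types. The `mapping` struct, the `selectedSizeGiB` function, and the `defaultGiB` parameter are hypothetical stand-ins introduced for illustration; they are not Karpenter's API. The sketch covers only the non-Custom AMI family path without an instance store policy, and it collapses the two final fallbacks into a single default value.

```go
package main

import "fmt"

// mapping is a hypothetical, stripped-down stand-in for v1.BlockDeviceMapping.
type mapping struct {
	DeviceName string
	RootVolume bool
	SizeGiB    int64 // 0 models "no volume size set"
}

// selectedSizeGiB models the lookup order of ephemeralStorage for a non-Custom
// AMI family: the first mapping flagged as root volume, then the mapping whose
// device name matches the family's ephemeral block device, then a default.
// Only one mapping is ever selected; sizes are never summed.
func selectedSizeGiB(bdms []mapping, ephemeralDevice string, defaultGiB int64) int64 {
	for _, m := range bdms {
		if m.RootVolume {
			if m.SizeGiB != 0 {
				return m.SizeGiB
			}
			break // like lo.Find, only the first match is considered
		}
	}
	for _, m := range bdms {
		if m.DeviceName == ephemeralDevice {
			if m.SizeGiB != 0 {
				return m.SizeGiB
			}
			break
		}
	}
	return defaultGiB
}

func main() {
	// Volumes from the EC2NodeClass in this issue; none sets rootVolume.
	bdms := []mapping{
		{DeviceName: "/dev/xvda", SizeGiB: 2},
		{DeviceName: "/dev/xvdb", SizeGiB: 18},
		{DeviceName: "/dev/xvde", SizeGiB: 20},
		{DeviceName: "/dev/xvdf", SizeGiB: 20},
		{DeviceName: "/dev/xvdg", SizeGiB: 20},
	}
	// The default of 20 is an assumed placeholder value.
	fmt.Println(selectedSizeGiB(bdms, "/dev/xvdb", 20)) // prints 18
}
```

With the configuration from this issue, the model selects /dev/xvdb and ignores /dev/xvde, /dev/xvdf, and /dev/xvdg, which matches the 18Gi we see on the NodeClaim.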

What the Bootstrap Container Does:

Detects all non-root EBS volumes: /dev/xvdb (18Gi), /dev/xvde (20Gi), /dev/xvdf (20Gi), /dev/xvdg (20Gi)
Creates a RAID0 array from all 4 volumes
Mounts the combined storage (78Gi total) at /data
Bind-mounts /var/lib/kubelet and /var/lib/containerd onto the RAID array

The Gap:
Capacity Karpenter calculates for the NodeClaim: 18Gi (only /dev/xvdb, the EphemeralBlockDevice)
Actual storage available on the node: 78Gi (RAID0 of all 4 non-root volumes)
Bin-packing decisions are based on the under-reported capacity, so Karpenter scales out earlier than necessary.
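
To illustrate the effect on bin-packing, the sketch below counts how many pods fit on a node when ephemeral storage is the only constraint. The 10Gi per-pod request is a hypothetical value chosen for the example, `podsThatFit` is an illustrative helper, and the calculation ignores any kubelet or system reservations.

```go
package main

import "fmt"

// podsThatFit returns how many pods with the given ephemeral-storage request
// fit into the given capacity, looking at ephemeral storage alone.
func podsThatFit(capacityGiB, requestGiB int64) int64 {
	if requestGiB <= 0 {
		return 0
	}
	return capacityGiB / requestGiB
}

func main() {
	const requestGiB = 10 // hypothetical per-pod ephemeral-storage request

	fmt.Println(podsThatFit(18, requestGiB)) // capacity Karpenter sees: prints 1
	fmt.Println(podsThatFit(78, requestGiB)) // capacity the node has: prints 7
}
```

Under these assumptions, Karpenter would plan for one such pod per node even though the node could hold seven.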

Example EC2NodeClass configuration (Bottlerocket):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
    - deviceName: /dev/xvda          # Root volume: 2Gi (OS only)
      ebs:
        volumeSize: 2Gi
        volumeType: gp3
        encrypted: true
    - deviceName: /dev/xvdb          # EphemeralBlockDevice: 18Gi
      ebs:
        volumeSize: 18Gi
        volumeType: gp3
        encrypted: true
    - deviceName: /dev/xvde          # Additional: 20Gi
      ebs:
        volumeSize: 20Gi
        volumeType: gp3
        encrypted: true
    - deviceName: /dev/xvdf          # Additional: 20Gi
      ebs:
        volumeSize: 20Gi
        volumeType: gp3
        encrypted: true
    - deviceName: /dev/xvdg          # Additional: 20Gi
      ebs:
        volumeSize: 20Gi
        volumeType: gp3
        encrypted: true

Question:

How should we handle scenarios where userdata/bootstrap scripts aggregate multiple volumes?

Annotation-based override - e.g., karpenter.sh/ephemeral-storage-override: "78Gi"
New field in EC2NodeClass - e.g., spec.ephemeralStorageCapacity: 78Gi
Label-based volume selection - Mark which volumes contribute to ephemeral storage
Sum all non-root volumes
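
As a rough sketch of how two of these options could combine, the code below lets an explicit override win and otherwise sums every non-root volume. All names here (`mapping`, `ephemeralCapacityGiB`, `overrideGiB`) are hypothetical and do not exist in Karpenter; identifying the root volume by device name is also an assumption made for the example.

```go
package main

import "fmt"

// mapping is a hypothetical, stripped-down stand-in for a block device mapping.
type mapping struct {
	DeviceName string
	SizeGiB    int64
}

// ephemeralCapacityGiB returns overrideGiB when it is set (> 0). Otherwise it
// returns the sum of all volumes except the one named rootDevice.
func ephemeralCapacityGiB(bdms []mapping, rootDevice string, overrideGiB int64) int64 {
	if overrideGiB > 0 {
		return overrideGiB
	}
	var total int64
	for _, m := range bdms {
		if m.DeviceName == rootDevice {
			continue // the root volume holds the OS only
		}
		total += m.SizeGiB
	}
	return total
}

func main() {
	bdms := []mapping{
		{DeviceName: "/dev/xvda", SizeGiB: 2},
		{DeviceName: "/dev/xvdb", SizeGiB: 18},
		{DeviceName: "/dev/xvde", SizeGiB: 20},
		{DeviceName: "/dev/xvdf", SizeGiB: 20},
		{DeviceName: "/dev/xvdg", SizeGiB: 20},
	}
	fmt.Println(ephemeralCapacityGiB(bdms, "/dev/xvda", 0))  // no override: prints 78
	fmt.Println(ephemeralCapacityGiB(bdms, "/dev/xvda", 70)) // override set: prints 70
}
```

Summing by default would change reported capacity for anyone who attaches extra volumes without aggregating them, so an opt-in override or field may be the safer of the two behaviors.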

What's the recommended approach for this use case?
How important is this feature to you?

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Metadata

Labels: feature (New feature or request), triage/needs-investigation (Issues that need to be investigated before triaging)