
Add support for configuring efs-utils.conf settings via environment variables or Helm values #1833

@imuni4fun

Is your feature request related to a problem? Please describe.

Yes. The EFS CSI driver currently bundles efs-utils.conf in the container image with no way to override specific settings without replacing the entire configuration file. This creates a maintenance burden when operators need to tune specific parameters for production reliability.

We encountered a race condition on Bottlerocket nodes where TLS certificate rotation (every 60 minutes, per tls_cert_renewal_interval_min = 60) kills stunnel (or efs-proxy) during mass pod evictions. When the rotation coincides with 60+ pods being terminated, the watchdog's 5-minute health check interval (stunnel_health_check_interval_min = 5) is too slow to detect and restart the dead stunnel process. Mount operations then hang, and the stuck goroutines fill every --max-inflight-mount-calls slot, deadlocking the node.

Describe the solution you'd like in detail

Add support for overriding specific efs-utils.conf settings via environment variables or Helm values, so operators can tune parameters without owning the entire config file.

Option 1: Environment Variables

Option 2: Helm Values

The driver's entrypoint would merge these overrides with the bundled efs-utils.conf, preserving upstream defaults for settings not explicitly overridden. This allows critical production tuning while automatically receiving updates from new efs-utils versions (new regions, security patches, feature flags).
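
To make the proposed merge step concrete, here is a minimal sketch of how an entrypoint could layer environment-variable overrides on top of the bundled file. Everything specific in it is an assumption for illustration: the `EFS_UTILS_CONF__<SECTION>__<KEY>` naming scheme, the `merge_overrides` helper, and the section names in the usage example are hypothetical and are not existing driver or efs-utils behavior.

```python
import configparser
import io
from typing import Mapping

# Hypothetical naming scheme: EFS_UTILS_CONF__<SECTION>__<KEY>.
# Underscores in <SECTION> are mapped to hyphens (MOUNT_WATCHDOG -> mount-watchdog).
PREFIX = "EFS_UTILS_CONF__"


def merge_overrides(bundled_conf: str, env: Mapping[str, str]) -> str:
    """Return the bundled INI text with matching env overrides applied."""
    # interpolation=None keeps values containing '%' or '{...}' untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(bundled_conf)

    for name, value in env.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        section_part, sep, key_part = name[len(PREFIX):].partition("__")
        if not sep or not key_part:
            continue  # malformed override name, ignore it
        section = section_part.lower().replace("_", "-")
        key = key_part.lower()
        if not parser.has_section(section):
            parser.add_section(section)
        # Only this key changes; every other bundled setting is preserved.
        parser.set(section, key, value)

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

A possible call, using a two-section stand-in for the bundled file (the section layout here is assumed, not copied from the image):

```python
bundled = (
    "[mount]\n"
    "stunnel_debug_enabled = false\n"
    "\n"
    "[mount-watchdog]\n"
    "stunnel_health_check_interval_min = 5\n"
    "tls_cert_renewal_interval_min = 60\n"
)
env = {"EFS_UTILS_CONF__MOUNT_WATCHDOG__STUNNEL_HEALTH_CHECK_INTERVAL_MIN": "1"}
print(merge_overrides(bundled, env))
```

Two limitations of this sketch: `configparser` drops comments when it rewrites the file, and the underscore-to-hyphen mapping cannot express section names that contain dots or literal underscores, so a real implementation would need a different encoding for those.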

Describe alternatives you've considered

Own the entire efs-utils.conf via ConfigMap - This works but means we miss automatic updates when new AWS regions are added, security patches are applied, or new configuration options are introduced in efs-utils updates.

Init container to copy and modify config - Like the ConfigMap approach above, this requires maintaining a full copy of the config.

Fork and maintain a custom image - Too much overhead for what should be a simple configuration override.

All alternatives require tracking upstream changes manually and rebasing our config file with each efs-utils release.

Additional context

  • Our specific fix includes stunnel_health_check_interval_min = 1 (down from the default of 5), so a dead stunnel is detected within about 1 minute instead of 5
  • Combined with --force-unmount-after-timeout=true and --max-inflight-mount-calls=30, this reduces the chance the node will deadlock during TLS cert rotation
  • This issue is more prevalent on Bottlerocket (read-only root FS) than AL2, making configuration flexibility even more important
  • Related to #1821 ("efs-utils.conf not updated during Helm upgrade causing FIPS configuration mismatch")
