efs-proxy death spiral under high-throughput workloads: hard NFS mounts cause unrecoverable OOM loop #1827

@jutley

Description

Describe the bug

When a node runs high-throughput EFS workloads, efs-plugin can be driven into a crash loop from which it cannot recover without manual intervention. The sequence is:

  1. A workload generates sustained high write volume against an EFS PVC
  2. efs-plugin is OOMKilled (when a memory limit is configured)
  3. Because EFS mounts use hard NFS mount mode by default, all in-flight and subsequent I/O from every EFS-backed pod on the node queues in the kernel rather than failing
  4. On the next efs-plugin restart, the Go watchdog restarts the Python watchdog, which spawns fresh efs-proxy processes per mount. The kernel NFS client immediately reconnects and flushes all queued I/O through the new proxies simultaneously
  5. The memory spike from processing the backlog causes another OOMKill before the container stabilises
  6. Repeat indefinitely — the restart loop cannot resolve itself as long as the hard-mounted volumes remain active

We confirmed this by inspecting /proc/mounts (7 EFS mounts still active with dead proxy ports), /proc/1/mountstats (millions of queued write operations), and D-state processes blocked on rpc_wait_bit_killable. After force-unmounting the high-throughput mounts from the host, efs-plugin immediately stabilised.
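For anyone else triaging this, the mountstats check can be scripted. The fragment below is a synthetic sample so the parsing is reproducible (the device paths and counts are illustrative, not taken from our node); on an affected node you would read the real /proc/1/mountstats. The per-op line format is the standard Linux NFS client one, where the second field of the WRITE: line is the operation count:

```shell
# Synthetic /proc/1/mountstats fragment; replace with the real file on a live node.
cat > /tmp/mountstats.sample <<'EOF'
device 127.0.0.1:/ mounted on /var/lib/kubelet/pods/example1/mount with fstype nfs4 statvers=1.1
        WRITE: 2500000 2500000 0 1048576000 0 0 0 0
        READ: 12000 12000 0 4096000 0 0 0 0
device 127.0.0.1:/ mounted on /var/lib/kubelet/pods/example2/mount with fstype nfs4 statvers=1.1
        WRITE: 1500000 1500000 0 524288000 0 0 0 0
EOF

# Total WRITE operations across all NFS mounts (field 2 of each WRITE: line):
awk '$1 == "WRITE:" { total += $2 } END { print total }' /tmp/mountstats.sample

# Companion checks on a live node:
#   grep nfs4 /proc/mounts                            # mounts still pointing at dead proxy ports
#   ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'    # D-state tasks (rpc_wait_bit_killable)
```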

The underlying architectural concern

All EFS I/O on a node is proxied through a single efs-plugin container. The process tree is: Go driver → Go watchdog loop (keeps Python alive) → Python amazon-efs-mount-watchdog → one efs-proxy process per mount. All processes share one cgroup and one memory limit (when configured — the default deployment ships with no resource limits). This makes efs-plugin a node-level single point of failure for every EFS-backed workload on the node — a single OOMKill silently hangs all EFS I/O simultaneously via the hard mount behaviour.
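The shared-cgroup claim is easy to verify on a node. A small helper like the one below (the function name is ours; the efs-proxy PIDs would come from something like pgrep -f efs-proxy inside the efs-plugin container) compares cgroup membership of two PIDs:

```shell
# Compare cgroup membership of two PIDs; every process in the efs-plugin
# container (driver, watchdogs, all efs-proxy instances) should report "shared".
same_cgroup() {
  [ "$(cat /proc/"$1"/cgroup)" = "$(cat /proc/"$2"/cgroup)" ] \
    && echo shared || echo separate
}

# Real use: same_cgroup <efs-proxy-pid-1> <efs-proxy-pid-2>
same_cgroup $$ $$   # comparing a PID with itself is trivially "shared"
```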

forceUnmountAfterTimeout does not help here: it addresses stuck NodeUnpublishVolume RPCs, not the I/O backlog that forms at the kernel NFS layer while the proxy is dead. The backlog flush on restart is a kernel NFS client behaviour; there is no mechanism in the driver or watchdog to prevent, drain, or rate-limit it.

Note on encryptInTransit: "false": we tested whether disabling TLS would remove efs-proxy from the data path. It does not — even with encryption disabled, mounts still route through 127.0.0.1:<port> via efs-proxy. The proxy is architectural, not optional.
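The check here was just reading /proc/mounts. The line below is a representative sample (the pod path and port number are placeholders, not values from our node); even with TLS disabled, every EFS mount carries a per-mount proxy port and the loopback address:

```shell
# Sample /proc/mounts entry for an EFS mount with encryptInTransit "false":
line='127.0.0.1:/ /var/lib/kubelet/pods/example/volumes/kubernetes.io~csi/pv/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp,port=20052,addr=127.0.0.1 0 0'

# The mount still targets the local efs-proxy:
echo "$line" | grep -o 'port=[0-9]*'
echo "$line" | grep -o 'addr=[0-9.]*'
```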

Note on memory regression since v2.0.0: the introduction of efs-proxy in v2.0.0 (replacing stunnel) brought a significant memory increase that makes OOM conditions far more likely in practice. See #1523 which reports ~6x higher memory usage after upgrading from v1.7.7 to v2.1.0, with users who previously ran fine at 150Mi now requiring 500Mi+. The memory footprint of efs-proxy relative to stunnel appears to be a contributing factor to why this failure mode is now regularly encountered.

Environment

  • Driver version: v2.3.0
  • Helm chart: aws-efs-csi-driver-3.4.0
  • Kubernetes: v1.31.2
  • OS: Flatcar Container Linux
  • Container runtime: containerd 1.6.16

Steps to reproduce

  1. Configure a memory limit on the efs-plugin container
  2. Schedule a high-throughput write workload using an EFS PVC with default hard mount options on that node
  3. Sustain write load until efs-plugin is OOMKilled
  4. Observe that efs-plugin enters an unrecoverable crash loop — restart count climbs indefinitely, memory spikes to the limit within seconds of each restart
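A minimal load generator for steps 2–3 might look like the sketch below. The function name and dd parameters are illustrative, not what we ran; on a real reproduction DIR would be the EFS-backed mount path inside the pod, not a local directory:

```shell
# sustained_write DIR MB PASSES: fsync'd sequential writes, repeated PASSES times.
sustained_write() {
  local dir=$1 mb=$2 passes=$3 i
  for i in $(seq "$passes"); do
    dd if=/dev/zero of="$dir/loadfile" bs=1M count="$mb" conv=fsync 2>/dev/null
  done
}

# Real repro: point at the EFS mount and sustain load until efs-plugin is
# OOMKilled, e.g. sustained_write /data 1024 100000
# Local smoke test (writes 2 MiB to /tmp):
sustained_write /tmp 2 1
```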

Expected behaviour

Either:

  • efs-plugin should be able to restart without inheriting a crash-inducing I/O backlog, or
  • The design should prevent a single container OOM from silently blocking all EFS I/O across unrelated workloads on the same node

Possible mitigations to consider

  • Soft mounts or mount timeout: default NFS mount options could use soft or timeo+retrans tuning to allow I/O to fail rather than queue indefinitely when the proxy is unreachable
  • Per-mount process isolation: if each mount's proxy ran in its own cgroup/limit, one overwhelmed mount couldn't take down the others
  • Back-pressure / load shedding: efs-proxy could shed or reject new I/O when under memory pressure rather than crashing
  • Memory limit scaling is not a practical fix: raising the memory limit is a cluster-wide change — every node pays the reservation cost even though only nodes running high-throughput EFS workloads need it. Segregating high-I/O workloads onto dedicated node pools with a higher limit is theoretically possible but operationally unreasonable as a general solution to what is a driver-level problem
  • Documentation: at minimum, document that efs-plugin is in the critical I/O path for all EFS workloads on the node and that hard mounts combined with an OOMKilled proxy create this failure mode
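For the first mitigation, the knob already exists at the Kubernetes level: mountOptions on a PersistentVolume are passed through to the mount. A sketch of what soft-mount tuning could look like follows; the soft/timeo/retrans values are illustrative starting points, not tested recommendations, the volumeHandle is a placeholder, and whether these options interact sanely with efs-proxy's loopback mount is exactly the open question. Soft mounts also trade indefinite queuing for possible I/O errors surfacing to applications, so this needs careful evaluation before becoming any kind of default:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-soft            # illustrative name
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678  # placeholder filesystem ID
  mountOptions:
    - soft                     # return an error instead of queuing indefinitely
    - timeo=600                # per-RPC timeout in deciseconds (60s)
    - retrans=2                # retransmissions before the error is returned
```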
