Describe the bug
When a node runs high-throughput EFS workloads, `efs-plugin` can be driven into a crash loop it cannot recover from without manual intervention. The sequence:
- A workload generates sustained high write volume against an EFS PVC
- `efs-plugin` is OOMKilled (when a memory limit is configured)
- Because EFS mounts use `hard` NFS mount mode by default, all in-flight and subsequent I/O from every EFS-backed pod on the node queues in the kernel rather than failing
- On the next `efs-plugin` restart, the Go watchdog restarts the Python watchdog, which spawns fresh `efs-proxy` processes per mount. The kernel NFS client immediately reconnects and flushes all queued I/O through the new proxies simultaneously
- The memory spike from processing the backlog causes another OOMKill before the container stabilises
- Repeat indefinitely: the restart loop cannot resolve itself as long as the hard-mounted volumes remain active
We confirmed this by inspecting `/proc/mounts` (7 EFS mounts still active with dead proxy ports), `/proc/1/mountstats` (millions of queued write operations), and D-state processes blocked on `rpc_wait_bit_killable`. After force-unmounting the high-throughput mounts from the host, `efs-plugin` immediately stabilised.
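For reference, the inspection can be sketched as a script. The heredoc-style sample, pod path, and counter values below are illustrative stand-ins for real node output, not captured data; the `WRITE:` line follows the kernel's per-op mountstats format (ops, transmissions, timeouts, bytes sent/received, cumulative queue time, RTT, execute time).

```shell
# Stand-in for real /proc/1/mountstats content; on a live node read the file directly.
sample='device 127.0.0.1:/ mounted on /var/lib/kubelet/pods/POD-UID/volumes/kubernetes.io~csi/pvc-1/mount with fstype nfs4 statvers=1.1
    WRITE: 5123411 5123411 0 335544320 0 9876543210 123456 9999999999'

# 1. EFS mounts still routed through a local (possibly dead) efs-proxy port:
echo "$sample" | grep -c '^device 127.0.0.1'

# 2. Write operations recorded per mount (field 2 of the WRITE line):
echo "$sample" | awk '/WRITE:/ { print $2 }'

# 3. On the host, D-state tasks stuck in the NFS client can be found with:
#    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/ && $3 == "rpc_wait_bit_killable"'
```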
The underlying architectural concern
All EFS I/O on a node is proxied through a single `efs-plugin` container. The process tree is: Go driver → Go watchdog loop (keeps Python alive) → Python `amazon-efs-mount-watchdog` → one `efs-proxy` process per mount. All processes share one cgroup and one memory limit (when configured; the default deployment ships with no resource limits). This makes `efs-plugin` a node-level single point of failure for every EFS-backed workload on the node: a single OOMKill silently hangs all EFS I/O simultaneously via the `hard` mount behaviour.
`forceUnmountAfterTimeout` does not help here: it addresses stuck `NodeUnpublishVolume` RPCs, not the I/O backlog that forms at the kernel NFS layer while the proxy is dead. The backlog flush on restart is a kernel NFS client behaviour; there is no mechanism in the driver or watchdog to prevent, drain, or rate-limit it.
Note on `encryptInTransit: "false"`: we tested whether disabling TLS would remove `efs-proxy` from the data path. It does not: even with encryption disabled, mounts still route through `127.0.0.1:<port>` via `efs-proxy`. The proxy is architectural, not optional.
Note on memory regression since v2.0.0: the introduction of `efs-proxy` in v2.0.0 (replacing `stunnel`) brought a significant memory increase that makes OOM conditions far more likely in practice. See #1523, which reports ~6x higher memory usage after upgrading from v1.7.7 to v2.1.0, with users who previously ran fine at 150Mi now requiring 500Mi+. The memory footprint of `efs-proxy` relative to `stunnel` appears to be a contributing factor to why this failure mode is now regularly encountered.
Environment
- Driver version: v2.3.0
- Helm chart: aws-efs-csi-driver-3.4.0
- Kubernetes: v1.31.2
- OS: Flatcar Container Linux
- Container runtime: containerd 1.6.16
Steps to reproduce
- Configure a memory limit on the `efs-plugin` container
- Schedule a high-throughput write workload using an EFS PVC with default `hard` mount options on that node
- Sustain write load until `efs-plugin` is OOMKilled
- Observe that `efs-plugin` enters an unrecoverable crash loop: restart count climbs indefinitely, memory spikes to the limit within seconds of each restart
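The sustained-write step can be approximated with a simple parallel-writer script. The writer count and file sizes are illustrative; a real reproduction needs to point this at an EFS-backed path inside a pod and loop it until the OOMKill occurs.

```shell
# Target path: first argument, or a throwaway temp dir for a dry run.
MOUNT_PATH="${1:-$(mktemp -d)}"

# Four parallel writers, each streaming zeros with an fsync at the end
# so the data actually reaches the (proxied) NFS server:
for i in 1 2 3 4; do
  dd if=/dev/zero of="$MOUNT_PATH/load-$i" bs=1M count=8 conv=fsync 2>/dev/null &
done
wait

ls "$MOUNT_PATH" | wc -l   # prints 4 for a fresh directory (one file per writer)
```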
Expected behaviour
Either:
- `efs-plugin` should be able to restart without inheriting a crash-inducing I/O backlog, or
- The design should prevent a single container OOM from silently blocking all EFS I/O across unrelated workloads on the same node
Possible mitigations to consider
- Soft mounts or mount timeout: default NFS mount options could use `soft` or `timeo`/`retrans` tuning to allow I/O to fail rather than queue indefinitely when the proxy is unreachable
- Per-mount process isolation: if each mount's proxy ran in its own cgroup with its own limit, one overwhelmed mount could not take down the others
- Back-pressure / load shedding: `efs-proxy` could shed or reject new I/O when under memory pressure rather than crashing
- Memory limit scaling is not a practical fix: raising the memory limit is a cluster-wide change; every node pays the reservation cost even though only nodes running high-throughput EFS workloads need it. Segregating high-I/O workloads onto dedicated node pools with a higher limit is theoretically possible but operationally unreasonable as a general solution to what is a driver-level problem
- Documentation: at minimum, document that `efs-plugin` is in the critical I/O path for all EFS workloads on the node and that `hard` mounts combined with an OOMKilled proxy create this failure mode
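To make the first mitigation concrete, a sketch of the relevant NFS options (illustrative values, not a tested recommendation; `soft` mounts can surface I/O errors to applications during transient outages, which is exactly the trade-off being proposed):

```shell
# With the CSI driver these would go in the PV/StorageClass mountOptions;
# the equivalent raw mount (filesystem DNS name is a placeholder) is:
#   mount -t nfs4 -o soft,timeo=150,retrans=3 fs-XXXX.efs.us-east-1.amazonaws.com:/ /mnt/efs

TIMEO=150    # NFS timeo is in tenths of a second
RETRANS=3    # retransmissions before a soft mount gives up and errors out

# Minimum wait per attempt before the client retransmits (the kernel
# applies backoff on retries, so the total wait before an error is longer):
echo "each attempt waits at least $(( TIMEO / 10 )) s"
```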