efs-proxy death spiral under high-throughput workloads: hard NFS mounts cause unrecoverable OOM loop #1827

@jutley

Description

Describe the bug

When a node runs high-throughput EFS workloads, efs-plugin can be driven into a crash loop from which it cannot recover without manual intervention. The sequence is:

  1. A workload generates sustained high write volume against an EFS PVC
  2. efs-plugin is OOMKilled (when a memory limit is configured)
  3. Because EFS mounts use hard NFS mount mode by default, all in-flight and subsequent I/O from every EFS-backed pod on the node queues in the kernel rather than failing
  4. On the next efs-plugin restart, the Go watchdog restarts the Python watchdog, which spawns fresh efs-proxy processes per mount. The kernel NFS client immediately reconnects and flushes all queued I/O through the new proxies simultaneously
  5. The memory spike from processing the backlog causes another OOMKill before the container stabilises
  6. Repeat indefinitely — the restart loop cannot resolve itself as long as the hard-mounted volumes remain active

We confirmed this by inspecting /proc/mounts (7 EFS mounts still active with dead proxy ports), /proc/1/mountstats (millions of queued write operations), and D-state processes blocked on rpc_wait_bit_killable. After force-unmounting the high-throughput mounts from the host, efs-plugin immediately stabilised.
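For anyone else triaging this, the mountstats check can be scripted. The fragment below is a synthetic sample so the parsing is reproducible (the device paths and counts are illustrative, not taken from our node); on an affected node you would read the real /proc/1/mountstats. The per-op line format is the standard Linux NFS client one, where the second field of the WRITE: line is the operation count:

```shell
# Synthetic /proc/1/mountstats fragment; replace with the real file on a live node.
cat > /tmp/mountstats.sample <<'EOF'
device 127.0.0.1:/ mounted on /var/lib/kubelet/pods/example1/mount with fstype nfs4 statvers=1.1
        WRITE: 2500000 2500000 0 1048576000 0 0 0 0
        READ: 12000 12000 0 4096000 0 0 0 0
device 127.0.0.1:/ mounted on /var/lib/kubelet/pods/example2/mount with fstype nfs4 statvers=1.1
        WRITE: 1500000 1500000 0 524288000 0 0 0 0
EOF

# Total WRITE operations across all NFS mounts (field 2 of each WRITE: line):
awk '$1 == "WRITE:" { total += $2 } END { print total }' /tmp/mountstats.sample

# Companion checks on a live node:
#   grep nfs4 /proc/mounts                            # mounts still pointing at dead proxy ports
#   ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'    # D-state tasks (rpc_wait_bit_killable)
```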

The underlying architectural concern

All EFS I/O on a node is proxied through a single efs-plugin container. The process tree is: Go driver → Go watchdog loop (keeps Python alive) → Python amazon-efs-mount-watchdog → one efs-proxy process per mount. All processes share one cgroup and one memory limit (when configured — the default deployment ships with no resource limits). This makes efs-plugin a node-level single point of failure for every EFS-backed workload on the node — a single OOMKill silently hangs all EFS I/O simultaneously via the hard mount behaviour.
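The shared-cgroup claim is easy to verify on a node. A small helper like the one below (the function name is ours; the efs-proxy PIDs would come from something like pgrep -f efs-proxy inside the efs-plugin container) compares cgroup membership of two PIDs:

```shell
# Compare cgroup membership of two PIDs; every process in the efs-plugin
# container (driver, watchdogs, all efs-proxy instances) should report "shared".
same_cgroup() {
  [ "$(cat /proc/"$1"/cgroup)" = "$(cat /proc/"$2"/cgroup)" ] \
    && echo shared || echo separate
}

# Real use: same_cgroup <efs-proxy-pid-1> <efs-proxy-pid-2>
same_cgroup $$ $$   # comparing a PID with itself is trivially "shared"
```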

forceUnmountAfterTimeout does not help here: it addresses stuck NodeUnpublishVolume RPCs, not the I/O backlog that forms at the kernel NFS layer while the proxy is dead. The backlog flush on restart is a kernel NFS client behaviour; there is no mechanism in the driver or watchdog to prevent, drain, or rate-limit it.

Note on encryptInTransit: "false": we tested whether disabling TLS would remove efs-proxy from the data path. It does not — even with encryption disabled, mounts still route through 127.0.0.1:<port> via efs-proxy. The proxy is architectural, not optional.
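The check here was just reading /proc/mounts. The line below is a representative sample (the pod path and port number are placeholders, not values from our node); even with TLS disabled, every EFS mount carries a per-mount proxy port and the loopback address:

```shell
# Sample /proc/mounts entry for an EFS mount with encryptInTransit "false":
line='127.0.0.1:/ /var/lib/kubelet/pods/example/volumes/kubernetes.io~csi/pv/mount nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp,port=20052,addr=127.0.0.1 0 0'

# The mount still targets the local efs-proxy:
echo "$line" | grep -o 'port=[0-9]*'
echo "$line" | grep -o 'addr=[0-9.]*'
```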

Note on memory regression since v2.0.0: the introduction of efs-proxy in v2.0.0 (replacing stunnel) brought a significant memory increase that makes OOM conditions far more likely in practice. See #1523 which reports ~6x higher memory usage after upgrading from v1.7.7 to v2.1.0, with users who previously ran fine at 150Mi now requiring 500Mi+. The memory footprint of efs-proxy relative to stunnel appears to be a contributing factor to why this failure mode is now regularly encountered.

Environment

  • Driver version: v2.3.0
  • Helm chart: aws-efs-csi-driver-3.4.0
  • Kubernetes: v1.31.2
  • OS: Flatcar Container Linux
  • Container runtime: containerd 1.6.16

Steps to reproduce

  1. Configure a memory limit on the efs-plugin container
  2. Schedule a high-throughput write workload using an EFS PVC with default hard mount options on that node
  3. Sustain write load until efs-plugin is OOMKilled
  4. Observe that efs-plugin enters an unrecoverable crash loop — restart count climbs indefinitely, memory spikes to the limit within seconds of each restart
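A minimal load generator for steps 2–3 might look like the sketch below. The function name and dd parameters are illustrative, not what we ran; on a real reproduction DIR would be the EFS-backed mount path inside the pod, not a local directory:

```shell
# sustained_write DIR MB PASSES: fsync'd sequential writes, repeated PASSES times.
sustained_write() {
  local dir=$1 mb=$2 passes=$3 i
  for i in $(seq "$passes"); do
    dd if=/dev/zero of="$dir/loadfile" bs=1M count="$mb" conv=fsync 2>/dev/null
  done
}

# Real repro: point at the EFS mount and sustain load until efs-plugin is
# OOMKilled, e.g. sustained_write /data 1024 100000
# Local smoke test (writes 2 MiB to /tmp):
sustained_write /tmp 2 1
```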

Expected behaviour

Either:

  • efs-plugin should be able to restart without inheriting a crash-inducing I/O backlog, or
  • The design should prevent a single container OOM from silently blocking all EFS I/O across unrelated workloads on the same node

Possible mitigations to consider

  • Soft mounts or mount timeout: default NFS mount options could use soft or timeo+retrans tuning to allow I/O to fail rather than queue indefinitely when the proxy is unreachable
  • Per-mount process isolation: if each mount's proxy ran in its own cgroup/limit, one overwhelmed mount couldn't take down the others
  • Back-pressure / load shedding: efs-proxy could shed or reject new I/O when under memory pressure rather than crashing
  • Memory limit scaling is not a practical fix: raising the memory limit is a cluster-wide change — every node pays the reservation cost even though only nodes running high-throughput EFS workloads need it. Segregating high-I/O workloads onto dedicated node pools with a higher limit is theoretically possible but operationally unreasonable as a general solution to what is a driver-level problem
  • Documentation: at minimum, document that efs-plugin is in the critical I/O path for all EFS workloads on the node and that hard mounts combined with an OOMKilled proxy create this failure mode
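For the first mitigation, the knob already exists at the Kubernetes level: mountOptions on a PersistentVolume are passed through to the mount. A sketch of what soft-mount tuning could look like follows; the soft/timeo/retrans values are illustrative starting points, not tested recommendations, the volumeHandle is a placeholder, and whether these options interact sanely with efs-proxy's loopback mount is exactly the open question. Soft mounts also trade indefinite queuing for possible I/O errors surfacing to applications, so this needs careful evaluation before becoming any kind of default:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-soft            # illustrative name
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678  # placeholder filesystem ID
  mountOptions:
    - soft                     # return an error instead of queuing indefinitely
    - timeo=600                # per-RPC timeout in deciseconds (60s)
    - retrans=2                # retransmissions before the error is returned
```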
