Pods evicted/drained are reported as terminated.reason=OOMKilled with exitCode 143 (no memory pressure) #18040
Description
/kind bug
1. What kops version are you running? The command kops version will display
this information.
1.33.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
$ kubectl version
Client Version: v1.35.0
Kustomize Version: v5.7.1
Server Version: v1.32.10
3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
Reproduction scenario (simplest observed so far):
- Create a Kubernetes cluster on AWS using kOps with:
  - Kubernetes v1.32.x (kube-proxy v1.32.10),
  - nodeTerminationHandler enabled in the kOps cluster spec (using aws-node-termination-handler),
  - Standard kubelet resource reservations (no extreme overcommit).
- Run a workload pod on a node, e.g.:
  - Namespace: zookeeper
  - Pod: cluster-2
  - Container: zookeeper
  - Container resources: resources.requests.memory: 512Mi, no memory limit set.
  - The pod runs for a long time (in our case since 2026-02-11T10:11:23Z).
- Trigger a node drain / pod eviction not caused by memory pressure, for example:
  - By draining the node (kubectl drain <node> --force --ignore-daemonsets --delete-emptydir-data),
  - By causing the node to be terminated by AWS (e.g. via aws-node-termination-handler / ASG scale-in / maintenance),
  - Or via a rolling update that cordons & drains the node.
- Observe the pod status after the eviction/termination:
  - kubectl get pod -n zookeeper cluster-2 -o yaml
  - And/or watch application logs and metrics around the time of termination.
We have reproduced this multiple times by evicting pods on kOps-managed nodes in this way.
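Once the pod status has been fetched in step 4, the terminated state can be extracted programmatically. The helper below is a hypothetical sketch (not part of our stack) operating on pod JSON such as `kubectl get pod -n zookeeper cluster-2 -o json` would return; the embedded status mirrors what we observe:

```python
import json

def terminated_states(pod: dict) -> dict:
    """Map container name -> (reason, exitCode) for terminated containers."""
    out = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        term = cs.get("state", {}).get("terminated")
        if term:
            out[cs["name"]] = (term.get("reason"), term.get("exitCode"))
    return out

# Minimal example mirroring the status reported in this issue
# (normally this JSON would come from `kubectl get pod ... -o json`).
pod = json.loads("""
{
  "status": {
    "phase": "Failed",
    "containerStatuses": [
      {
        "name": "zookeeper",
        "state": {"terminated": {"reason": "OOMKilled", "exitCode": 143}}
      }
    ]
  }
}
""")

print(terminated_states(pod))  # {'zookeeper': ('OOMKilled', 143)}
```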
5. What happened after the commands executed?
For an example pod (zookeeper/cluster-2):
- The pod is eventually marked as phase: Failed.
- The container (zookeeper) is reported in pod.status.containerStatuses as:
  - state.terminated.reason = "OOMKilled"
  - state.terminated.exitCode = 143
  - state.terminated.startedAt = 2026-02-11T10:11:23Z
  - state.terminated.finishedAt = 2026-03-06T13:34:26Z
- There is no corresponding memory pressure or OOM indication:
  - Prometheus (kubelet / cAdvisor) shows:
    - container_memory_working_set_bytes{namespace="zookeeper",pod="cluster-2",container="zookeeper"} is stable around ~350–370 MiB before termination, well below the 512Mi request.
    - container_oom_events_total{namespace="zookeeper",pod="cluster-2",container="zookeeper"} = 0 at and after the termination time.
  - Kubernetes Events in Loki for this pod show:
    - reason=Unhealthy liveness probe failures (the Zookeeper liveness script exiting 1),
    - No events indicating OOM or memory pressure.
- An observability component (Robusta) logs the pod status it receives from the API server and confirms that pod.status.containerStatuses[*].state.terminated.reason is "OOMKilled" with exitCode = 143 for the terminating container at the time the alert is raised.
So effectively, pods that are being drained/evicted without memory pressure are reported as if they were OOMKilled,
but with exit code 143 (SIGTERM).
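Part of what makes this reporting contradictory is the exit-code convention: a process killed by signal N conventionally exits with code 128 + N, and container runtimes surface that number as the container's exitCode. So a genuine kernel OOM kill (SIGKILL) should pair with 137, while 143 corresponds to SIGTERM, i.e. a graceful shutdown. A minimal sketch of that arithmetic:

```python
# A process killed by signal N conventionally exits with code 128 + N;
# plain ints are used here instead of the signal module for portability.
SIGTERM, SIGKILL = 15, 9

def exit_code_for_signal(sig: int) -> int:
    return 128 + sig

print(exit_code_for_signal(SIGTERM))  # 143: graceful stop, e.g. drain/eviction
print(exit_code_for_signal(SIGKILL))  # 137: what the kernel OOM killer delivers
```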
6. What did you expect to happen?
- For non-OOM terminations such as node drains or manual pod deletions we would expect:
  - container.state.terminated.reason not to be "OOMKilled" (e.g. "Error", "Completed", or another appropriate reason),
  - Or the pod-level status.reason to indicate Evicted or NodeShutdown, while the container termination reason remains consistent with the signal / exit code (143 for SIGTERM, 0 for a clean exit, etc.).
- Specifically, we do not expect the OOMKilled reason for pods being terminated due to node drains or liveness failures without any evidence of memory pressure.

This misclassification is problematic because downstream controllers/alerting (including ours) rely on containerStatuses[*].state.terminated.reason to distinguish true OOM events from expected evictions.
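To illustrate why consistent reason/exit-code pairs matter downstream, here is a hypothetical alerting heuristic (a sketch, not Robusta's actual logic) that only treats a termination as a real OOM when reason and exit code agree:

```python
def is_true_oom(terminated: dict) -> bool:
    """Hypothetical heuristic: accept a termination as a real OOM only when
    reason and exit code agree (OOMKilled => SIGKILL => exit code 137)."""
    return terminated.get("reason") == "OOMKilled" and terminated.get("exitCode") == 137

# Status as reported in this issue: reason and exit code disagree,
# so a checker like this cannot tell whether the OOM ever happened.
reported = {"reason": "OOMKilled", "exitCode": 143}
print(is_true_oom(reported))  # False: exit 143 (SIGTERM) contradicts OOMKilled
```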
7. Please provide your cluster manifest.
# REDACTED / simplified example
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: prod
spec:
  cloudProvider: aws
  kubernetesVersion: 1.32.10
  api:
    loadBalancer:
      type: Public
  nodeTerminationHandler:
    enabled: true
    cpuRequest: 200m
    enableRebalanceMonitoring: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
    prometheusEnable: true
  kubelet:
    # Typical reservations; nothing exotic
    kubeReserved:
      cpu: "1"
      memory: "2Gi"
      ephemeral-storage: "1Gi"
    systemReserved:
      cpu: "500m"
      memory: "1Gi"
      ephemeral-storage: "1Gi"
    enforceNodeAllocatable: "pods,system-reserved,kube-reserved"
  # ... other standard kOps cluster config (subnets, IAM, etc.) ...

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
n.a.
9. Anything else do we need to know?
Any guidance is appreciated 🙏