Pods evicted/drained are reported as terminated.reason=OOMKilled with exitCode 143 (no memory pressure) #18040

@adamantal

Description

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.33.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

$ kubectl version
Client Version: v1.35.0
Kustomize Version: v5.7.1
Server Version: v1.32.10

3. What cloud provider are you using?

aws

4. What commands did you run? What is the simplest way to reproduce this issue?

Reproduction scenario (simplest observed so far):

  1. Create a Kubernetes cluster on AWS using kOps with

    • Kubernetes v1.32.x (kube-proxy v1.32.10),
    • nodeTerminationHandler enabled in the kOps cluster spec (using aws-node-termination-handler),
    • Standard kubelet resource reservations (no extreme overcommit).
  2. Run a workload pod on a node, e.g.:

    • Namespace: zookeeper
    • Pod: cluster-2
    • Container: zookeeper
    • Container resources: resources.requests.memory: 512Mi, no memory limit set.
    • The pod runs for a long time (in our case since 2026-02-11T10:11:23Z).
  3. Trigger a node drain / pod eviction not caused by memory pressure, for example:

    • By draining the node (kubectl drain <node> --force --ignore-daemonsets --delete-emptydir-data), or
    • By causing the node to be terminated by AWS (e.g. via aws-node-termination-handler / ASG scale-in / maintenance),
    • Or via a rolling update that cordons & drains the node.
  4. Observe the pod status after the eviction/termination:

    • kubectl get pod -n zookeeper cluster-2 -o yaml
    • And/or watch application logs and metrics around the time of termination.

We have reproduced this multiple times by evicting pods on kOps-managed nodes in this way.
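The inspection in step 4 can be scripted; a minimal sketch, assuming the output of kubectl get pod -n zookeeper cluster-2 -o json has been captured as a string (the sample below mirrors the status observed in this report):

```python
import json

def terminated_states(pod: dict) -> list[dict]:
    """Collect the terminated state of every container in a pod,
    as reported in pod.status.containerStatuses."""
    out = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        term = cs.get("state", {}).get("terminated")
        if term:
            out.append({"container": cs["name"],
                        "reason": term.get("reason"),
                        "exitCode": term.get("exitCode")})
    return out

# Sample shaped like the pod status described in this issue
pod = json.loads("""
{
  "status": {
    "phase": "Failed",
    "containerStatuses": [
      {"name": "zookeeper",
       "state": {"terminated": {"reason": "OOMKilled", "exitCode": 143}}}
    ]
  }
}
""")
print(terminated_states(pod))
# → [{'container': 'zookeeper', 'reason': 'OOMKilled', 'exitCode': 143}]
```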

5. What happened after the commands executed?

For an example pod (zookeeper/cluster-2):

  • The pod is eventually marked as phase: Failed.

  • The container (zookeeper) is reported in pod.status.containerStatuses as:

    • state.terminated.reason = "OOMKilled"
    • state.terminated.exitCode = 143
    • state.terminated.startedAt = 2026-02-11T10:11:23Z
    • state.terminated.finishedAt = 2026-03-06T13:34:26Z
  • There is no corresponding memory pressure or OOM indication:

    • Prometheus (kubelet / cAdvisor) shows:
      • container_memory_working_set_bytes{namespace="zookeeper",pod="cluster-2",container="zookeeper"}
        is stable at ~350–370 MiB before termination, well below the 512Mi request.
      • container_oom_events_total{namespace="zookeeper",pod="cluster-2",container="zookeeper"} = 0
        at and after the termination time.
    • Kubernetes Events in Loki for this pod show:
      • reason=Unhealthy liveness probe failures (Zookeeper liveness script exit 1),
      • No events indicating OOM or memory pressure.
  • An observability component (Robusta) logs the pod status it receives from the API server and confirms that
    pod.status.containerStatuses[*].state.terminated.reason is "OOMKilled" with exitCode=143 for the
    terminating container at the time the alert is raised.
    So effectively, pods that are drained/evicted without any memory pressure are reported as if they
    were OOMKilled, even though the exit code 143 (SIGTERM) indicates an ordinary graceful termination.
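The reported combination is internally inconsistent on its own terms: by the usual 128+signal convention, exit code 143 means the process received SIGTERM (graceful shutdown), while a genuine kernel OOM kill delivers SIGKILL and therefore exit code 137. A small sketch of that decoding:

```python
import signal

def describe_exit(exit_code: int) -> str:
    """Decode a container exit code: codes above 128 mean the process
    died from signal (exit_code - 128), per the usual shell convention."""
    if exit_code > 128:
        return f"killed by {signal.Signals(exit_code - 128).name}"
    return f"exited with status {exit_code}"

def oomkilled_is_consistent(reason: str, exit_code: int) -> bool:
    """A genuine kernel OOM kill delivers SIGKILL, i.e. exit code 137.
    OOMKilled together with 143 (SIGTERM) is self-contradictory."""
    return reason != "OOMKilled" or exit_code == 137

print(describe_exit(143))  # → killed by SIGTERM
print(describe_exit(137))  # → killed by SIGKILL
print(oomkilled_is_consistent("OOMKilled", 143))  # → False
```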

6. What did you expect to happen?

  • For non-OOM terminations such as node drains or manual pod deletions we would expect:
    • container.state.terminated.reason not to be "OOMKilled" (e.g. "Error", "Completed", or other appropriate reasons),
    • Or the pod-level status.reason to indicate Evicted or NodeShutdown while container termination reason
      remains consistent with the signal / exit code (143 for SIGTERM, 0 for clean exit, etc.).
  • Specifically, we do not expect the OOMKilled reason for pods terminated due to node drains or
    liveness failures without any evidence of memory pressure. This misclassification is problematic
    because downstream controllers/alerting (including ours) rely on
    containerStatuses[*].state.terminated.reason to distinguish true OOM events from expected evictions.
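Until the reason is reported consistently, downstream consumers can cross-check it against corroborating evidence. A hedged sketch of such a guard (should_alert_oom is a hypothetical helper; the oom_events argument stands in for a container_oom_events_total sample around the termination time):

```python
def should_alert_oom(term: dict, oom_events: int) -> bool:
    """Workaround filter for alerting pipelines (hypothetical helper):
    treat terminated.reason == "OOMKilled" as a real OOM only when it is
    corroborated by exit code 137 (SIGKILL) or a non-zero OOM-event count."""
    if term.get("reason") != "OOMKilled":
        return False
    return term.get("exitCode") == 137 or oom_events > 0

# The zookeeper/cluster-2 case from this report: reason says OOMKilled,
# but exit code 143 and zero OOM events -> suppress the alert.
print(should_alert_oom({"reason": "OOMKilled", "exitCode": 143}, 0))  # → False
print(should_alert_oom({"reason": "OOMKilled", "exitCode": 137}, 1))  # → True
```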

7. Please provide your cluster manifest.

# REDACTED / simplified example
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: prod
spec:
  cloudProvider: aws
  kubernetesVersion: 1.32.10
  api:
    loadBalancer:
      type: Public
  nodeTerminationHandler:
    enabled: true
    cpuRequest: 200m
    enableRebalanceMonitoring: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
    prometheusEnable: true
  kubelet:
    # Typical reservations; nothing exotic
    kubeReserved:
      cpu: "1"
      memory: "2Gi"
      ephemeral-storage: "1Gi"
    systemReserved:
      cpu: "500m"
      memory: "1Gi"
      ephemeral-storage: "1Gi"
    enforceNodeAllocatable: "pods,system-reserved,kube-reserved"
  # ... other standard kOps cluster config (subnets, IAM, etc.) ...

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

n.a.

9. Anything else we need to know?

Any guidance is appreciated 🙏
