
Add cpu metrics for k8s pods and containers#3539

Open
jmmcorreia wants to merge 4 commits into open-telemetry:main from jmmcorreia:k8s_cpu_metrics

Conversation


@jmmcorreia jmmcorreia commented Mar 12, 2026

NOTE: PR ON HOLD

This PR is on hold until issue #3558 is resolved and K8s PR kubernetes/kubernetes#136676 is merged. Due to KEP https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5419-pod-level-resources-in-place-resize/README.md, the limit value will have two possible states from k8s v1.36 onwards: desired and actual. Issue #3558 attempts to solve that problem for containers, since the feature is already available for those. Learnings from there will be ported to this PR later on.

Related to #2768

Changes

This PR proposes new k8s cpu metrics for the pod-level resources spec introduced in k8s v1.34 and related to issue #2768

  • k8s.pod.cpu.limit
  • k8s.pod.cpu.request
  • k8s.pod.cpu.limit_utilization
  • k8s.pod.cpu.request_utilization

With the introduction of pod-level resource limits and requests for CPU and memory, it makes sense to align with the container metrics and introduce these for users who now want to define resource limits/requests at the pod level.
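For reference, the pod-level resources feature puts requests/limits directly on the PodSpec, alongside (and taking precedence over) the per-container values. A minimal illustrative manifest (names like `pod-level-resources-demo` are just examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources-demo
spec:
  resources:          # pod-level resources spec (Kubernetes 1.34+)
    requests:
      cpu: "1"
    limits:
      cpu: "2"
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```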

There is a PR out for the kubeletstats receiver to update k8s.pod.cpu.limit_utilization and k8s.pod.cpu.request_utilization to make use of the new k8s feature. Note that, since these already existed in the receiver, the proposal is to extend the existing behavior rather than replace it. (open-telemetry/opentelemetry-collector-contrib#46464)

Assuming they are approved, k8s.pod.cpu.limit and k8s.pod.cpu.request would later be introduced in the k8scluster receiver alongside their container counterparts.

@github-actions github-actions bot added enhancement New feature or request area:k8s labels Mar 12, 2026
@jmmcorreia jmmcorreia marked this pull request as ready for review March 13, 2026 09:33
@jmmcorreia jmmcorreia requested review from a team as code owners March 13, 2026 09:33
@jmmcorreia jmmcorreia changed the title [DRAFT] Add cpu metrics for k8s pods and containers Add cpu metrics for k8s pods and containers Mar 13, 2026
Member

@ChrsMark ChrsMark left a comment


Can we split this into smaller standalone PRs?

I would focus on k8s.pod.cpu.limit_utilization and k8s.pod.cpu.request_utilization first so as to unblock open-telemetry/opentelemetry-collector-contrib#46464

@jmmcorreia
Author

jmmcorreia commented Mar 16, 2026

Can we split this into smaller standalone PRs?

I would focus on k8s.pod.cpu.limit_utilization and k8s.pod.cpu.request_utilization first so as to unblock open-telemetry/opentelemetry-collector-contrib#46464

Sure, makes sense. I will keep this PR for k8s.pod.cpu.limit_utilization and k8s.pod.cpu.request_utilization, and create follow up PRs for the other metrics.

EDIT: Note that, for now, I also kept pod.cpu.requests and pod.cpu.limits in this PR, since they are closely related and part of the same k8s change. But, if needed, I can break this down further to keep only the utilization metrics.

brief: "Maximum CPU resource limit set for the pod."
note: |
Pod-level CPU limits are supported from Kubernetes 1.34+ via the PodSpec resources field.
When not specified at the pod level, the value may be derived from the sum of container limits.
Member


Should we emit k8s.pod.cpu.limit if it's not explicitly set on the Pod? I think deriving this as a sum from container limits should only be a concern for the k8s.pod.cpu.limit_utilization metric.

Author

@jmmcorreia jmmcorreia Mar 16, 2026


That is a good point, I probably should have detailed a little more why I proposed it this way.

K8s itself also starts with an aggregation of container resources (i.e. https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-helpers/resource/helpers.go#L343)

It then overwrites the aggregated values if something is defined in the pod-level resource spec. What this means is that if, for example, the CPU limit is defined in the pod-level resource spec, we might still see the pod-level CPU request (and also memory limit/request) being emitted, since k8s provides those by aggregating the container values. (Or, conversely, we could see the CPU limit being emitted even though maybe only the request is present in the pod-level spec.)

Based on this, I preferred this approach due to the following reasons:

  • The current proposal covers the k8s behavior where emitted values might actually come from an aggregation of container values rather than from the pod-level spec.
  • It gives more flexibility to also emit limit and request aligned with the limit/request utilization calculations. Hence, a user can either enable limit/request and use them the way the kubeletstats receiver does, or just get the already-computed value by enabling utilization.
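To make the "aggregate, then override" behavior described above concrete, here is a rough sketch in Python. This is an illustration only, not the actual Kubernetes helper; the dictionary field names (`cpu_limit`, `cpu_request`, `resources`) are simplified assumptions, and overhead is ignored:

```python
def effective_pod_cpu(pod_spec):
    """Sketch of the effective pod-level CPU limit/request logic:
    start from the sum of the container values, then override with
    any value set in the pod-level resources spec."""
    # Aggregate container-level values. The limit sum is only meaningful
    # when every container defines a limit; otherwise it is None.
    container_limits = [c.get("cpu_limit") for c in pod_spec["containers"]]
    limit = sum(container_limits) if all(v is not None for v in container_limits) else None
    request = sum(c.get("cpu_request", 0) for c in pod_spec["containers"])

    # Pod-level spec values, when present, take precedence over the aggregate.
    pod_resources = pod_spec.get("resources", {})
    limit = pod_resources.get("cpu_limit", limit)
    request = pod_resources.get("cpu_request", request)
    return limit, request
```

For example, two containers with limits 0.5 and 1.0 cores yield an effective limit of 1.5, unless the pod-level spec sets its own CPU limit, which then wins.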

Member


I see, thanks for the context. In that case, should we make the definition more accurate? Something like 'the reported value is the "Effective Limit", following the formulas from Kubernetes', with links to the relevant helper function and KEP?

I also see 2 open questions here:

  1. What should we do for the overhead that might also be used at https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-helpers/resource/helpers.go#L365? Should we always include it? If K8s adds this by default then maybe we should too.
  2. Should we also include the definition from https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-helpers/resource/helpers.go#L379-L380? BTW, does k8s require all containers to have limit/request set for the summary to be used as the fallback? If not then maybe we need to adjust accordingly.

Author


You are definitely right, we will need to make the definitions more accurate. I actually realized that I had misunderstood some things even after looking at the k8s code, so this could be even worse for someone going only by the current definition.

I will just detail here a little bit what I know and where I'm focusing right now in case someone is also able to chime in on this.

Regarding pod overhead, you are right, I think we probably should also include it. Thankfully, this should be straightforward as it will be included in the /pods API, so no additional API calls will be needed.

Regarding point 2, I think this is where I need to do some more exploration to make sure all gaps are closed. This is what I know so far:

Right now, I'm exploring to check if it's possible to get access to the limit and request values as computed by k8s using the PodLimits and PodRequests methods respectively. If that is the case, maybe we could consider using those and aligning the definition with that. However, let me know if you see any reason for us to avoid going this route, in which case we can discuss the remaining alternatives.

- k8s.pod
note: >
The value range is [0.0, 1.0]. A value of 1.0 means the pod is using 100% of its CPU limit.
If the CPU limit is not set for the pod, or if one of the pod's containers has no CPU limit, this metric SHOULD NOT be emitted.
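Under that note, the ratio could be computed along these lines (an illustrative sketch, not the receiver's implementation; the function and parameter names are assumptions):

```python
def cpu_limit_utilization(cpu_usage_cores, effective_cpu_limit):
    """Return CPU usage as a fraction of the pod's effective CPU limit.

    Returns None when no limit applies, signaling that the metric
    SHOULD NOT be emitted (per the note above)."""
    if effective_cpu_limit is None or effective_cpu_limit == 0:
        return None
    return cpu_usage_cores / effective_cpu_limit
```

So a pod using 0.5 cores against a 2-core limit would report 0.25, and a pod with no applicable limit would report nothing.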
Member


Should we instead define here that the metric is either taken directly from the Pod spec, or otherwise calculated from the containers' limits if all of them define one?

Author


I detailed in the comment above why that was covered in the limit section (i.e. mostly due to k8s behavior). I can make the change, but I feel this way gives a little more leeway to the limit value and keeps limit and limit_utilization more aligned with each other.

If the sum of the containers' limits only happens for the utilization metric, then IMO the two metrics might feel somewhat inconsistent with each other, if that makes sense.

Member


I agree! My point is mostly about explicitly defining how these ratios are calculated. We can re-use the definition from the k8s.pod.cpu.limit/request to state how the limits/requests are computed.

Author


Got it, I understand now, and that is a great point! I am going to address the comment regarding the limits/requests definition to make it more specific and clear. At that point, I will also change the definition here to rely more on the other metrics.

@lmolkova lmolkova moved this from Untriaged to Awaiting codeowners approval in Semantic Conventions Triage Mar 17, 2026
@lmolkova lmolkova moved this from Awaiting codeowners approval to Blocked in Semantic Conventions Triage Mar 27, 2026

github-actions bot commented Apr 2, 2026

This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.

@github-actions github-actions bot added the Stale label Apr 2, 2026
@lmolkova lmolkova moved this from Blocked to Draft in Semantic Conventions Triage Apr 6, 2026
