Add cpu metrics for k8s pods and containers #3539
jmmcorreia wants to merge 4 commits into open-telemetry:main
Conversation
ChrsMark left a comment
Can we split this into smaller standalone PRs?
I would focus on k8s.pod.cpu.limit_utilization and k8s.pod.cpu.request_utilization first so as to unblock open-telemetry/opentelemetry-collector-contrib#46464
Sure, makes sense. I will keep this PR for … EDIT: Note, for now I also kept …
```yaml
brief: "Maximum CPU resource limit set for the pod."
note: |
  Pod-level CPU limits are supported from Kubernetes 1.34+ via the PodSpec resources field.
  When not specified at the pod level, the value may be derived from the sum of container limits.
```
---
Should we emit k8s.pod.cpu.limit if it's not explicitly set on the Pod? I think deriving this as a sum from container limits should only be a concern for the k8s.pod.cpu.limit_utilization metric.
---
That is a good point, I probably should have detailed a little more why I proposed it this way.
K8s itself also starts with an aggregation of container resources (i.e. https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-helpers/resource/helpers.go#L343)
It then overwrites the aggregated values if something is defined at the pod-level resource spec. What this means is that if, for example, a CPU limit is defined at the pod-level resource spec, we might still see a pod-level CPU request (and also memory limit/request) being emitted, since k8s provides those by aggregating the container values. (Or conversely, we could see a CPU limit being emitted despite maybe only the request being present in the pod-level spec.)
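The aggregate-then-override behavior described above can be sketched as follows. This is a minimal, simplified model, not the real k8s helper: the function name is hypothetical, and CPU values are plain scalars instead of the `ResourceList` quantities the actual `component-helpers` code operates on.

```go
package main

import "fmt"

// effectiveCPULimit models (in simplified form) how k8s derives a pod-level
// CPU limit: start from the sum of container limits, then let an explicit
// pod-level spec value override the aggregate.
// A limit of 0 stands for "not set".
func effectiveCPULimit(podLevelLimit float64, podLevelSet bool, containerLimits []float64) (float64, bool) {
	if podLevelSet {
		return podLevelLimit, true // pod-level spec wins over any aggregation
	}
	sum := 0.0
	for _, l := range containerLimits {
		if l == 0 {
			return 0, false // a container with no limit makes the sum meaningless
		}
		sum += l
	}
	return sum, true // fall back to aggregated container limits
}

func main() {
	limit, ok := effectiveCPULimit(0, false, []float64{0.5, 0.25})
	fmt.Println(limit, ok) // aggregated from containers: 0.75 true
	limit, ok = effectiveCPULimit(2.0, true, []float64{0.5, 0.25})
	fmt.Println(limit, ok) // pod-level value overrides: 2 true
}
```

The same shape applies independently per resource (CPU/memory) and per field (limit/request), which is why a pod-level CPU limit in the spec can coexist with an aggregated CPU request in the output.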
Based on this, I preferred this approach due to the following reasons:
- The current proposal covers the k8s behavior where emitted values might actually be related to an aggregation of container values instead of something coming from the spec.
- It gives more flexibility to also emit limit and request aligned with the limit/request utilization calculations. Hence, any user is either able to enable limit/request and use them as kubeletstats receiver does, or they can just get the already computed value instead by enabling utilization.
---
I see, thanks for the context. In that case, should we make the definition more accurate? Something like 'the reported value is the "Effective Limit" following the formulas from Kubernetes', with a link to any helper function and KEP?
I also see 2 open questions here:
- What should we do for the overhead that might also be used at https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-helpers/resource/helpers.go#L365? Should we always include it? If K8s adds this by default then maybe we should too.
- Should we also include the definition from https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-helpers/resource/helpers.go#L379-L380? BTW, does k8s require all containers to have limit/request set for the summary to be used as the fallback? If not then maybe we need to adjust accordingly.
---
You are definitely right, we will need to make the definitions more accurate. I actually realized that there were some things I misunderstood even after having looked at the k8s code, so it could be even worse for someone working only from the current definition.
I will just detail here a little bit what I know and where I'm focusing right now in case someone is also able to chime in on this.
Regarding pod overhead, you are right, I think we probably should also include it. Thankfully, this should be straightforward as it will be included in the /pods API, so no additional API calls will be needed.
Regarding point 2, I think this is where I need to do some more exploration to make sure all gaps are closed. This is what I know so far:
- Sum of container limits would generally match the pod limits, unless the init containers somehow have a higher limit, in which case the init container limit is used.
- Defining pod-level limits but no requests means requests will get populated (https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/v1/defaults.go#L198). However, this does not work the other way around; k8s only needs requests to be present for scheduling. Hence, this also creates an asymmetric behavior that I at first forgot to account for.
- Due to the prior point, the k8s pods API will never return limits as a sum of container resources (this can only ever happen for requests). Initially I wrongly assumed this would go both ways.
- Due to point 1 in https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5419-pod-level-resources-in-place-resize/README.md#risks-and-mitigations, if this resize feature is ever enabled, the pod spec information might not represent the value k8s is actually using.
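The defaulting asymmetry in the second point can be sketched as follows. This is a hypothetical, simplified model of the linked defaulting code, using plain scalars where the real code uses resource quantities; the type and function names are illustrative.

```go
package main

import "fmt"

// resources models a pod-level CPU request/limit pair; 0 means "not set".
type resources struct {
	Request, Limit float64
}

// applyDefaults sketches the asymmetric defaulting behavior: a missing
// request is filled in from the limit, but a missing limit is never
// filled in from the request (k8s only needs requests for scheduling).
func applyDefaults(r resources) resources {
	if r.Request == 0 && r.Limit != 0 {
		r.Request = r.Limit // request defaults to the limit
	}
	// note: the limit is intentionally NOT defaulted from the request
	return r
}

func main() {
	fmt.Println(applyDefaults(resources{Limit: 2}))   // {2 2}: request gets populated
	fmt.Println(applyDefaults(resources{Request: 1})) // {1 0}: limit stays unset
}
```

This is why the pods API can return a request that is really an aggregate or a default, while a returned limit always reflects an explicitly set value.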
Right now, I'm exploring to check if it's possible to get access to the limit and request values as computed by k8s using the PodLimits and PodRequests methods respectively. If that is the case, maybe we could consider using those and aligning the definition with that. However, let me know if you see any reason for us to avoid going this route, in which case we can discuss the remaining alternatives.
```yaml
- k8s.pod
note: >
  The value range is [0.0, 1.0]. A value of 1.0 means the pod is using 100% of its CPU limit.
  If the CPU limit is not set for the pod, or if one of the pod's containers has no CPU limit, this metric SHOULD NOT be emitted.
```
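A minimal sketch of the semantics in that note, under the assumption that the pod limit is derived from container limits when not set on the pod itself (function name is hypothetical, values are plain CPU cores):

```go
package main

import (
	"errors"
	"fmt"
)

// podCPULimitUtilization models the note above: utilization is pod CPU usage
// divided by the pod CPU limit, and no value is emitted (an error here) when
// the limit cannot be determined because a container has no CPU limit.
// A limit of 0 stands for "not set".
func podCPULimitUtilization(usageCores float64, containerLimits []float64) (float64, error) {
	limit := 0.0
	for _, l := range containerLimits {
		if l == 0 {
			return 0, errors.New("a container has no CPU limit: metric SHOULD NOT be emitted")
		}
		limit += l
	}
	if limit == 0 {
		return 0, errors.New("no CPU limit set for the pod: metric SHOULD NOT be emitted")
	}
	return usageCores / limit, nil
}

func main() {
	u, err := podCPULimitUtilization(0.5, []float64{0.5, 0.5})
	fmt.Println(u, err) // 0.5 <nil>
}
```

An error return here corresponds to the metric simply being absent from the scrape, rather than a zero data point.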
---
Should we instead define here that the metric is either taken directly from the Pod spec, or otherwise calculated based on the containers' limits if all of them define one?
---
I detailed in the comment above why that was covered in the limit section (i.e. mostly due to k8s behavior). I can make the change, but I feel this way provides a little more leeway to the limit value and keeps limit and limit_utilization more aligned with each other.
If the sum of the containers' limits only happens for the utilization metric, then imo the two metrics might feel somewhat inconsistent with each other, if that makes sense.
---
I agree! My point is mostly about explicitly defining how these ratios are calculated. We can re-use the definitions from k8s.pod.cpu.limit/request to state how the limits/requests are computed.
---
Got it, I understand it now and that is a great point! I am going to address the comment regarding the limits/requests definitions to make them more specific and clear. At that point, I will also change the definition here to rely more on the other metrics.
This PR has been labeled as stale due to lack of activity. It will be automatically closed if there is no further activity over the next 7 days.
NOTE: PR ON HOLD
This PR is on hold until issue #3558 is resolved and K8s PR kubernetes/kubernetes#136676 is merged. Due to KEP https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5419-pod-level-resources-in-place-resize/README.md, the limit value will have two possible states from k8s v1.36 onwards: desired and actual. Issue #3558 attempts to solve that problem for containers, since the feature is already available for those. Learnings from there will be ported to this PR later on.
Related to #2768
Changes
This PR proposes new k8s CPU metrics for the pod-level resources spec introduced in k8s v1.34, related to issue #2768:

- `k8s.pod.cpu.limit`
- `k8s.pod.cpu.request`
- `k8s.pod.cpu.limit_utilization`
- `k8s.pod.cpu.request_utilization`

With the introduction of pod-level resource limits and requests for CPU and memory, it makes sense to align with the container metrics and introduce these for those who now want to define resource limits/requests at the pod level.
There is a PR out for the kubeletstats receiver to update `k8s.pod.cpu.limit_utilization` and `k8s.pod.cpu.request_utilization` to make use of the new k8s feature. Note that since they already existed in the receiver, the proposal is to extend on top of the existing behavior instead of replacing it (open-telemetry/opentelemetry-collector-contrib#46464).

Assuming they are approved, `k8s.pod.cpu.limit` and `k8s.pod.cpu.request` would later be introduced in the k8scluster receiver alongside their container counterparts.