Update cpu resource metrics to handle resize #3559
jmmcorreia wants to merge 9 commits into open-telemetry:main
Conversation
ChrsMark
left a comment
Thanks for digging into this @jmmcorreia! I left some comments, but overall I think it goes in the right direction!
@open-telemetry/semconv-k8s-approvers PTAL
model/k8s/metrics.yaml (Outdated)
  - id: metric.k8s.container.cpu.limit.actual
    type: metric
-   metric_name: k8s.container.cpu.request
+   metric_name: k8s.container.cpu.limit.actual
No strong preference here, but maybe an alternative could be `k8s.container.cpu.limit.current`. @open-telemetry/semconv-k8s-approvers @dashpole thoughts?
Just for reference, I did not really have any strong preference, so I decided to align the naming scheme with the k8s docs, i.e. the Key Concepts section under https://kubernetes.io/docs/tasks/configure-pod-container/resize-container-resources/
I don't see current or actual used much in other conventions, so I think we are fine to stick with what is closest to the k8s docs.
Thinking about this more, I think I do prefer current. desired and actual make sense when comparing the two (e.g. writing an alert if they differ from each other for a long time), but don't make as much sense when viewing actual in isolation.
E.g. if k8s allowed updating the container image in-place, it would look odd to me to see k8s.container.image.actual, but k8s.container.image.current makes sense.
If we did decide to make the desired request/limit opt-in, I think this would be especially important.
I think your example makes sense. I also took some time to check what existing metrics seem to be doing, and current seems to be the standard. The k8s metrics `k8s.hpa.pod.current` and `k8s.hpa.pod.desired` are also defined this way, which follows the proposal here. Finally, as per the k8s definitions, actual refers to what is currently configured.
I think you make a valid point. Will make the change from actual to current.
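To make the rename concrete, here is a hypothetical sketch of how the `current`/`desired` pair could look in `model/k8s/metrics.yaml`. The field names follow the existing semconv model schema, but the briefs, units, and stability levels are illustrative, not the PR's final wording:

```yaml
groups:
  # Hypothetical sketch; not the PR's final YAML.
  - id: metric.k8s.container.cpu.limit.current
    type: metric
    metric_name: k8s.container.cpu.limit.current
    stability: development
    brief: "The CPU limit currently in effect, from the container status."
    instrument: updowncounter
    unit: "{cpu}"
  - id: metric.k8s.container.cpu.limit.desired
    type: metric
    metric_name: k8s.container.cpu.limit.desired
    stability: development
    brief: "The desired CPU limit, from the container spec."
    instrument: updowncounter
    unit: "{cpu}"
```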
  - k8s.container
    note: |
      The value range is [0.0,1.0]. A value of 1.0 means the container is using 100% of its actual CPU request.
      If the CPU request is not set, this metric SHOULD NOT be emitted for that container.
A question for my own understanding: if the request is not set at the container level by the user but is set at the Pod level, will k8s automatically apply it down to all containers, or otherwise how are container-level resources affected by the Pod-level defined resources?
For a detailed view, you can check the tables under https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#comprehensive-tabular-view, as those cover all the possible cases.
So if your question is whether, by defining pod limits/requests, k8s will also set the container limits/requests in the pod spec, then the answer is no. In that case, the container values will remain empty. I also checked the API response to see if container actual/current limits would be returned, but that does not seem to be the case.
Pod resource limits can be set as the container cgroup max value when no container limit is set, mostly to ensure runtimes such as jvm can read that value and adapt accordingly.
For all other cases, if a value is missing, then the container cgroup value will use the default available. -> My assumption is that basically containers are allowed to use the pod's available resources freely until they are exhausted. Once that happens, enforcement happens in the pod cgroup, which then impacts the child cgroups.
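A minimal illustrative manifest for the pod-level case discussed above (names and values are made up; the pod-level `spec.resources` field comes from KEP-2837):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources-demo
spec:
  resources:          # pod-level requests/limits (KEP-2837)
    requests:
      cpu: "1"
    limits:
      cpu: "2"
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      # no `resources` here: k8s does not copy the pod-level values down,
      # so container-level request/limit metrics would not be emitted;
      # enforcement happens in the pod cgroup once resources are exhausted
```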
Co-authored-by: Christos Markou <chrismarkou92@gmail.com>
I have a few hesitations with this:
Did we consider something more like …
Agreed. The utilization metrics do attempt to make that easier on users by doing the comparison on the collector side and providing the presentation metric to them.
That is a valid point, I think it makes sense. Probably a lot of users will not attempt to patch the container resources while they are running.
Before raising the PR I was stuck on deciding between the implemented approach (i.e. …). @dashpole, happy to hear your thoughts on this. @ChrsMark, if you have any other comments or anything to add, I guess that could be helpful.
The point about utilization is a good one. I think maybe switching to …
| `k8s.container.ready` (type: `gauge`) | `k8s.container.ready` (type: `updowncounter`) |
| `k8s.container.cpu_limit_utilization` (type: `gauge`) | `k8s.container.cpu.limit_utilization` (type: `gauge`) |
| `k8s.container.cpu_request_utilization` (type: `gauge`) | `k8s.container.cpu.request_utilization` (type: `gauge`) |
| `k8s.container.cpu_limit` (type: `gauge`) | `k8s.container.cpu.limit.desired`, `k8s.container.cpu.limit.current` (type: `updowncounter`) |
Another option to consider: we keep the existing k8s.container.cpu.limit (it maps to spec.containers[*].resources) and introduce an additive metric like k8s.container.cpu.effective_limit sourced from status.containerStatuses[*].resources. When resize isn't in use or the cluster doesn't support it, implementors can simply not emit the effective metric - there is no migration burden for existing users, and the new metric is only relevant when there's actually a difference to observe.
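For reference, here is a trimmed, illustrative `kubectl get pod -o yaml` excerpt during an in-place resize, showing the two source fields under discussion (values are made up):

```yaml
spec:
  containers:
    - name: app
      resources:
        limits:
          cpu: "2"    # desired value -> would feed k8s.container.cpu.limit.desired
status:
  containerStatuses:
    - name: app
      resources:
        limits:
          cpu: "1"    # currently applied -> would feed k8s.container.cpu.limit.current
```

The two can diverge until the resize completes, or stay apart if the resize is deferred or infeasible.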
@dashpole had a very similar suggestion earlier in the thread, and I had also discussed this with @ChrsMark.
Going with the current approach allows us to neatly group desired, actual, and utilization metrics under limit and request. Moreover, it seems somewhat more aligned with the more recent semantic conventions, where _ is less used.
Also, imo having limit and request as namespaces could make it easier to add more values in the future, if needed, without impacting other metrics (instead of what is happening now).
Finally, I have marked this item as an enhancement instead of a breaking change mostly because none of these metrics seem to be used anywhere, at least not in the latest format. For example, the k8scluster receiver still uses k8s.container.cpu_limit and the kubeletstats receiver uses k8s.container.cpu_limit_utilization. Hence, if there is a moment for making a more drastic but potentially future-proof change, I would say now is the time, since these have not been adopted yet.
Personally I prefer to keep the metrics in the currently proposed format, but happy to hear your thoughts on this take. The same thing will have to be done for container memory and then pod cpu and memory, so it would be nice if we can close on a design we all agree on.
Thanks for the additional context, the desired/current split works for me.
One last concern: I believe the new status field used for the current metric is gated on InPlacePodVerticalScaling and only populated when the feature gate is enabled (beta/on-by-default in 1.33, GA in 1.35). On older clusters the "recommended" current metric can't be emitted while the "opt-in" desired can. I guess implementations can fall back to the spec, but it might be worth adding a note about this and the K8s version requirement in the metric description. I don't think we've had a discussion about how to handle K8s version requirements in semconv before.
Also, just an observation: K8s 1.33+ exposes PodResizePending and PodResizeInProgress as pod conditions. Implementing this open issue for a k8s.pod.condition.status metric would round off the resize observability use case nicely, letting users correlate desired/current divergence with the actual resize state.
That is a great point. I aligned this proposal with the oldest active version, which is 1.33 (see EOL). If that is deemed enough (i.e. only supporting currently active versions), then no special behavior needs to be accounted for at the container level (but it might have to be done for pods, since this feature is more recent there).
We could also document that, for older k8s releases, the only workaround could be to enable the "opt-in" desired metric. It puts more burden on the customer side, but since they would be running a non-active k8s release, I wonder if that could be an acceptable trade-off.
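One way to capture that caveat could be a note on the `current` metrics in the model YAML. This is hypothetical wording, not text from this PR, and the GA version is taken from the discussion above:

```yaml
# Hypothetical note wording for the `current` metrics; not part of this PR.
note: |
  This metric is sourced from `status.containerStatuses[*].resources`, which
  Kubernetes only populates when the `InPlacePodVerticalScaling` feature gate
  is enabled (on by default since 1.33, GA in 1.35). On older clusters,
  instrumentations MAY fall back to the container spec or emit only the
  opt-in `desired` metric.
```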
Co-authored-by: Jina Jain <jjain@splunk.com>
Related to #3558
Changes
Modifies the k8s container limit and request metrics to align with resize of CPU metrics as documented in https://kubernetes.io/docs/tasks/configure-pod-container/resize-container-resources/
It fixes the CPU part related to issue #3558. Memory metrics can be fixed in a follow-up PR based on the comments provided for the CPU section
For more details check issue #3558
Merge requirement checklist
[chore]