
feat: implement pod mutation observability metrics#4644

Open
Arpit529Srivastava wants to merge 15 commits into open-telemetry:main from Arpit529Srivastava:pod-mutation-metrics

Conversation

@Arpit529Srivastava
Contributor

Description:

Adds observability to the pod mutation webhook by introducing a new Prometheus counter, opentelemetry_operator_pod_mutations_total. This allows operators to track success, failure, and skip rates for both sidecar injection and auto-instrumentation across the cluster. The metric includes detailed labels such as status, reason, type, language, and namespace, making it easier to distinguish between cases like configuration errors, disabled feature gates, or missing resources.

Link to tracking Issue(s):

Testing:
A new unit test was added at internal/webhook/podmutation/metrics_test.go to verify that the counter increments correctly with the expected label set.

Integration checks confirm that both the sidecar and instrumentation mutators invoke the metrics recorder at all critical decision points.

Existing unit tests were updated to ensure there are no regressions in the mutator logic.

Documentation:

Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
@Arpit529Srivastava Arpit529Srivastava requested a review from a team as a code owner January 22, 2026 15:44
@Arpit529Srivastava
Contributor Author

/cc @swiatekm @iblancasa PTAL.
thanks.

Contributor

@iblancasa left a comment


Maybe we should add a new option to enable/disable recording these metrics.
I'd also like to see some E2E tests to ensure the feature works properly.

}

func (m *PodMutationMetrics) RecordSidecarMutation(ctx context.Context, status, reason, namespace string) {
	if m == nil {
Contributor


I think this can happen for tests, right? How about creating an interface and have a noopMetrics type for the tests?

attrs := []attribute.KeyValue{
attribute.String("mutation_type", "sidecar"),
attribute.String("status", status),
attribute.String("namespace", namespace),
Contributor


I'm not sure about this... because of cardinality

Contributor Author


I recommend keeping it, since it's critical for debugging and has bounded cardinality. With the addition of the --enable-webhook-metrics flag, this is now opt-in, which minimizes the risk. Happy to discuss other options.


if inst, err = pm.getInstrumentationInstance(ctx, ns, pod, annotationInjectJava); err != nil {
// we still allow the pod to be created, but we log a message to the operator's logs
pm.metrics.RecordInstrumentationMutation(ctx, "error", "lookup_failed", "java", ns.Name)
Contributor


I think we should define some constants:

  const (
      StatusSuccess  = "success"
      StatusSkipped  = "skipped"
      StatusRejected = "rejected"
      StatusError    = "error"

      ReasonAlreadyInstrumented = "already_instrumented"
      ReasonFeatureDisabled     = "feature_disabled"
      ReasonLookupFailed        = "lookup_failed"
  )

attrs := []attribute.KeyValue{
attribute.String("mutation_type", "sidecar"),
attribute.String("status", status),
attribute.String("namespace", namespace),
Contributor


Can we use semantic conventions package for those attributes that are already there? Like k8s.namespace.name

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:
Contributor


Can you explain here what metrics this PR is adding, along with any other relevant details?

var _ podmutation.PodMutator = (*instPodMutator)(nil)

func NewMutator(logger logr.Logger, client client.Client, recorder record.EventRecorder, cfg config.Config) podmutation.PodMutator {
func NewMutator(logger logr.Logger, client client.Client, recorder record.EventRecorder, cfg config.Config, metrics *podmutation.PodMutationMetrics) podmutation.PodMutator {
Contributor


Instead of repeating this here for each language, maybe we can do something more table-driven.

var crdMetrics *otelv1beta1.Metrics
meterProvider, metricsErr := otelv1beta1.BootstrapMetrics()
if metricsErr != nil {
setupLog.Error(metricsErr, "Error bootstrapping metrics")
Contributor


Since we continue, meterProvider can end up nil, and that can cause a panic later in methods that use meterProvider. Can you add checks to avoid that?

Contributor Author


Fixed by adding a check for meterProvider != nil before initializing metrics, and by using the no-op metrics pattern to prevent panics in other code paths.

Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
@github-actions
Contributor

github-actions bot commented Jan 23, 2026

E2E Test Results

 36 files  +2  229 suites  +2   2h 3m 54s ⏱️ +35s
 91 tests +1   90 ✅ ±0  0 💤 ±0  1 ❌ +1 
233 runs  +2  231 ✅ ±0  0 💤 ±0  2 ❌ +2 

For more details on these failures, see this check.

Results for commit d9935fd. ± Comparison against base commit 36d6c75.

♻️ This comment has been updated with latest results.

Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
… rbac issues

Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
@Arpit529Srivastava
Contributor Author

@iblancasa
I'm running into issues verifying webhook metrics in the E2E tests due to auth and permission constraints in the test environment. I've tried a few common approaches, but each one hits a blocker:

  1. Service/pod proxy (kubectl get --raw)
    Tried using the API server proxy to fetch metrics (for example, /api/v1/namespaces/.../pods/https:${POD_NAME}:8443/proxy/metrics).
    Result: fails with "error: you must be logged in to the server".
    Reason: the test runner's ServiceAccount appears to be missing permissions for the pods/proxy or services/proxy subresources.

  2. Port-forward (kubectl port-forward + curl)
    Tried bypassing the API server by port-forwarding locally and curling the endpoint.
    Result: returns 401 Unauthorized.
    Reason: the operator runs with --metrics-secure=true by default, so the metrics endpoint rejects unauthenticated requests.

What's the recommended way to verify secured metrics in this E2E environment? Should the test disable secure metrics (--metrics-secure=false), or is there a supported way (token / auth mechanism) to authenticate against the metrics endpoint?

@Arpit529Srivastava
Contributor Author

@iblancasa a gentle ping, thanks :)



Development

Successfully merging this pull request may close these issues.

Improve Operator Pod Mutation Observability

2 participants