feat: implement pod mutation observability metrics#4644
feat: implement pod mutation observability metrics#4644Arpit529Srivastava wants to merge 15 commits intoopen-telemetry:mainfrom
Conversation
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
|
/cc @swiatekm @iblancasa PTAL. |
iblancasa
left a comment
There was a problem hiding this comment.
Maybe we should add a new option to enable/disable recording these metrics.
I also miss some E2E tests to ensure the feature works properly.
| } | ||
|
|
||
| func (m *PodMutationMetrics) RecordSidecarMutation(ctx context.Context, status, reason, namespace string) { | ||
| if m == nil { |
There was a problem hiding this comment.
I think this can happen for tests, right? How about creating an interface and have a noopMetrics type for the tests?
| attrs := []attribute.KeyValue{ | ||
| attribute.String("mutation_type", "sidecar"), | ||
| attribute.String("status", status), | ||
| attribute.String("namespace", namespace), |
There was a problem hiding this comment.
I'm not sure about this... because of cardinality
There was a problem hiding this comment.
i recommend/think keeping it since it’s critical for debugging and has bounded cardinality. with the addition of the --enable-webhook-metrics flag, this is now opt-in and minimizes risk. happy to discuss other options.
|
|
||
| if inst, err = pm.getInstrumentationInstance(ctx, ns, pod, annotationInjectJava); err != nil { | ||
| // we still allow the pod to be created, but we log a message to the operator's logs | ||
| pm.metrics.RecordInstrumentationMutation(ctx, "error", "lookup_failed", "java", ns.Name) |
There was a problem hiding this comment.
I think we should define some constants:
const (
StatusSuccess = "success"
StatusSkipped = "skipped"
StatusRejected = "rejected"
StatusError = "error"
ReasonAlreadyInstrumented = "already_instrumented"
ReasonFeatureDisabled = "feature_disabled"
ReasonLookupFailed = "lookup_failed"
)| attrs := []attribute.KeyValue{ | ||
| attribute.String("mutation_type", "sidecar"), | ||
| attribute.String("status", status), | ||
| attribute.String("namespace", namespace), |
There was a problem hiding this comment.
Can we use semantic conventions package for those attributes that are already there? Like k8s.namespace.name
| # (Optional) One or more lines of additional information to render under the primary note. | ||
| # These lines will be padded with 2 spaces and then inserted directly into the document. | ||
| # Use pipe (|) for multiline entries. | ||
| subtext: |
There was a problem hiding this comment.
Can you explain here what metrics this PR is adding and other stuff?
| var _ podmutation.PodMutator = (*instPodMutator)(nil) | ||
|
|
||
| func NewMutator(logger logr.Logger, client client.Client, recorder record.EventRecorder, cfg config.Config) podmutation.PodMutator { | ||
| func NewMutator(logger logr.Logger, client client.Client, recorder record.EventRecorder, cfg config.Config, metrics *podmutation.PodMutationMetrics) podmutation.PodMutator { |
There was a problem hiding this comment.
Instead of repeating here for each language.. maybe we can do something more table-driven
| var crdMetrics *otelv1beta1.Metrics | ||
| meterProvider, metricsErr := otelv1beta1.BootstrapMetrics() | ||
| if metricsErr != nil { | ||
| setupLog.Error(metricsErr, "Error bootstrapping metrics") |
There was a problem hiding this comment.
Since we continue, it can happen meterProvider is nil. And that cause a panic later if methods using meterProvider. Can you add checks to avoid that?
There was a problem hiding this comment.
fixed by adding a check for meterProvider != nil before initializing metrics and using the noop metrics pattern to prevent panics in other code paths.
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
f8580e4 to
b5bb8d8
Compare
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
905af90 to
4d99f84
Compare
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
cd369e3 to
23f3710
Compare
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
E2E Test Results 36 files +2 229 suites +2 2h 3m 54s ⏱️ +35s For more details on these failures, see this check. Results for commit d9935fd. ± Comparison against base commit 36d6c75. ♻️ This comment has been updated with latest results. |
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
… rbac issues Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
Signed-off-by: arpit529srivastava <arpitsrivastava529@gmail.com>
|
@iblancasa
what’s the recommended way to verify secured metrics in this e2e environment? should the test disable secure metrics ( |
|
@iblancasa a gentle ping, thanks :) |
Description:
adds observability to the pod mutation webhook by introducing a new prometheus counter
opentelemetry_operator_pod_mutations_total. this allows operators to track success, failure and skip rates for both sidecar injection and auto-instrumentation across the cluster. the metric includes detailed labels such as status, reason, type, language, and namespace, making it easier to distinguish between cases like configuration errors, disabled feature gates, or missing resources.Link to tracking Issue(s):
Testing:
new unit test added at
internal/webhook/podmutation/metrics_test.goto verify that the counter increments correctly with the expected label set.integration checks confirm that both the sidecar and instrumentation mutators invoke the metrics recorder at all critical decision points.
existing unit tests were updated to ensure there are no regressions in the mutator logic.
Documentation: