Add OpenTelemetry metrics support for RelativeLoadBalancerStrategy#1146

Draft
aadityaraj7769 wants to merge 8 commits into linkedin:master from aadityaraj7769:adiraj/relativeloadbalancerstrategy-otel-migration

@aadityaraj7769
Contributor

Summary

This PR adds OpenTelemetry (OTel) metrics instrumentation to the RelativeLoadBalancerStrategy.

Changes

  1. New Interface: Added the RelativeLoadBalancerStrategyOtelMetricsProvider interface for collecting load balancer metrics via OpenTelemetry:
  • Per-call host latency measurements (recorded via a per-call duration listener on each TrackerClient)
  • Host health tracking (total hosts, unhealthy hosts, quarantined hosts)
  • Hash ring sizing
  2. No-op Implementation: Added NoOpRelativeLoadBalancerStrategyOtelMetricsProvider as the default implementation when metrics are disabled
  3. Integration:
  • Integrated the metrics provider into the StateUpdater constructor via dependency injection
  • Per-call host latency is emitted via TrackerClient.setPerCallDurationListener(), which fires on every individual request, so that percentiles (p50, p90, p99), averages, min, max, and standard deviation can be derived from the recorded distribution
  • Gauge metrics (host counts, quarantine counts, hash ring points) are emitted after every scheduled partition state update via emitOtelMetrics()
  • Added a constructor overload in RelativeLoadBalancerStrategyFactory that accepts a custom RelativeLoadBalancerStrategyOtelMetricsProvider
  • All metrics are tagged with two dimensions: serviceName and scheme
  4. Metrics Tracked:
  • Histogram: per-host per-call latency
  • Gauges: total hosts across all partitions, unhealthy host count, quarantined host count, total points in hash ring
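
The provider, its no-op default, and the constructor-overload pattern described above could look roughly like the sketch below. The two provider type names come from this PR; every method name and signature is an assumption made for illustration, and StrategyFactorySketch is a hypothetical, heavily simplified stand-in for RelativeLoadBalancerStrategyFactory, whose real constructors take more arguments.

```java
import java.util.Objects;

// Hypothetical method names and signatures: the PR description names only the
// two provider types, not their members.
interface RelativeLoadBalancerStrategyOtelMetricsProvider {
  void recordHostLatency(String serviceName, String scheme, long latencyMs);
  void updateTotalHostsInAllPartitionsCount(String serviceName, String scheme, int count);
  void updateUnhealthyHostsCount(String serviceName, String scheme, int count);
  void updateQuarantineHostsCount(String serviceName, String scheme, int count);
  void updateTotalPointsInHashRing(String serviceName, String scheme, int points);
}

// Default used when metrics are disabled: every method deliberately does nothing.
final class NoOpRelativeLoadBalancerStrategyOtelMetricsProvider
    implements RelativeLoadBalancerStrategyOtelMetricsProvider {
  @Override public void recordHostLatency(String serviceName, String scheme, long latencyMs) { }
  @Override public void updateTotalHostsInAllPartitionsCount(String serviceName, String scheme, int count) { }
  @Override public void updateUnhealthyHostsCount(String serviceName, String scheme, int count) { }
  @Override public void updateQuarantineHostsCount(String serviceName, String scheme, int count) { }
  @Override public void updateTotalPointsInHashRing(String serviceName, String scheme, int points) { }
}

// Hypothetical stand-in for the factory; it shows only how an existing
// constructor can delegate to a new overload with the no-op provider.
final class StrategyFactorySketch {
  private final RelativeLoadBalancerStrategyOtelMetricsProvider metricsProvider;

  // Existing callers keep using this constructor and get the no-op provider.
  StrategyFactorySketch() {
    this(new NoOpRelativeLoadBalancerStrategyOtelMetricsProvider());
  }

  // New overload for callers that want to supply their own provider.
  StrategyFactorySketch(RelativeLoadBalancerStrategyOtelMetricsProvider metricsProvider) {
    this.metricsProvider = Objects.requireNonNull(metricsProvider, "metricsProvider");
  }

  RelativeLoadBalancerStrategyOtelMetricsProvider metricsProvider() {
    return metricsProvider;
  }
}
```

Because the no-op provider is always present, call sites in this sketch never need a null check before emitting a metric.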

Backward Compatibility

  • Fully backward compatible — all existing constructors default to NoOpRelativeLoadBalancerStrategyOtelMetricsProvider

About RelativeLoadBalancerStrategySensor

The RelativeLoadBalancerStrategy Sensor tracks metrics for D2's relative load balancer, which dynamically adjusts host health scores (0.0–1.0) based on per-host call statistics relative to the overall cluster. It monitors host latency distributions, cluster health composition, and hash ring sizing. This sensor enables server-side observability for how traffic is being distributed across hosts in a service cluster.

New OTel Metrics

Metric Naming Pattern: `D2.RelativeLb.<Metric>`

Dimensions

| Attribute Key | Description | Applied To |
| --- | --- | --- |
| D2.Service.Name | The service being load-balanced | All metrics |
| D2.Scheme | Load balancer scheme (e.g., http, https) | All metrics |
| D2.Host.Status | "Unhealthy" or "Quarantine" | DegradedHostsCount only |

ExponentialHistogram

  • D2.RelativeLb.HostLatency (ms): Records the latency of each individual call to a host. Percentiles (p50, p90, p99), average, min, max, and standard deviation can then be derived from the recorded distribution.
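
To illustrate why recording every call makes percentiles recoverable, here is a minimal plain-Java sketch of base-2 exponential bucketing, following the index mapping described in the OpenTelemetry metrics data model (bucket i covers the range (base^i, base^(i+1)] with base = 2^(2^-scale)). It is not the OTel SDK's implementation and not code from this PR: the class name, the scale choice, and the upper-bound percentile estimate are all assumptions, and the zero bucket is omitted for brevity.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a tiny exponential histogram for positive latencies.
final class ExponentialLatencyHistogramSketch {
  private final int scale;
  private final TreeMap<Integer, Long> bucketCounts = new TreeMap<>();
  private long totalCount;

  ExponentialLatencyHistogramSketch(int scale) {
    this.scale = scale;
  }

  // Maps a positive value to the bucket whose range (base^i, base^(i+1)] contains it.
  static int bucketIndex(double value, int scale) {
    double scaleFactor = Math.scalb(1.0 / Math.log(2.0), scale); // 2^scale / ln(2)
    return (int) Math.ceil(Math.log(value) * scaleFactor) - 1;
  }

  // Upper bound of bucket i: base^(i+1) = 2^((i+1) * 2^-scale).
  double upperBound(int index) {
    return Math.pow(2.0, Math.scalb((double) (index + 1), -scale));
  }

  // Called once per request with that request's latency.
  void record(double latencyMs) {
    if (latencyMs <= 0) {
      return; // this sketch skips the zero bucket a full implementation would keep
    }
    bucketCounts.merge(bucketIndex(latencyMs, scale), 1L, Long::sum);
    totalCount++;
  }

  // Conservative estimate: the upper bound of the bucket holding the requested rank.
  double estimatePercentile(double fraction) {
    if (totalCount == 0) {
      throw new IllegalStateException("no measurements recorded");
    }
    long rank = Math.max(1L, (long) Math.ceil(fraction * totalCount));
    long seen = 0;
    for (Map.Entry<Integer, Long> bucket : bucketCounts.entrySet()) {
      seen += bucket.getValue();
      if (seen >= rank) {
        return upperBound(bucket.getKey());
      }
    }
    return upperBound(bucketCounts.lastKey());
  }
}
```

With scale 0 the buckets are (1, 2], (2, 4], (4, 8], and so on, so a 3 ms call lands in the bucket whose upper bound is 4 ms. Higher scales narrow the buckets and tighten the percentile estimate at the cost of more buckets.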

Gauges

  • D2.RelativeLb.AllPartitionHostsCount ({host}): Total number of hosts across all partitions regardless of health status; the full host population the load balancer is aware of
  • D2.RelativeLb.DegradedHostsCount ({host}): Number of hosts in a degraded state, grouped by D2.Host.Status:
    • D2.Host.Status = "Unhealthy": hosts whose health score has been reduced due to high latency or error rate
    • D2.Host.Status = "Quarantine": hosts currently in quarantine pending health check recovery
  • D2.RelativeLb.PointsInHashRing ({point}): Total number of points in the consistent hash ring, reflecting the effective traffic weight distribution across hosts
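
As a rough illustration of how the DegradedHostsCount gauge could be split by D2.Host.Status, the sketch below counts hosts from a snapshot of health scores. Everything in it is an assumption rather than this PR's code: the class and method names are hypothetical, "Unhealthy" is taken to mean a score below 1.0, and quarantined hosts are counted only under "Quarantine".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: derives the two DegradedHostsCount values from a state snapshot.
final class DegradedHostsGaugeSketch {
  // Assumption: a score of 1.0 means the host's health has not been reduced.
  static final double FULLY_HEALTHY_SCORE = 1.0;

  // Returns one count per D2.Host.Status value ("Unhealthy", "Quarantine").
  static Map<String, Integer> countByStatus(Map<String, Double> healthScoreByHost,
                                            Set<String> quarantinedHosts) {
    int unhealthy = 0;
    for (Map.Entry<String, Double> entry : healthScoreByHost.entrySet()) {
      boolean quarantined = quarantinedHosts.contains(entry.getKey());
      // Assumption: a quarantined host is reported only under "Quarantine".
      if (!quarantined && entry.getValue() < FULLY_HEALTHY_SCORE) {
        unhealthy++;
      }
    }
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("Unhealthy", unhealthy);
    counts.put("Quarantine", quarantinedHosts.size());
    return counts;
  }
}
```

Each entry of the returned map would be emitted as one gauge data point, with the map key as the D2.Host.Status attribute alongside D2.Service.Name and D2.Scheme.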
