Optimize prom stats endpoint #44658

Open
nshipilov wants to merge 2 commits into envoyproxy:main from nshipilov:optimize-prom-stats-endpoint
@nshipilov nshipilov commented Apr 24, 2026

Commit Message: Optimize prom stats endpoint.

Additional Description:

This PR optimizes the Prometheus stats endpoint. In practice, at Uber, it cuts Prometheus stats scrapes from ~250-300 ms to ~125-150 ms.

The changes rely on a few observations, listed here in the order they appear in the diff:

(1) ::sanitizeName returns a new string, but in some scenarios the name can be sanitized in place. A new function, ::sanitizeNameInPlace, is used where appropriate.

(2) Primitive*Snapshot types can be modified in place, because they are copies of the underlying metrics from the cluster manager.

For instance, we can modify the already-copied, soon-to-be-destructed Tags directly instead of incurring additional copies in formattedTags().

Similarly, in ::addLabelsToMetric we move the tag name into the proto directly instead of copying it.

(3) Instead of maintaining a red-black-tree std::map<...> groups to give a sorted presentation of the stats, we maintain a mapping of string/StatName -> vector<StatType*> and iterate over the sorted set of keys. Building the std::map is a significant cost when there are many groups, so this is the biggest perf improvement in the diff.

(4) PrometheusStatsFormatter::metricName is only ever called with newly allocated strings, so it can also sanitize its argument in place.
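
To make observations (1), (3), and (4) concrete, here is a minimal standalone sketch, not the code from this diff. It assumes a hypothetical `FakeStat` type, a hypothetical sanitizing rule (anything outside `[a-zA-Z0-9_]` becomes `_`), and `std::unordered_map` as the unsorted mapping; the real Envoy types, rules, and container choice may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a snapshot entry; not Envoy's real type.
struct FakeStat {
  std::string tag_extracted_name;
  unsigned long value;
};

// Hypothetical in-place sanitizer: overwrites each disallowed character
// with '_' and never allocates a second string. The character rule is an
// assumption made for illustration only.
void sanitizeNameInPlace(std::string& name) {
  for (char& c : name) {
    if (!std::isalnum(static_cast<unsigned char>(c)) && c != '_') {
      c = '_';
    }
  }
}

// Unsorted grouping: name -> pointers to the stats sharing that name.
using Groups = std::unordered_map<std::string, std::vector<const FakeStat*>>;

Groups groupByName(const std::vector<FakeStat>& stats) {
  Groups groups;
  groups.reserve(stats.size());
  for (const FakeStat& stat : stats) {
    groups[stat.tag_extracted_name].push_back(&stat);
  }
  return groups;
}

// Sorts the keys once, after grouping, instead of keeping a tree ordered
// on every insertion. Pointers to keys stay valid while `groups` is alive
// and no element is erased.
std::vector<const std::string*> sortedKeys(const Groups& groups) {
  std::vector<const std::string*> keys;
  keys.reserve(groups.size());
  for (const auto& entry : groups) {
    keys.push_back(&entry.first);
  }
  std::sort(keys.begin(), keys.end(),
            [](const std::string* a, const std::string* b) { return *a < *b; });
  return keys;
}

// Emits one "<sanitized name> <group size>" line per group, in key order.
std::string render(const std::vector<FakeStat>& stats) {
  const Groups groups = groupByName(stats);
  std::string out;
  for (const std::string* key : sortedKeys(groups)) {
    std::string name = *key;    // a fresh string that this function owns
    sanitizeNameInPlace(name);  // safe to mutate: nothing else refers to it
    out += name;
    out += ' ';
    out += std::to_string(groups.at(*key).size());
    out += '\n';
  }
  return out;
}
```

The design trade-off in this sketch is that insertion into the unsorted map avoids per-insert tree rebalancing and ordered comparisons, and the ordering cost is paid once in a single `std::sort` over the keys.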

Risk Level: Medium

Testing: Benchmarks confirming the improvement, plus the relevant tests in //test/server/admin/prometheus_stats_test.cc

Prior to optimizations:

bazel run -c opt //test/server/admin:stats_handler_speed_test -- --benchmark_filter=.*Prometheus.* --benchmark_repetitions=1

((HEAD detached at 7663e2699b)) nsh@nsh-myenvoy:~/nshipilov/envoy$ bazel run -c opt //test/server/admin:stats_handler_speed_test -- --benchmark_filter=.*Prometheus.* --benchmark_repetitions=1
----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_AllCountersPrometheus/per_endpoint_stats_disabled            4503 ms         4502 ms            1 output per iteration: 260650270
[2026-04-24 21:41:16.628][539802][error] [test/server/admin/stats_handler_speed_test.cc:170] Initializing cluster info; slow to construct and destruct...
BM_AllCountersPrometheus/per_endpoint_stats_enabled             6750 ms         6746 ms            1 output per iteration: 418440799
BM_UsedCountersPrometheus/per_endpoint_stats_disabled            154 ms          154 ms            5 output per iteration: 2660050
BM_UsedCountersPrometheus/per_endpoint_stats_enabled            2625 ms         2624 ms            1 output per iteration: 160450579
BM_FilteredCountersPrometheus/per_endpoint_stats_disabled        512 ms          512 ms            1 output per iteration: 0
BM_FilteredCountersPrometheus/per_endpoint_stats_enabled        1053 ms         1053 ms            1 output per iteration: 0
BM_PrometheusFull/per_endpoint_stats_disabled                   4038 ms         4037 ms            1 output per iteration: 260650270
BM_PrometheusFull/per_endpoint_stats_enabled                    6812 ms         6810 ms            1 output per iteration: 418440799
BM_TraditionalHistogramsPrometheusProtobuf                      3947 ms         3944 ms            1 output per iteration: 138818090 (19 buckets/histogram)
BM_TraditionalHistogramsPrometheusText                          4217 ms         4215 ms            1 output per iteration: 260650270 (19 buckets/histogram)
BM_NativeHistogramsPrometheusProtobuf                           4161 ms         4157 ms            1 output per iteration: 138796890 (max 19 buckets/histogram)

After the improvements:

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_AllCountersPrometheus/per_endpoint_stats_disabled            2825 ms         2823 ms            1 output per iteration: 260650270
[2026-04-24 23:18:26.583][1641986][error] [test/server/admin/stats_handler_speed_test.cc:170] Initializing cluster info; slow to construct and destruct...
BM_AllCountersPrometheus/per_endpoint_stats_enabled             4432 ms         4432 ms            1 output per iteration: 418440799
BM_UsedCountersPrometheus/per_endpoint_stats_disabled            141 ms          141 ms            5 output per iteration: 2660050
BM_UsedCountersPrometheus/per_endpoint_stats_enabled            1781 ms         1781 ms            1 output per iteration: 160450579
BM_FilteredCountersPrometheus/per_endpoint_stats_disabled        504 ms          504 ms            1 output per iteration: 0
BM_FilteredCountersPrometheus/per_endpoint_stats_enabled        1134 ms         1133 ms            1 output per iteration: 0
BM_PrometheusFull/per_endpoint_stats_disabled                   2523 ms         2522 ms            1 output per iteration: 260650270
BM_PrometheusFull/per_endpoint_stats_enabled                    4354 ms         4353 ms            1 output per iteration: 418440799
BM_TraditionalHistogramsPrometheusProtobuf                      2532 ms         2531 ms            1 output per iteration: 138818090 (19 buckets/histogram)
BM_TraditionalHistogramsPrometheusText                          2528 ms         2527 ms            1 output per iteration: 260650270 (19 buckets/histogram)
BM_NativeHistogramsPrometheusProtobuf                           2445 ms         2444 ms            1 output per iteration: 138796890 (max 19 buckets/histogram)

Docs Changes: None

Release Notes: None

Platform Specific Features: None

Signed-off-by: Nick Shipilov <nick.shipilov.n@gmail.com>
@nshipilov nshipilov requested a deployment to external-contributors April 24, 2026 23:31 — with GitHub Actions Waiting
@repokitteh-read-only
Hi @nshipilov, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #44658 was opened by nshipilov.

@ravenblackx
Contributor

@wbpcode you've been looking at stats recently, so you're probably the person most likely to spot anything awful in here. I think it's okay: at first it looked like it was moving values out of stats, but after some digging I concluded that it's only moving values out of snapshots, which I think are discarded when done.