Serve /metrics/custom from the catalog (2/3)#36751
Open
alex-hunt-materialize wants to merge 7 commits into
Open
Serve /metrics/custom from the catalog (2/3)#36751alex-hunt-materialize wants to merge 7 commits into
/metrics/custom from the catalog (2/3)#36751alex-hunt-materialize wants to merge 7 commits into
Conversation
af8c295 to
232e2da
Compare
def-
reviewed
Jun 2, 2026
def-
left a comment
Contributor
There was a problem hiding this comment.
If you create a metric that isn't a valid Prometheus identifier we panic:
diff --git a/test/http-auth/mzcompose.py b/test/http-auth/mzcompose.py
index f533adfb6d..f5e05c4820 100644
--- a/test/http-auth/mzcompose.py
+++ b/test/http-auth/mzcompose.py
@@ -7,6 +7,10 @@
# the Business Source License, use of this software will be governed
# by the Apache License, Version 2.0.
+import time
+from dataclasses import dataclass
+
+import psycopg
import requests
from materialize import MZ_ROOT
@@ -17,6 +21,62 @@ SERVICES = [
Materialized(),
]
+# The canonical no-auth CI listener config enables the `endpoint_api` route
+# (`/metrics/custom`) on the external HTTP listener (6876), like CI/orchestratord.
+NO_AUTH_LISTENERS = f"{MZ_ROOT}/src/materialized/ci/listener_configs/no_auth.json"
+
+
+@dataclass
+class MetricTrigger:
+ case: str # test-case suffix + unique object-name seed
+ view_body: str # the view's non-value columns become the metric's labels
+ metric_ident: str # becomes the Prometheus metric name verbatim
+ help_text: str # the `HELP '...'` literal
+ why: str
+
+
+# `CREATE METRIC` accepts each of these, but the resulting Prometheus `Desc` is
+# invalid, so `MetricsRegistry::register`'s `.expect("defining a gauge vec")`
+# (src/ore/src/metrics.rs) panics on the next scrape. All views expose `count`.
+METRIC_TRIGGERS = [
+ MetricTrigger(
+ "invalid_metric_name",
+ "SELECT 1::int8 AS count",
+ '"http-requests"',
+ "h",
+ "a hyphen in the metric name",
+ ),
+ MetricTrigger(
+ "invalid_label_name",
+ "SELECT 1::int8 AS count, 'ok'::text AS \"2xx\"",
+ "responses",
+ "h",
+ "a label name starting with a digit (2xx)",
+ ),
+ MetricTrigger(
+ "empty_help",
+ "SELECT 1::int8 AS count",
+ "leads",
+ "",
+ "empty HELP text",
+ ),
+]
+
+
+def environmentd_alive(c: Composition, timeout: float = 20.0) -> bool:
+ """Whether environmentd answers `SELECT 1`. On the buggy path it aborted and,
+ with no restart policy, stays down — so this exhausts `timeout`."""
+ deadline = time.monotonic() + timeout
+ while True:
+ try:
+ if c.sql_query("SELECT 1", reuse_connection=False)[0][0] == 1:
+ return True
+ except Exception:
+ pass
+ if time.monotonic() >= deadline:
+ return False
+ time.sleep(1)
+
def workflow_default(c: Composition) -> None:
with c.override(
@@ -98,3 +158,65 @@ def workflow_default(c: Composition) -> None:
"permission denied"
in r.json()["results"][0]["error"]["message"].lower()
)
+
+ # `handle_metrics_custom` registers a Prometheus gauge vector per metric via
+ # `MetricsRegistry::register`, which panics — rather than 500s — on an invalid
+ # `Desc` (see METRIC_TRIGGERS). environmentd's enhanced panic handler turns
+ # that into `process::abort()`, so one scrape takes the whole process down,
+ # and re-aborts on every later scrape since the bad metric lives in the
+ # catalog. The `no_auth` listener exposes `endpoint_api` like CI/orchestratord.
+ # Invariant (fix-agnostic): such a scrape must not abort environmentd — a
+ # plan-time rejection or a graceful HTTP error are both fine.
+ with c.override(Materialized(listeners_config_path=NO_AUTH_LISTENERS)):
+ for trig in METRIC_TRIGGERS:
+ with c.test_case(f"metrics_custom_{trig.case}"):
+ # Boot, or recover from the previous case's abort (env boots fine
+ # with a bad metric in the catalog; it only crashes on scrape).
+ c.up("materialized")
+ assert environmentd_alive(c), f"environmentd down before {trig.case}"
+
+ # RBAC is irrelevant to the panic; disabling it lets the anonymous
+ # HTTP user reach registration without grants. Re-applied (with
+ # idempotent drops) so each case is independent of a crash+restart.
+ c.sql(
+ "ALTER SYSTEM SET enable_prometheus_metrics_api = true;"
+ " ALTER SYSTEM SET enable_rbac_checks = false;",
+ user="mz_system",
+ port=6877,
+ print_statement=False,
+ )
+ view, api = f"v_{trig.case}", f"api_{trig.case}"
+ c.sql(
+ f"DROP API IF EXISTS materialize.public.{api} CASCADE;"
+ f" DROP VIEW IF EXISTS materialize.public.{view} CASCADE;"
+ f" CREATE VIEW materialize.public.{view} AS {trig.view_body};"
+ f" CREATE API materialize.public.{api} FORMAT PROMETHEUS;",
+ print_statement=False,
+ )
+ try:
+ c.sql(
+ f"CREATE METRIC materialize.public.{trig.metric_ident}"
+ f" IN API materialize.public.{api} AS (TYPE 'gauge',"
+ f" HELP '{trig.help_text}',"
+ f" VALUES FROM materialize.public.{view},"
+ f" VALUE COLUMN 'count')",
+ print_statement=False,
+ )
+ except psycopg.Error as e:
+ # Rejected at plan time — the dangerous object never exists.
+ print(f"[{trig.case}] rejected at CREATE METRIC ({trig.why}): {e}")
+ continue
+
+ # On the buggy path the abort resets the connection mid-request.
+ url = (
+ f"http://localhost:{c.port('materialized', 6876)}"
+ f"/metrics/custom/materialize/public/{api}"
+ )
+ try:
+ requests.get(url, timeout=30)
+ except requests.exceptions.RequestException:
+ pass
+ assert environmentd_alive(c), (
+ f"environmentd aborted after scraping a metric with {trig.why}"
+ " — handle_metrics_custom panicked in MetricsRegistry::register"
+ )Fails with:
thread 'tokio:work-6' panicked at src/ore/src/metrics.rs:321:41:
defining a gauge vec: Msg("'http-requests' is not a valid metric name")
5: core::panicking::panic_fmt
6: core::result::unwrap_failed
7: <prometheus::vec::MetricVec<prometheus::gauge::GaugeVecBuilder<prometheus::atomic64::AtomicF64>> as mz_ore::metrics::MakeCollector>::make_collector
8: <mz_ore::metrics::MetricsRegistry>::register::<prometheus::vec::MetricVec<prometheus::gauge::GaugeVecBuilder<prometheus::atomic64::AtomicF64>>>
9: <alloc::collections::btree::map::entry::Entry<alloc::string::String, prometheus::vec::MetricVec<prometheus::gauge::GaugeVecBuilder<prometheus::atomic64::AtomicF64>>>>::or_insert_with::<mz_environmentd::http::sql::handle_metrics_custom::{closure#0}::{closure#4}>
10: <mz_environmentd::http::sql::handle_metrics_custom as axum::handler::Handler<(axum_core::extract::private::ViaParts, mz_environmentd::http::AuthedClient, axum::extract::path::Path<(alloc::string::String, alloc::string::String, alloc::string::String)>), ()>>::call::{closure#0}
11: <futures_util::future::future::map::Map<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = http::response::Response<axum_core::body::Body>> + core::marker::Send>>, fn(http::response::Response<axum_core::body::Body>) -> core::result::Result<http::response::Response<axum_core::body::Body>, core::convert::Infallible>> as core::future::future::Future>::poll
12: <axum::util::MapIntoResponseFuture<axum::handler::future::IntoServiceFuture<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = http::response::Response<axum_core::body::Body>> + core::marker::Send>>>> as core::future::future::Future>::poll
13: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
14: <axum::util::MapIntoResponseFuture<axum::routing::route::RouteFuture<core::convert::Infallible>> as core::future::future::Future>::poll
15: <axum::middleware::from_fn::Next>::run::{closure#0}
16: mz_environmentd::http::http_auth::{closure#0}
17: <axum::middleware::from_fn::FromFn<<mz_environmentd::http::HttpServer>::new::{closure#0}, (), axum::routing::route::Route, (http::request::Request<axum_core::body::Body>,)> as tower_service::Service<http::request::Request<axum_core::body::Body>>>::call::{closure#0}
18: <axum::middleware::from_fn::ResponseFuture as core::future::future::Future>::poll
19: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::middleware::from_fn::ResponseFuture>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
20: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::middleware::from_fn::ResponseFuture, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
21: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
22: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
23: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
24: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
25: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
26: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
27: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
28: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
29: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
30: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
31: <mz_environmentd::http::metrics::PrometheusFuture<axum::routing::route::RouteFuture<core::convert::Infallible>> as core::future::future::Future>::poll
32: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<mz_environmentd::http::metrics::PrometheusFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
33: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<mz_environmentd::http::metrics::PrometheusFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
34: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
35: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
36: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
37: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
38: <hyper::proto::h1::dispatch::Dispatcher<hyper::proto::h1::dispatch::Server<hyper::service::util::ServiceFn<<mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection<tokio_metrics::task::TaskIntervals>::{closure#0}::{closure#1}, hyper::body::incoming::Incoming>, hyper::body::incoming::Incoming>, axum_core::body::Body, hyper_openssl::client::legacy::MaybeHttpsStream<hyper_util::rt::tokio::TokioIo<mz_server_core::Connection>>, hyper::proto::h1::role::Server>>::poll_catch
39: <hyper::server::conn::http1::UpgradeableConnection<hyper_openssl::client::legacy::MaybeHttpsStream<hyper_util::rt::tokio::TokioIo<mz_server_core::Connection>>, hyper::service::util::ServiceFn<<mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection<tokio_metrics::task::TaskIntervals>::{closure#0}::{closure#1}, hyper::body::incoming::Incoming>> as core::future::future::Future>::poll
40: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<hyper::server::conn::http1::UpgradeableConnection<hyper_openssl::client::legacy::MaybeHttpsStream<hyper_util::rt::tokio::TokioIo<mz_server_core::Connection>>, hyper::service::util::ServiceFn<<mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection<tokio_metrics::task::TaskIntervals>::{closure#0}::{closure#1}, hyper::body::incoming::Incoming>>>, futures_util::fns::MapErrFn<futures_util::fns::IntoFn<anyhow::Error>>> as core::future::future::Future>::poll
41: <mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection::<tokio_metrics::task::TaskIntervals>::{closure#0}
42: <tokio_metrics::task::Instrumented<mz_server_core::serve<mz_environmentd::http::HttpServer, core::pin::Pin<alloc::boxed::Box<dyn mz_server_core::ConnectionStream<Item = core::result::Result<tokio::net::tcp::stream::TcpStream, std::io::error::Error>>>>>::{closure#0}::{closure#3}> as core::future::future::Future>::poll
43: <tracing::instrument::Instrumented<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>> as core::future::future::Future>::poll
44: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll
45: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll
46: <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task
47: <tokio::runtime::scheduler::multi_thread::worker::Context>::run
48: tokio::runtime::context::runtime::enter_runtime::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}, ()>
49: tokio::runtime::scheduler::multi_thread::worker::run
50: <tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>> as core::future::future::Future>::poll
51: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
52: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
53: <tokio::runtime::task::UnownedTask<tokio::runtime::blocking::schedule::BlockingSchedule>>::run
54: <tokio::runtime::blocking::pool::Inner>::run
558f4df to
004d0f8
Compare
Add grammar, AST nodes, and parser test coverage for the new Prometheus-metrics DDL: * CREATE API <name> FORMAT PROMETHEUS [IN CLUSTER <c>] * DROP API <name> * CREATE METRIC <name> IN API <api> AS (TYPE, HELP, VALUES FROM, VALUE COLUMN) * DROP METRIC <name> * GRANT/REVOKE ... ON API <name> Adds the Api, Help, Listener, Metric, and Prometheus keywords. No planning or execution yet; statements parse but are not accepted by the planner. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Introduce Api and Metric as first-class catalog items so CREATE API / CREATE METRIC can persist: * New CatalogItem::Api / CatalogItem::Metric in-memory objects, durable protobuf objects (objects_v86), and the v85->v86 upgrade migration. * mz_catalog builtin entries (mz_apis, mz_metrics) and OID allocations. * parse_catalog_create_sql learns to render CREATE API / CREATE METRIC back to JSON (stamping cluster_id / api_id) for catalog introspection. * audit-log event types for the new items. This only defines the catalog plumbing; nothing creates these items yet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the planning and execution path for the new statements:
* Plan/Sequence: plan_create_api allocates the qualified name, resolves
the cluster (stamping the resolved cluster_id into the AST so persisted
SQL rehydrates after ALTER CLUSTER RENAME). plan_create_metric resolves
the IN API target (must be a CatalogItemType::Api), validates VALUES FROM
is a relation containing a numeric VALUE COLUMN, and enforces TYPE = gauge
for now. The sequencer routes both into transact.
* Name resolution: the Aug pass resolves "IN API <api>" and "VALUES FROM
<view>" to CatalogItemIds, so the planner relies on resolved targets and
the dependency tracker records Metric -> {Api, View} edges.
* RBAC: CREATE API requires USAGE on the cluster (not CREATE); CREATE METRIC
requires USAGE on the API and SELECT on the VALUES FROM relation. GRANT/
REVOKE ... ON API is wired through acl planning. DROP follows standard
item RBAC.
* Feature flag: both statements are gated on enable_prometheus_metrics_api
(LD-controlled), off by default.
* sqllogictest coverage for parsing, planning errors, RBAC, dependency
enforcement, and round-trip.
Also handles the new ExecuteResponse::Created{Api,Metric} variants in the
HTTP SQL result match so environmentd still compiles; the metrics endpoint
that serves these objects lands separately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A metric's name, HELP text, and label names (its relation's non-value columns) are handed verbatim to Prometheus at scrape time, where `MetricsRegistry::register` *panics* on an invalid identifier or empty HELP. Validate at CREATE METRIC time instead, by constructing a real `prometheus::Desc` so the rules can never drift from what `register` accepts. A column added to a table that feeds a metric becomes a new label, so ALTER TABLE ADD COLUMN revalidates the new column name the same way (a freshly added column is always a label, never the pre-existing value column). MV apply-replacement needs no such check: replacements are required to be schema-identical to their target, so a metric's labels and value column can't change underneath it; a comment notes this for any future schema-evolution work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the HTTP endpoint that exposes APIs created with CREATE API:
* GET /metrics/custom/{database}/{schema}/{name} resolves the API from the
catalog, peeks every metric's VALUES FROM relation on the API's cluster,
and renders the Prometheus text exposition format.
* The route is gated per-listener by a new endpoint_api flag on
HttpRoutesEnabled, wired through orchestratord, the test harness, the
sqllogictest runner, and the no_auth listener config.
* Scrape-time RBAC: each request runs as the authenticated role and must
hold USAGE on the API and SELECT on each metric relation. Enforcement
reuses plan-path gating via the new rbac::is_rbac_enforced_for_session
helper, so the endpoint's checks match CREATE-time validation exactly.
* environmentd server tests cover happy path, empty exposures, and the
USAGE-on-API / SELECT-on-relation authorization paths.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`handle_metrics_custom` registers a Prometheus gauge vector per metric; plain `register` panics on an invalid `Desc` (an invalid metric or label name, empty HELP) and on a failed registry insertion. CREATE METRIC and ALTER TABLE now reject names/labels/HELP that Prometheus would refuse, but a metric's relation can still change shape underneath it (e.g. a source schema change), so the scrape path must not panic on a bad metric. Add a fallible `MetricsRegistry::try_register` (backed by a new `MakeCollector::try_make_collector`) that returns the underlying `prometheus::Error` instead of panicking, and use it in `handle_metrics_custom` to skip the offending metric with a warning rather than aborting the whole scrape. `register` now delegates to `try_register` and keeps its panic-on-invalid contract for statically-defined metrics. Adds an http-auth regression test asserting that scraping such a metric never takes environmentd down. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
004d0f8 to
a6094e1
Compare
Contributor
Author
|
@def- Thanks for catching that. The fix is split over an additional commit in each PR, since some is SQL and some is serving. |
|
We'd love to have this at Notion Labs, Inc |
Alphadelta14
reviewed
Jun 8, 2026
| } | ||
| }; | ||
| metrics.push(ResolvedMetric { | ||
| name: entry.name().item.clone(), |
Member
There was a problem hiding this comment.
Recommend namespacing these a little bit.
Alloy (as an example of another thing that supports custom metrics) puts theirs behind loki_process_custom_. I think we should do a similar mz_custom_ prefix.
| // for the first Err and bubble it up. | ||
| for r in results { | ||
| if let SqlResult::Err { error, .. } = r { | ||
| let status = if error.code == "42501" { |
Member
There was a problem hiding this comment.
is there a more maintainable way to handle this check?
Namespace user-defined metrics in the Prometheus exposition so they cannot collide with Materialize's built-in mz_* metrics. The prefix is injected at scrape time in handle_metrics_custom; the user-supplied name is unchanged in the catalog and mz_metrics. No extra validation needed: the prometheus crate validates names by charset only (no length limit), and mz_custom_ is all valid characters, so a valid user name stays valid once prefixed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
9d9b008 to
d328be2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
PR 1/3 added the SQL to define an
APIand itsMETRICs, but nothing serves them yet. This PR adds the HTTP endpoint that makes the feature actually usable: a Prometheus scrape target that evaluates user-defined metrics on demand.Design doc: #34544.
Still gated behind
enable_prometheus_metrics_api(default off) and, additionally, behind a per-listener route flag — so it stays off until both the flag is enabled and a listener opts in.Description
GET /metrics/custom/{database}/{schema}/{name}resolves the API from the catalog, peeks every metric'sVALUES FROMrelation on the API's cluster, and renders the result in the Prometheus text exposition format. Empty exposures (an API with no metrics, or metrics over an empty relation) return200 OKwith no samples.endpoint_apiflag onHttpRoutesEnabledcontrols whether the route is mounted. APIs are not bound to a specific listener; the route is served on every listener that sets the flag. Wired through orchestratord, the test harness, the sqllogictest runner, and theno_authlistener config.USAGEon the API andSELECTon each metric's relation. Enforcement reuses the plan-path gating logic via a newrbac::is_rbac_enforced_for_sessionhelper, so the endpoint's checks matchCREATE-time validation exactly (rather than re-deriving the superuser / RBAC-disabled / system-role rules and risking drift). MissingUSAGEon the API returns404; missingSELECTon a relation returns403.Verification
environmentdserver tests (src/environmentd/tests/server.rs) cover the happy path, empty exposures, and the authorization paths (USAGE-on-API andSELECT-on-relation), including the unauthenticatedanonymous_http_usercase on aNone-authenticator listener.