Skip to content

Serve /metrics/custom from the catalog (2/3)#36751

Open
alex-hunt-materialize wants to merge 7 commits into
MaterializeInc:mainfrom
alex-hunt-materialize:prometheus_sql_metrics
Open

Serve /metrics/custom from the catalog (2/3)#36751
alex-hunt-materialize wants to merge 7 commits into
MaterializeInc:mainfrom
alex-hunt-materialize:prometheus_sql_metrics

Conversation

@alex-hunt-materialize

Copy link
Copy Markdown
Contributor

Stacked on 1/3 (CREATE/DROP API and CREATE/DROP METRIC: SQL surface + catalog). Review/merge that first; this PR is based on it, not on main.

Motivation

PR 1/3 added the SQL to define an API and its METRICs, but nothing serves them yet. This PR adds the HTTP endpoint that makes the feature actually usable: a Prometheus scrape target that evaluates user-defined metrics on demand.

Design doc: #34544.

Still gated behind enable_prometheus_metrics_api (default off) and, additionally, behind a per-listener route flag — so it stays off until both the flag is enabled and a listener opts in.

Description

  • EndpointGET /metrics/custom/{database}/{schema}/{name} resolves the API from the catalog, peeks every metric's VALUES FROM relation on the API's cluster, and renders the result in the Prometheus text exposition format. Empty exposures (an API with no metrics, or metrics over an empty relation) return 200 OK with no samples.
  • Per-listener gating — a new endpoint_api flag on HttpRoutesEnabled controls whether the route is mounted. APIs are not bound to a specific listener; the route is served on every listener that sets the flag. Wired through orchestratord, the test harness, the sqllogictest runner, and the no_auth listener config.
  • Scrape-time RBAC — each request runs as the role that authenticated the HTTP request, not the API owner. That role must hold USAGE on the API and SELECT on each metric's relation. Enforcement reuses the plan-path gating logic via a new rbac::is_rbac_enforced_for_session helper, so the endpoint's checks match CREATE-time validation exactly (rather than re-deriving the superuser / RBAC-disabled / system-role rules and risking drift). Missing USAGE on the API returns 404; missing SELECT on a relation returns 403.

Verification

  • environmentd server tests (src/environmentd/tests/server.rs) cover the happy path, empty exposures, and the authorization paths (USAGE-on-API and SELECT-on-relation), including the unauthenticated anonymous_http_user case on a None-authenticator listener.
  • This PR's tip is the full feature; the branch builds and the new tests pass.

@alex-hunt-materialize alex-hunt-materialize force-pushed the prometheus_sql_metrics branch 3 times, most recently from af8c295 to 232e2da Compare June 1, 2026 10:49
@alex-hunt-materialize alex-hunt-materialize marked this pull request as ready for review June 1, 2026 12:21
@alex-hunt-materialize alex-hunt-materialize requested review from a team and ggevay as code owners June 1, 2026 12:21

@def- def- left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you create a metric that isn't a valid Prometheus identifier we panic:

diff --git a/test/http-auth/mzcompose.py b/test/http-auth/mzcompose.py
index f533adfb6d..f5e05c4820 100644
--- a/test/http-auth/mzcompose.py
+++ b/test/http-auth/mzcompose.py
@@ -7,6 +7,10 @@
 # the Business Source License, use of this software will be governed
 # by the Apache License, Version 2.0.

+import time
+from dataclasses import dataclass
+
+import psycopg
 import requests

 from materialize import MZ_ROOT
@@ -17,6 +21,62 @@ SERVICES = [
     Materialized(),
 ]

+# The canonical no-auth CI listener config enables the `endpoint_api` route
+# (`/metrics/custom`) on the external HTTP listener (6876), like CI/orchestratord.
+NO_AUTH_LISTENERS = f"{MZ_ROOT}/src/materialized/ci/listener_configs/no_auth.json"
+
+
+@dataclass
+class MetricTrigger:
+    case: str  # test-case suffix + unique object-name seed
+    view_body: str  # the view's non-value columns become the metric's labels
+    metric_ident: str  # becomes the Prometheus metric name verbatim
+    help_text: str  # the `HELP '...'` literal
+    why: str
+
+
+# `CREATE METRIC` accepts each of these, but the resulting Prometheus `Desc` is
+# invalid, so `MetricsRegistry::register`'s `.expect("defining a gauge vec")`
+# (src/ore/src/metrics.rs) panics on the next scrape. All views expose `count`.
+METRIC_TRIGGERS = [
+    MetricTrigger(
+        "invalid_metric_name",
+        "SELECT 1::int8 AS count",
+        '"http-requests"',
+        "h",
+        "a hyphen in the metric name",
+    ),
+    MetricTrigger(
+        "invalid_label_name",
+        "SELECT 1::int8 AS count, 'ok'::text AS \"2xx\"",
+        "responses",
+        "h",
+        "a label name starting with a digit (2xx)",
+    ),
+    MetricTrigger(
+        "empty_help",
+        "SELECT 1::int8 AS count",
+        "leads",
+        "",
+        "empty HELP text",
+    ),
+]
+
+
+def environmentd_alive(c: Composition, timeout: float = 20.0) -> bool:
+    """Whether environmentd answers `SELECT 1`. On the buggy path it aborted and,
+    with no restart policy, stays down — so this exhausts `timeout`."""
+    deadline = time.monotonic() + timeout
+    while True:
+        try:
+            if c.sql_query("SELECT 1", reuse_connection=False)[0][0] == 1:
+                return True
+        except Exception:
+            pass
+        if time.monotonic() >= deadline:
+            return False
+        time.sleep(1)
+

 def workflow_default(c: Composition) -> None:
     with c.override(
@@ -98,3 +158,65 @@ def workflow_default(c: Composition) -> None:
                 "permission denied"
                 in r.json()["results"][0]["error"]["message"].lower()
             )
+
+    # `handle_metrics_custom` registers a Prometheus gauge vector per metric via
+    # `MetricsRegistry::register`, which panics — rather than 500s — on an invalid
+    # `Desc` (see METRIC_TRIGGERS). environmentd's enhanced panic handler turns
+    # that into `process::abort()`, so one scrape takes the whole process down,
+    # and re-aborts on every later scrape since the bad metric lives in the
+    # catalog. The `no_auth` listener exposes `endpoint_api` like CI/orchestratord.
+    # Invariant (fix-agnostic): such a scrape must not abort environmentd — a
+    # plan-time rejection or a graceful HTTP error are both fine.
+    with c.override(Materialized(listeners_config_path=NO_AUTH_LISTENERS)):
+        for trig in METRIC_TRIGGERS:
+            with c.test_case(f"metrics_custom_{trig.case}"):
+                # Boot, or recover from the previous case's abort (env boots fine
+                # with a bad metric in the catalog; it only crashes on scrape).
+                c.up("materialized")
+                assert environmentd_alive(c), f"environmentd down before {trig.case}"
+
+                # RBAC is irrelevant to the panic; disabling it lets the anonymous
+                # HTTP user reach registration without grants. Re-applied (with
+                # idempotent drops) so each case is independent of a crash+restart.
+                c.sql(
+                    "ALTER SYSTEM SET enable_prometheus_metrics_api = true;"
+                    " ALTER SYSTEM SET enable_rbac_checks = false;",
+                    user="mz_system",
+                    port=6877,
+                    print_statement=False,
+                )
+                view, api = f"v_{trig.case}", f"api_{trig.case}"
+                c.sql(
+                    f"DROP API IF EXISTS materialize.public.{api} CASCADE;"
+                    f" DROP VIEW IF EXISTS materialize.public.{view} CASCADE;"
+                    f" CREATE VIEW materialize.public.{view} AS {trig.view_body};"
+                    f" CREATE API materialize.public.{api} FORMAT PROMETHEUS;",
+                    print_statement=False,
+                )
+                try:
+                    c.sql(
+                        f"CREATE METRIC materialize.public.{trig.metric_ident}"
+                        f" IN API materialize.public.{api} AS (TYPE 'gauge',"
+                        f" HELP '{trig.help_text}',"
+                        f" VALUES FROM materialize.public.{view},"
+                        f" VALUE COLUMN 'count')",
+                        print_statement=False,
+                    )
+                except psycopg.Error as e:
+                    # Rejected at plan time — the dangerous object never exists.
+                    print(f"[{trig.case}] rejected at CREATE METRIC ({trig.why}): {e}")
+                    continue
+
+                # On the buggy path the abort resets the connection mid-request.
+                url = (
+                    f"http://localhost:{c.port('materialized', 6876)}"
+                    f"/metrics/custom/materialize/public/{api}"
+                )
+                try:
+                    requests.get(url, timeout=30)
+                except requests.exceptions.RequestException:
+                    pass
+                assert environmentd_alive(c), (
+                    f"environmentd aborted after scraping a metric with {trig.why}"
+                    " — handle_metrics_custom panicked in MetricsRegistry::register"
+                )

Fails with:

thread 'tokio:work-6' panicked at src/ore/src/metrics.rs:321:41:
defining a gauge vec: Msg("'http-requests' is not a valid metric name")
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: <prometheus::vec::MetricVec<prometheus::gauge::GaugeVecBuilder<prometheus::atomic64::AtomicF64>> as mz_ore::metrics::MakeCollector>::make_collector
   8: <mz_ore::metrics::MetricsRegistry>::register::<prometheus::vec::MetricVec<prometheus::gauge::GaugeVecBuilder<prometheus::atomic64::AtomicF64>>>
   9: <alloc::collections::btree::map::entry::Entry<alloc::string::String, prometheus::vec::MetricVec<prometheus::gauge::GaugeVecBuilder<prometheus::atomic64::AtomicF64>>>>::or_insert_with::<mz_environmentd::http::sql::handle_metrics_custom::{closure#0}::{closure#4}>
  10: <mz_environmentd::http::sql::handle_metrics_custom as axum::handler::Handler<(axum_core::extract::private::ViaParts, mz_environmentd::http::AuthedClient, axum::extract::path::Path<(alloc::string::String, alloc::string::String, alloc::string::String)>), ()>>::call::{closure#0}
  11: <futures_util::future::future::map::Map<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = http::response::Response<axum_core::body::Body>> + core::marker::Send>>, fn(http::response::Response<axum_core::body::Body>) -> core::result::Result<http::response::Response<axum_core::body::Body>, core::convert::Infallible>> as core::future::future::Future>::poll
  12: <axum::util::MapIntoResponseFuture<axum::handler::future::IntoServiceFuture<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = http::response::Response<axum_core::body::Body>> + core::marker::Send>>>> as core::future::future::Future>::poll
  13: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
  14: <axum::util::MapIntoResponseFuture<axum::routing::route::RouteFuture<core::convert::Infallible>> as core::future::future::Future>::poll
  15: <axum::middleware::from_fn::Next>::run::{closure#0}
  16: mz_environmentd::http::http_auth::{closure#0}
  17: <axum::middleware::from_fn::FromFn<<mz_environmentd::http::HttpServer>::new::{closure#0}, (), axum::routing::route::Route, (http::request::Request<axum_core::body::Body>,)> as tower_service::Service<http::request::Request<axum_core::body::Body>>>::call::{closure#0}
  18: <axum::middleware::from_fn::ResponseFuture as core::future::future::Future>::poll
  19: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::middleware::from_fn::ResponseFuture>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  20: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::middleware::from_fn::ResponseFuture, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  21: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
  22: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  23: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  24: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
  25: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  26: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  27: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
  28: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  29: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  30: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
  31: <mz_environmentd::http::metrics::PrometheusFuture<axum::routing::route::RouteFuture<core::convert::Infallible>> as core::future::future::Future>::poll
  32: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<mz_environmentd::http::metrics::PrometheusFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  33: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<mz_environmentd::http::metrics::PrometheusFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  34: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
  35: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<axum::routing::route::RouteFuture<core::convert::Infallible>>, futures_util::fns::MapErrFn<<core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  36: <axum::util::MapIntoResponseFuture<tower::util::map_err::MapErrFuture<axum::routing::route::RouteFuture<core::convert::Infallible>, <core::convert::Infallible as core::convert::Into<core::convert::Infallible>>::into>> as core::future::future::Future>::poll
  37: <tower::util::oneshot::Oneshot<tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>, http::response::Response<axum_core::body::Body>, core::convert::Infallible>, http::request::Request<axum_core::body::Body>> as core::future::future::Future>::poll
  38: <hyper::proto::h1::dispatch::Dispatcher<hyper::proto::h1::dispatch::Server<hyper::service::util::ServiceFn<<mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection<tokio_metrics::task::TaskIntervals>::{closure#0}::{closure#1}, hyper::body::incoming::Incoming>, hyper::body::incoming::Incoming>, axum_core::body::Body, hyper_openssl::client::legacy::MaybeHttpsStream<hyper_util::rt::tokio::TokioIo<mz_server_core::Connection>>, hyper::proto::h1::role::Server>>::poll_catch
  39: <hyper::server::conn::http1::UpgradeableConnection<hyper_openssl::client::legacy::MaybeHttpsStream<hyper_util::rt::tokio::TokioIo<mz_server_core::Connection>>, hyper::service::util::ServiceFn<<mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection<tokio_metrics::task::TaskIntervals>::{closure#0}::{closure#1}, hyper::body::incoming::Incoming>> as core::future::future::Future>::poll
  40: <futures_util::future::future::map::Map<futures_util::future::try_future::into_future::IntoFuture<hyper::server::conn::http1::UpgradeableConnection<hyper_openssl::client::legacy::MaybeHttpsStream<hyper_util::rt::tokio::TokioIo<mz_server_core::Connection>>, hyper::service::util::ServiceFn<<mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection<tokio_metrics::task::TaskIntervals>::{closure#0}::{closure#1}, hyper::body::incoming::Incoming>>>, futures_util::fns::MapErrFn<futures_util::fns::IntoFn<anyhow::Error>>> as core::future::future::Future>::poll
  41: <mz_environmentd::http::HttpServer as mz_server_core::Server>::handle_connection::<tokio_metrics::task::TaskIntervals>::{closure#0}
  42: <tokio_metrics::task::Instrumented<mz_server_core::serve<mz_environmentd::http::HttpServer, core::pin::Pin<alloc::boxed::Box<dyn mz_server_core::ConnectionStream<Item = core::result::Result<tokio::net::tcp::stream::TcpStream, std::io::error::Error>>>>>::{closure#0}::{closure#3}> as core::future::future::Future>::poll
  43: <tracing::instrument::Instrumented<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>> as core::future::future::Future>::poll
  44: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll
  45: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future<Output = ()> + core::marker::Send>>>, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::poll
  46: <tokio::runtime::scheduler::multi_thread::worker::Context>::run_task
  47: <tokio::runtime::scheduler::multi_thread::worker::Context>::run
  48: tokio::runtime::context::runtime::enter_runtime::<tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}, ()>
  49: tokio::runtime::scheduler::multi_thread::worker::run
  50: <tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>> as core::future::future::Future>::poll
  51: <tokio::runtime::task::core::Core<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
  52: <tokio::runtime::task::harness::Harness<tracing::instrument::Instrumented<tokio::runtime::blocking::task::BlockingTask<<tokio::runtime::scheduler::multi_thread::worker::Launch>::launch::{closure#0}>>, tokio::runtime::blocking::schedule::BlockingSchedule>>::poll
  53: <tokio::runtime::task::UnownedTask<tokio::runtime::blocking::schedule::BlockingSchedule>>::run
  54: <tokio::runtime::blocking::pool::Inner>::run

@alex-hunt-materialize alex-hunt-materialize marked this pull request as draft June 2, 2026 13:07
@alex-hunt-materialize alex-hunt-materialize force-pushed the prometheus_sql_metrics branch 4 times, most recently from 558f4df to 004d0f8 Compare June 2, 2026 15:11
@alex-hunt-materialize alex-hunt-materialize marked this pull request as ready for review June 2, 2026 17:14
alex-hunt-materialize and others added 6 commits June 3, 2026 13:19
Add grammar, AST nodes, and parser test coverage for the new
Prometheus-metrics DDL:

* CREATE API <name> FORMAT PROMETHEUS [IN CLUSTER <c>]
* DROP API <name>
* CREATE METRIC <name> IN API <api> AS (TYPE, HELP, VALUES FROM, VALUE COLUMN)
* DROP METRIC <name>
* GRANT/REVOKE ... ON API <name>

Adds the Api, Help, Listener, Metric, and Prometheus keywords. No
planning or execution yet; statements parse but are not accepted by the
planner.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Introduce Api and Metric as first-class catalog items so CREATE API /
CREATE METRIC can persist:

* New CatalogItem::Api / CatalogItem::Metric in-memory objects, durable
  protobuf objects (objects_v86), and the v85->v86 upgrade migration.
* mz_catalog builtin entries (mz_apis, mz_metrics) and OID allocations.
* parse_catalog_create_sql learns to render CREATE API / CREATE METRIC
  back to JSON (stamping cluster_id / api_id) for catalog introspection.
* audit-log event types for the new items.

This only defines the catalog plumbing; nothing creates these items yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the planning and execution path for the new statements:

* Plan/Sequence: plan_create_api allocates the qualified name, resolves
  the cluster (stamping the resolved cluster_id into the AST so persisted
  SQL rehydrates after ALTER CLUSTER RENAME). plan_create_metric resolves
  the IN API target (must be a CatalogItemType::Api), validates VALUES FROM
  is a relation containing a numeric VALUE COLUMN, and enforces TYPE = gauge
  for now. The sequencer routes both into transact.
* Name resolution: the Aug pass resolves "IN API <api>" and "VALUES FROM
  <view>" to CatalogItemIds, so the planner relies on resolved targets and
  the dependency tracker records Metric -> {Api, View} edges.
* RBAC: CREATE API requires USAGE on the cluster (not CREATE); CREATE METRIC
  requires USAGE on the API and SELECT on the VALUES FROM relation. GRANT/
  REVOKE ... ON API is wired through acl planning. DROP follows standard
  item RBAC.
* Feature flag: both statements are gated on enable_prometheus_metrics_api
  (LD-controlled), off by default.
* sqllogictest coverage for parsing, planning errors, RBAC, dependency
  enforcement, and round-trip.

Also handles the new ExecuteResponse::Created{Api,Metric} variants in the
HTTP SQL result match so environmentd still compiles; the metrics endpoint
that serves these objects lands separately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A metric's name, HELP text, and label names (its relation's non-value
columns) are handed verbatim to Prometheus at scrape time, where
`MetricsRegistry::register` *panics* on an invalid identifier or empty
HELP. Validate at CREATE METRIC time instead, by constructing a real
`prometheus::Desc` so the rules can never drift from what `register`
accepts.

A column added to a table that feeds a metric becomes a new label, so
ALTER TABLE ADD COLUMN revalidates the new column name the same way
(a freshly added column is always a label, never the pre-existing value
column). MV apply-replacement needs no such check: replacements are
required to be schema-identical to their target, so a metric's labels and
value column can't change underneath it; a comment notes this for any
future schema-evolution work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the HTTP endpoint that exposes APIs created with CREATE API:

* GET /metrics/custom/{database}/{schema}/{name} resolves the API from the
  catalog, peeks every metric's VALUES FROM relation on the API's cluster,
  and renders the Prometheus text exposition format.
* The route is gated per-listener by a new endpoint_api flag on
  HttpRoutesEnabled, wired through orchestratord, the test harness, the
  sqllogictest runner, and the no_auth listener config.
* Scrape-time RBAC: each request runs as the authenticated role and must
  hold USAGE on the API and SELECT on each metric relation. Enforcement
  reuses plan-path gating via the new rbac::is_rbac_enforced_for_session
  helper, so the endpoint's checks match CREATE-time validation exactly.
* environmentd server tests cover happy path, empty exposures, and the
  USAGE-on-API / SELECT-on-relation authorization paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`handle_metrics_custom` registers a Prometheus gauge vector per metric;
plain `register` panics on an invalid `Desc` (an invalid metric or label
name, empty HELP) and on a failed registry insertion. CREATE METRIC and
ALTER TABLE now reject names/labels/HELP that Prometheus would refuse, but
a metric's relation can still change shape underneath it (e.g. a source
schema change), so the scrape path must not panic on a bad metric.

Add a fallible `MetricsRegistry::try_register` (backed by a new
`MakeCollector::try_make_collector`) that returns the underlying
`prometheus::Error` instead of panicking, and use it in
`handle_metrics_custom` to skip the offending metric with a warning rather
than aborting the whole scrape. `register` now delegates to `try_register`
and keeps its panic-on-invalid contract for statically-defined metrics.

Adds an http-auth regression test asserting that scraping such a metric
never takes environmentd down.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alex-hunt-materialize

Copy link
Copy Markdown
Contributor Author

@def- Thanks for catching that. The fix is split over an additional commit in each PR, since some is SQL and some is serving.

@justjake

justjake commented Jun 8, 2026

Copy link
Copy Markdown

We'd love to have this at Notion Labs, Inc

}
};
metrics.push(ResolvedMetric {
name: entry.name().item.clone(),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend namespacing these a little bit.
Alloy (as an example of another thing that supports custom metrics) puts theirs behind loki_process_custom_. I think we should do a similar mz_custom_ prefix.

// for the first Err and bubble it up.
for r in results {
if let SqlResult::Err { error, .. } = r {
let status = if error.code == "42501" {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a more maintainable way to handle this check?

Namespace user-defined metrics in the Prometheus exposition so they
cannot collide with Materialize's built-in mz_* metrics. The prefix is
injected at scrape time in handle_metrics_custom; the user-supplied name
is unchanged in the catalog and mz_metrics.

No extra validation needed: the prometheus crate validates names by
charset only (no length limit), and mz_custom_ is all valid characters,
so a valid user name stays valid once prefixed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants