
Doc: Design doc for unified per-object arrangement sizes #35884

Open
leedqin wants to merge 2 commits into MaterializeInc:main from leedqin:unified-object-arrangement-size-design-doc

Conversation

@leedqin
Contributor

@leedqin leedqin commented Apr 6, 2026

Introduces a design for two new system catalog objects in mz_internal: mz_object_arrangement_sizes (live, differential) and mz_object_arrangement_size_history (append-only, 7-day retention).

The live table uses the introspection subscribe pattern to aggregate per-object arrangement memory from all replicas without session variables. The history table follows the mz_storage_usage_by_shard collection and pruning pattern for time-range queries.

Part of CNS-42
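As a usage sketch (the column names below are assumptions based on this description, not a finalized schema), the two proposed objects could support queries like:

```sql
-- Hypothetical queries against the proposed objects; column names
-- (replica_id, size, occurred_at) are illustrative assumptions.

-- Live per-object arrangement sizes across all replicas:
SELECT object_id, size
FROM mz_internal.mz_object_arrangement_sizes
ORDER BY size DESC;

-- 7-day trend for one object, with cluster attribution recovered by
-- joining the replica history rather than denormalizing cluster_id:
SELECT h.occurred_at, h.object_id, r.cluster_name, h.size
FROM mz_internal.mz_object_arrangement_size_history AS h
JOIN mz_internal.mz_cluster_replica_history AS r USING (replica_id)
WHERE h.object_id = 'u42';
```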


@github-actions
Contributor

github-actions bot commented Apr 6, 2026

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@leedqin leedqin requested review from DAlperin, antiguru and teskje April 6, 2026 22:25
Comment on lines +100 to +102
The `cluster_id` is denormalized into the history table because replicas can be dropped
while their historical rows are still within the retention window. Without it, those
rows become unattributable to a cluster.
Contributor


How would the cluster_id be used? Is it useful to have it without additional cluster information, like the name?

Also, we have mz_cluster_replica_history, which has both the cluster_id and the cluster_name for each replica that ever existed. Wouldn't that satisfy the need for attribution?

Contributor Author


Yes, that would! Updated my design doc to use mz_cluster_replica_history.


The history table follows the `mz_storage_usage_by_shard` collection and pruning pattern:
Contributor


Note that the current storage usage collection code is known to scale badly and there are plans to change it: #35436. Just pointing this out to make sure we follow the fixed implementation, not the old implementation.

Comment on lines +146 to +161
SELECT
  ce.export_id AS object_id,
  GREATEST(10485760, (SUM(raw.size) / 10485760 * 10485760))::int8 AS size
FROM mz_introspection.mz_compute_exports AS ce
JOIN mz_introspection.mz_dataflow_operator_dataflows AS dod
  ON dod.dataflow_id = ce.dataflow_id
JOIN (
  SELECT operator_id, COUNT(*) AS size
  FROM (
    SELECT operator_id FROM mz_introspection.mz_arrangement_heap_size_raw
    UNION ALL
    SELECT operator_id FROM mz_introspection.mz_arrangement_batcher_size_raw
  ) combined
  GROUP BY operator_id
) AS raw ON raw.operator_id = dod.id
GROUP BY ce.export_id
Contributor


Given that this will run on each replica, we should spend some time optimizing the plan as much as we can. Some things that come to mind:

  • mz_dataflow_operator_dataflows is a view with a join. We might be able to avoid some work by inlining it and removing parts we don't need.
  • Consider converting the subquery into a CTE, to give the optimizer an easier time.
  • I don't think we should need two GROUP BYs, one on the export_id should be enough.
  • Probably want to use query hints to limit memory usage.

Member


Should merge #35889 to fix some pathological optimizer behavior!

Contributor Author


Added these changes! Thank you for the suggestions. The optimized plan goes from 5 index reads to 4 (eliminated mz_dataflow_operators_per_worker), from a 6-way join to a 3-way join, and from 2 reduces to 1. Added OPTIONS (AGGREGATE INPUT GROUP SIZE = 1000) for memory bounds.
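Based on that description, the post-review query might be shaped roughly as below. Only the CTE structure, the single GROUP BY, and the hint come from the comment; the address-based join is an assumed inlining, not the exact query from the PR:

```sql
-- Hedged sketch of the post-review shape: raw unions lifted into a CTE,
-- a single reduce keyed on export_id, and the author's memory-bound hint.
-- The join on address[1] is an assumed inlining of
-- mz_dataflow_operator_dataflows; column details may differ from the PR.
WITH raw AS (
    SELECT operator_id FROM mz_introspection.mz_arrangement_heap_size_raw
    UNION ALL
    SELECT operator_id FROM mz_introspection.mz_arrangement_batcher_size_raw
)
SELECT
    ce.export_id AS object_id,
    GREATEST(10485760, (COUNT(*) / 10485760 * 10485760))::int8 AS size
FROM mz_introspection.mz_compute_exports AS ce
JOIN mz_introspection.mz_dataflow_addresses_per_worker AS addr
    ON addr.address[1] = ce.dataflow_id
JOIN raw ON raw.operator_id = addr.id
GROUP BY ce.export_id
OPTIONS (AGGREGATE INPUT GROUP SIZE = 1000)
```

Folding both reduces into one COUNT(*) keyed on export_id is what takes the plan from 2 reduces to 1.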

Comment on lines +170 to +175
The `GREATEST(10MB, rounded)` expression serves two purposes: the integer division
(`/ 10485760 * 10485760`) rounds to 10MB boundaries, suppressing byte-level differential
churn that would otherwise propagate on every minor allocation. The `GREATEST` sets a
floor so that objects under 10MB still appear as 10MB rather than rounding to zero and
disappearing from results. (The POC uses the quantization without the `GREATEST` floor;
adding the floor is part of the remaining work.)
Contributor


Avoiding churn on tiny changes seems great to me! How was the 10MB chosen?

Contributor Author


Just used a rough heuristic:

  • Too small (e.g., 1KB) → almost no churn reduction, defeats the purpose
  • Too large (e.g., 100MB) → hides real changes, objects under 100MB become invisible
  • 10MB felt like a reasonable middle ground — most meaningful memory consumers are well above 10MB, and typical allocation noise is well below it
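For concreteness, the quantization arithmetic (illustrative values; `10485760` is 10MB) behaves like:

```sql
-- Integer division rounds each size down to a 10MB boundary; GREATEST
-- keeps sub-10MB objects visible instead of rounding them to zero.
SELECT
    size,
    GREATEST(10485760, size / 10485760 * 10485760) AS quantized
FROM (VALUES
    (3145728),          -- 3MB: rounds to 0, the floor lifts it to 10MB
    (26214400),         -- 25MB: rounds down to 20MB
    (26214400 + 4096)   -- 25MB plus one page: same 20MB bucket, no churn
) AS sizes(size);
```

The last two rows landing in the same bucket is the churn suppression: a small allocation no longer produces a differential update.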

Comment on lines +225 to +227
- **Dataflow memory overhead.** Subscribe adds a dataflow to every replica. Same joins
as `mz_dataflow_arrangement_sizes`, so overhead is known and observable via existing
per-dataflow introspection.
Contributor


I don't follow the "overhead is known and observable" part. mz_dataflow_arrangement_sizes is a view without an index, so it doesn't have overhead except when it's queried.

Contributor Author


Right, that view has no persistent overhead. Reworded to describe the subscribe's overhead directly instead.

Comment on lines +234 to +235
- **Stale replica cleanup.** Replicas dropped while environmentd is down leave orphaned
rows. Startup task compares against `mz_cluster_replicas` and retracts mismatches.
Contributor


How can replicas be dropped while envd is down?

Contributor Author


Right, I misunderstood that! Removed the live-table startup cleanup; it now only prunes expired history rows.

- **History table growth.** ~1.7M rows at scale (1000 objects × 10 replicas × 168
snapshots). Bounded by 7-day retention with startup pruning.
- **Silent feature disable.** No data if `ENABLE_INTROSPECTION_SUBSCRIBES = false`.
Repopulates automatically when re-enabled.
Contributor


A thing missing from the list: the hourly collection only sees snapshots, so (a) it misses short memory spikes and (b) it may miss entire replicas if they were created and dropped between two collections.

Comment on lines +290 to +292
**Reasons not chosen:** Arrangement sizes are not in compute's Prometheus registry. Adding
them would require compute-side changes to register per-object metrics. The introspection
subscribe derives the same data from existing log sources with no compute changes.
Contributor


Note that as part of @SangJunBak's metrics work we might want to add arrangement sizes to Prometheus metrics. At least arrangement sizes are part of the suggested customer-facing metrics and I wouldn't want us to run SQL queries on every scrape (especially not ones to compute introspection).

Contributor Author


Added this as a note about a more efficient future solution! I'll keep an eye out and possibly migrate the maintained objects to that approach when it lands.

Comment on lines +296 to +298
1. **Naming.** The unified source is named `mz_object_arrangement_sizes` in `mz_internal`
schema. The existing view with the same name is in `mz_introspection` schema. Different
schemas, so no conflict, but could cause confusion. Should we use a different name?
Contributor


I don't think there is an existing mz_object_arrangement_sizes view!

Contributor Author


This is left over from an earlier draft of the design doc, which I drafted with Claude based on my POC.

Comment on lines +300 to +302
2. **Collection interval tuning.** The 1-hour default balances history granularity against
table growth. Should this be shorter (e.g., 15 minutes) for environments that need
finer-grained trending?
Contributor


Should definitely be a dyncfg. We'll need that for testing anyway.

leedqin added 2 commits April 13, 2026 16:12
Introduces a design for two new system catalog objects in mz_internal:
mz_object_arrangement_sizes (live, differential) and
mz_object_arrangement_size_history (append-only, 7-day retention).

The live table uses the introspection subscribe pattern to aggregate
per-object arrangement memory from all replicas without session variables.
The history table follows the mz_storage_usage_by_shard collection and
pruning pattern for time-range queries.

Part of CNS-42.
- Remove cluster_id from history schema; use mz_cluster_replica_history join instead
- Optimize subscribe query: inline mz_dataflow_addresses_per_worker directly, single GROUP BY, CTE structure, AGGREGATE INPUT GROUP SIZE hint
- Clarify startup cleanup: only prune expired history rows; live table handled by deferred_write
- Rewrite pitfalls: dataflow overhead, staleness detection, hourly snapshot gaps, replica lifecycle
- Use dyncfgs for collection interval and retention period instead of system variables
- Fix stale cluster_id references in usage examples and validation plan
@leedqin leedqin force-pushed the unified-object-arrangement-size-design-doc branch from 69962d4 to 860374b Compare April 14, 2026 03:11
@leedqin leedqin requested a review from teskje April 14, 2026 03:21