docs(MADR): add MADR 101 for status syncing rules in multizone#16083
docs(MADR): add MADR 101 for status syncing rules in multizone#16083lahabana wants to merge 1 commit intokumahq:masterfrom
Conversation
Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056, 072, and 096 into explicit rules for resource status ownership and syncing in multizone deployments. Signed-off-by: Charly Molter <charly.molter@konghq.com>
There was a problem hiding this comment.
Pull request overview
Adds MADR 101 to consolidate and make explicit the project’s multizone conventions around resource status ownership, computation, and KDS syncing, so future resource/API work follows a consistent single-writer model and avoids cross-zone status overwrites.
Changes:
- Introduces a unified set of rules: status is zone-local, computed only on zone CPs, and not synced onward cross-zone.
- Documents the existing implementation mechanics that enforce this (global→zone status stripping + zone-side ignore/preserve behavior).
- Clarifies how to provide global-level visibility without storing/aggregating global-computed status (computed endpoints vs KDS reverse-unary RPC forwarding).
Reviewer Checklist🔍 Each of these sections need to be checked by the reviewer of the PR 🔍:
|
| 4. Computing status on global would be costly and scale in O(number of entities globally) | ||
| which we try to avoid. | ||
|
|
||
| Previous MADRs addressed this: |
There was a problem hiding this comment.
I think the list above and below is missing 014 and 072.
| **independently** computing or aggregating status: | ||
|
|
||
| Global CP MUST NOT: | ||
| - Independently compute or populate status fields on resources |
There was a problem hiding this comment.
| - Independently compute or populate status fields on resources | |
| - Independently compute or populate status fields on resources (passively storing zone-synced data via KDS is fine) |
| dedicated API. Data that needs to travel across zones belongs in `spec`. | ||
|
|
||
| These rules consolidate and strengthen existing conventions from MADRs 014, 039, | ||
| 044, 051, 056, 072, and 096. |
There was a problem hiding this comment.
What does enforcement look like? The guards already exist in code but they're ad-hoc per component. Is code review sufficient or should there be a systematic check?
| 1. Resources are authored on one control plane (zone or global) and synced to | ||
| others via KDS. | ||
| 2. Status contains information that is inherently local to the zone where the |
There was a problem hiding this comment.
These examples (VIPs, hostnames, proxy counts, availability) seem MeshService-specific - what about the other resources with HasStatus: true (MeshIdentity, MeshOpenTelemetryBackend, Workload, MeshTrust, MeshTrace, MeshMetric, MeshAccessLog, HostnameGenerator)? Should the framing be broader to cover those cases or is this MADR only about Mesh*Service?
| Across multiple MADRs (039, 044, 051, 056, 096) we have established conventions | ||
| for how resource `status` is handled in multizone deployments. These conventions | ||
| are scattered and implicit. This MADR consolidates the rules into a single | ||
| reference and strengthens them with an explicit prohibition on computing status | ||
| on Global CP. |
There was a problem hiding this comment.
The context says these conventions are "scattered and implicit" - are there implicit ones in the code that aren't captured by any of the referenced MADRs?
Looking at the codebase:
- Every status updater and VIP allocator is gated by
rt.GetMode() == config_core.Zone(andpkg/ipam/components.go). IsLocalMeshService()filtering in the status updater
Is it worth mentioning these as existing enforcement mechanisms, or at least acknowledging them so future contributors know the guards already exist?
| (`pkg/kds/v2/store/sync.go`) ensures that locally-computed status is preserved | ||
| and not overwritten by the empty status arriving from global. | ||
|
|
||
| **Implication: global-originated resources have no status on global.** Since |
There was a problem hiding this comment.
Does it make sense to compute it for some resources? For MeshService-style status (VIPs, hostnames) the answer is clear - global lacks the zone-local inputs.
Is the same true for MeshOpenTelemetryBackend, MeshIdentity etc.?
Was this considered?
| resource instance's status. | ||
|
|
||
| Note that computed spec fields like `Spec.Identities` and `Spec.State` (see | ||
| Rule 4) are also written by the local zone's status updater, not by the user. |
There was a problem hiding this comment.
A bit related: should we consider rename since it's no longer "status updater"? computed fields reconciler?
| This separation eliminates race conditions between spec updates from the origin | ||
| and status updates on the destination (the scenario described in MADR 056). | ||
|
|
||
| ### Rule 4: Data that must cross zone boundaries belongs in spec |
There was a problem hiding this comment.
Rule 4 formalizes "computed spec fields" as a concept but there's no structural way to distinguish them from user-authored spec fields. Spec.Identities and Spec.State sit next to user-authored fields. Should we come up with a naming pattern for this?
|
|
||
| #### Choosing between approaches | ||
|
|
||
| | | Approach A (recompute on global) | Approach B (KDS RPC to zone) | |
There was a problem hiding this comment.
I think I'm leaning towards A
Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056, 072, and 096 into explicit rules for resource status ownership and syncing in multizone deployments.