docs(MADR): add MADR 101 for status syncing rules in multizone #16083
# Resource status ownership and syncing rules in multizone

* Status: accepted

Technical Story: none

## Context and Problem Statement

Across multiple MADRs (039, 044, 051, 056, 096) we have established conventions
for how resource `status` is handled in multizone deployments. These conventions
are scattered and implicit. This MADR consolidates the rules into a single
reference and strengthens them with an explicit prohibition on computing status
on Global CP.

The core tension is:

1. Resources are authored on one control plane (zone or global) and synced to
   others via KDS.
2. Status contains information that is inherently local to the zone where the
   resource is consumed (VIPs, hostnames, proxy counts, availability).
3. Syncing status cross-zone would overwrite locally-computed values, causing
   traffic interruptions and inconsistencies.
4. Computing status on global would be costly, scaling as O(number of entities
   globally), which we try to avoid.

Previous MADRs addressed this:

| MADR | Decision |
|------|----------|
| 039 (MeshService API) | Status holds VIPs, hostnames, proxy counts, availability. Managed by CP. |
| 051 (MeshService multizone) | Status is NOT synced cross-zone. Each zone computes its own. |
| 056 (Identity sync) | Identity placed in `spec` (not `status`) to avoid partial status syncing. Motivated the single-writer model for spec vs status. |
| 044 (Zone-to-global policy sync) | Zone-originated policies sync to global for visibility only. |
| 096 (Ingress address sync) | `MeshZoneAddress` is status-less; address is in `spec`. |

## Design

### Rule 1: Status MUST be computed on zone CPs only

Status fields reflect zone-local state: VIPs allocated from the zone's CIDR
range, hostnames generated by the zone's `HostnameGenerator`, proxy counts from
dataplanes running in that zone, availability derived from local dataplane
health.

**Global CP MUST NOT compute or populate status fields on any resource.** Global
CP does not have the zone-local context required to produce correct values.

This applies to all resource types, including:

- `MeshService`
- `MeshMultiZoneService`
- `MeshExternalService`
- Any future resource with a `status` sub-resource

### Rule 2: Status MUST NOT be synced cross-zone

When a resource is synced from zone A to global and then to zone B, the status
from zone A MUST be stripped. Zone B computes its own status.

Status **does** flow from zone to global and is stored there for visibility
purposes (e.g., the GUI can display per-zone service status). This is analogous
to how zone-originated policies sync to global for visibility (MADR 044). The
status stored on global is always scoped to the originating zone; it is never
merged across zones.

However, when global syncs that resource onward to other zones, the status is
stripped by the `RemoveStatus()` mapper (`pkg/kds/context/context.go`), which is
applied as a blanket transformation to all resources with `HasStatus: true` sent
from global to zones. On the receiving zone, the `IgnoreStatusChange` sync option
(`pkg/kds/v2/store/sync.go`) ensures that locally-computed status is preserved
and not overwritten by the empty status arriving from global.
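
The stripping step can be illustrated with a minimal sketch. The `Resource`
type and `stripStatus` function below are simplified stand-ins: the real
mapper is `RemoveStatus()` in `pkg/kds/context/context.go` and operates on
Kuma's resource interfaces, not on this toy struct.

```go
package main

import "fmt"

// Hypothetical, simplified resource model for illustration only.
type Resource struct {
	Name      string
	Spec      map[string]string
	Status    map[string]string
	HasStatus bool
}

// stripStatus blanks the status on any resource that has one, so the
// receiving zone computes its own status (Rules 1 and 2).
func stripStatus(r Resource) Resource {
	if r.HasStatus {
		r.Status = nil
	}
	return r
}

func main() {
	r := Resource{
		Name:      "backend",
		Spec:      map[string]string{"port": "8080"},
		Status:    map[string]string{"vip": "240.0.0.1"}, // zone-local value
		HasStatus: true,
	}
	out := stripStatus(r)
	fmt.Println(out.Status == nil) // true: status never crosses to other zones
	fmt.Println(out.Spec["port"])  // 8080: spec is preserved
}
```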

**Implication: global-originated resources have no status on global.** Since
status is only computed on zones (Rule 1) and only zone-originated resources are
synced from zone to global with their status (as described above),
global-originated resources (e.g., `MeshMultiZoneService`) will not have status
on global; there is simply no one computing it there. Zone-originated resources
(e.g., `MeshService`) do have status on global because it arrives with the
resource during zone→global sync. To access the status of global-originated
resources, use the computed API approaches described in Rule 5.

### Rule 3: Single-writer model for spec and status

Each resource follows a single-control-plane-writer model: for any given
resource instance, `spec` and `status` are each written by exactly one CP.

| Field | Writer | Example |
|-------|--------|---------|
| `spec` | The originating control plane (zone or global) | Zone creates MeshService spec; global creates MeshMultiZoneService spec |
| `status` | The consuming zone CP | Each zone computes VIPs, hostnames, proxy stats independently |

"Single writer" here means a single CP deployment, not a single component.
Within a zone CP, multiple components may write to different parts of status
(e.g., the VIP allocator writes `status.VIPs`, the hostname generator writes
`status.Addresses`, the status updater writes `status.TLS` and
`status.DataplaneProxies`). These writes are coordinated via optimistic
concurrency (conflict retries). The key invariant is that no two CP deployments
write to the same resource instance's status.
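
A minimal sketch of the conflict-retry pattern, assuming a toy versioned store:
the `store`, `writeField`, and field names below are illustrative, not Kuma's
actual types.

```go
package main

import "fmt"

// Toy versioned record standing in for a stored resource's status.
type stored struct {
	version int
	status  map[string]string
}

type store struct{ r stored }

func (s *store) get() stored { return s.r }

// update succeeds only if the caller saw the latest version,
// mirroring optimistic concurrency in the resource store.
func (s *store) update(base stored, key, val string) bool {
	if base.version != s.r.version {
		return false // conflict: someone wrote in between
	}
	next := map[string]string{}
	for k, v := range s.r.status {
		next[k] = v
	}
	next[key] = val
	s.r = stored{version: s.r.version + 1, status: next}
	return true
}

// writeField re-reads and retries on conflict, as each zone CP
// component does for the status fields it owns.
func writeField(s *store, key, val string) {
	for {
		base := s.get()
		if s.update(base, key, val) {
			return
		}
	}
}

func main() {
	s := &store{r: stored{version: 1, status: map[string]string{}}}
	writeField(s, "VIPs", "240.0.0.1")         // e.g. the VIP allocator
	writeField(s, "Addresses", "backend.mesh") // e.g. the hostname generator
	fmt.Println(s.r.status["VIPs"], s.r.status["Addresses"])
}
```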

Note that computed spec fields like `Spec.Identities` and `Spec.State` (see
Rule 4) are also written by the local zone's status updater, not by the user.
This is consistent with the single-CP-writer model: for a local MeshService, the
originating zone writes both user-authored and computed spec fields.

This separation eliminates race conditions between spec updates from the origin
and status updates on the destination (the scenario described in MADR 056).

### Rule 4: Data that must cross zone boundaries belongs in spec

If a piece of information needs to be available on zones other than where it was
produced, it MUST be placed in `spec`, not `status`.

This means some `spec` fields are not user-authored but computed by the local
zone CP. These are sometimes called "computed spec fields." They live in `spec`
because they need to cross zone boundaries via normal KDS sync, but they are
written by the zone's status updater rather than by the user. The
single-CP-writer model (Rule 3) still holds: these fields are written by the
same CP that originates the resource.

Precedents:

- **Identity** (MADR 056): `MeshService.Spec.Identities` rather than
  `MeshService.Status.Identities`, because other zones need this for mTLS SAN
  verification. Computed by the zone status updater from matched Dataplane
  proxies and MeshIdentity resources.
- **State**: `MeshService.Spec.State` carries availability information computed
  by the zone status updater so that other zones and `MeshMultiZoneService` can
  use it for routing decisions (e.g., excluding zones with no healthy endpoints).
- **Ingress address** (MADR 096): `MeshZoneAddress` carries the address in
  `spec` (the resource is status-less).
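
As a minimal sketch of such a computed spec field: the `dataplane` type and
`computeState` helper below are hypothetical, and the real state values and
matching logic live in the zone status updater.

```go
package main

import "fmt"

// Hypothetical zone-local proxy record; only health matters here.
type dataplane struct{ healthy bool }

// computeState derives a spec-level availability value from zone-local
// proxy health, so other zones can use it for routing decisions after
// the spec is synced via KDS.
func computeState(proxies []dataplane) string {
	for _, p := range proxies {
		if p.healthy {
			return "Available"
		}
	}
	return "Unavailable"
}

func main() {
	fmt.Println(computeState([]dataplane{{healthy: false}, {healthy: true}}))
	fmt.Println(computeState(nil))
}
```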

### Rule 5: Global-level visibility of zone status requires a dedicated computed API

If there is a need to view or aggregate zone-specific status at the global
level (e.g., for a GUI dashboard showing cross-zone health), this MUST be
served by a dedicated read-only computed API endpoint that recomputes the
information on demand rather than storing it.

Note: zone-originated resources arrive on global with their status intact (see
Rule 2). Passively storing this zone-synced status is expected; it enables
visibility via the global API. The prohibitions below concern global CP
**independently** computing or aggregating status.

Global CP MUST NOT:

- Independently compute or populate status fields on resources
- Merge status from multiple zones into a single resource's status
- Compute a "global status" by aggregating synced data into stored resources

There are two approaches for exposing zone status at the global level, both
valid depending on the use case:

#### Approach A: Recompute on global from synced resources

When global CP already has enough synced data (specs, labels, zone-originated
resources) to derive the answer, the endpoint recomputes the result at request
time.

- Follow the MADR 072 convention: computed endpoints are prefixed with `_`
- Clearly indicate which zone each entry originates from
- Treat the result as an eventually-consistent view

**Existing example: the `_hostnames` endpoint**

`GET /meshes/{mesh}/{serviceType}/{name}/_hostnames` (where `serviceType` is
`meshservices`, `meshexternalservices`, or `meshmultizoneservices`) computes
hostnames on demand. It fetches all `HostnameGenerator` resources, evaluates
their Go templates against the requested service's metadata and labels, and
returns the generated hostnames with their zone associations. On Global CP it
tests both zone and global origin perspectives to capture all possible matches.
The endpoint does not store computed hostnames on global; results are
recomputed per request.
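
The template-evaluation core of this pattern can be sketched as follows. The
`service` struct, template shape, and `hostnames` helper are illustrative
assumptions, not the actual endpoint implementation, which also evaluates
generator selectors and both origin perspectives.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Illustrative service metadata a generator template might reference.
type service struct {
	Name string
	Zone string
	Mesh string
}

// hostnames evaluates each generator template at request time and
// stores nothing, matching the computed-endpoint contract.
func hostnames(templates []string, svc service) []string {
	var out []string
	for _, t := range templates {
		tmpl, err := template.New("hostname").Parse(t)
		if err != nil {
			continue // skip malformed generators
		}
		var buf bytes.Buffer
		if err := tmpl.Execute(&buf, svc); err != nil {
			continue
		}
		out = append(out, buf.String())
	}
	return out
}

func main() {
	svc := service{Name: "backend", Zone: "east", Mesh: "default"}
	fmt.Println(hostnames([]string{"{{.Name}}.{{.Zone}}.mesh.local"}, svc))
	// prints [backend.east.mesh.local]
}
```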

#### Approach B: Forward the request to the zone CP via KDS RPC

When the data can only be produced by the zone CP itself (e.g., it requires
access to zone-local state that is not synced to global), the global CP
forwards the request to the appropriate zone over the existing KDS bi-directional
stream using the reverse unary RPC mechanism (see MADR 014).

The flow is:

1. User sends a REST request to global CP
2. Global CP looks up `ZoneInsight` to find which global CP instance holds
   the KDS stream for the target zone
3. If the stream is on the local instance, global CP sends a request message
   over the KDS stream and waits for the response (matched by request ID)
4. If the stream is on another global CP instance, the request is forwarded
   via the inter-CP gRPC service (`InterCPEnvoyAdminForwardService`), which
   then sends it over KDS
5. The zone CP processes the request locally and sends the response back
   through the stream
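
The request-ID matching in step 3 can be sketched as follows. The `pending`
map and `dispatch` function are hypothetical simplifications of the
bookkeeping in `pkg/util/grpc/reverse_unary_rpcs.go`.

```go
package main

import "fmt"

// pending maps in-flight request IDs to the channels their callers
// wait on; callers register before sending over the shared stream.
type pending map[string]chan string

// dispatch routes a response arriving on the shared stream to the
// caller waiting on its request ID; unknown IDs are dropped.
func dispatch(p pending, requestID, payload string) {
	if ch, ok := p[requestID]; ok {
		ch <- payload
	}
}

func main() {
	p := pending{}
	ch := make(chan string, 1)
	p["req-1"] = ch // global CP registers before sending over KDS

	// The zone CP's response comes back on the same stream, tagged
	// with the original request ID.
	dispatch(p, "req-1", "config-dump")
	fmt.Println(<-ch) // prints config-dump
}
```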

**Existing example: Envoy admin data (XDS config dump, stats, clusters)**

The Envoy admin inspection endpoints forward requests to the zone CP that owns
the dataplane. The zone CP connects to the local `kuma-dp` proxy, retrieves the
Envoy admin data, and returns it through the KDS stream. This is implemented via
`GlobalKDSService.StreamXDSConfigs` / `StreamStats` / `StreamClusters` (defined
in `kds.proto`) with the reverse unary RPC pattern in
`pkg/util/grpc/reverse_unary_rpcs.go`.

#### Choosing between approaches

| | Approach A (recompute on global) | Approach B (KDS RPC to zone) |
|---|---|---|
| **Use when** | Global has sufficient synced data to derive the answer | Data is only available on the zone CP |
| **Latency** | Lower; no cross-CP round trip | Higher; requires a KDS stream round trip |
| **Availability** | Works even if the zone is disconnected (stale but available) | Fails if the zone is disconnected |
| **Complexity** | Lower; standard API handler | Higher; requires stream management and inter-CP forwarding |

Both approaches keep the global store free of zone-local state and avoid the
consistency issues that come with trying to keep aggregated status up to date.

#### Pre-existing exception: Insight resources

`MeshInsight` and `ServiceInsight` are legacy resources that predate these rules.
They are computed and stored by the resyncer (`pkg/insights/resyncer.go`) on
global CP and non-federated zone CPs, aggregating dataplane statistics across
the mesh. These resources have `HasStatus: false` (they are spec-only and
read-only) and are not status sub-resources in the sense of this MADR.

New features MUST NOT follow this pattern. The Insight resources are acknowledged
as a pre-existing exception, not a precedent.

## Security implications and review

No new security implications. These rules reduce the risk of status data from
a compromised zone propagating to other zones.

## Reliability implications

Enforcing zone-local status computation improves reliability:

- No cross-zone status overwrites that could cause VIP/hostname loss
- No dependency on global CP for status computation
- A zone can operate autonomously for status even when disconnected from global

## Implications for Kong Mesh

None. These rules apply uniformly to all deployments.

## Decision

Status is always zone-specific and MUST only be computed on zone CPs.
Global CP MUST NOT compute status. Cross-zone status visibility requires a
dedicated computed API. Data that needs to travel across zones belongs in `spec`.

These rules consolidate and strengthen existing conventions from MADRs 014, 039,
044, 051, 056, 072, and 096.