Skip to content

docs(MADR): add MADR 101 for status syncing rules in multizone#16083

Open
lahabana wants to merge 1 commit intokumahq:masterfrom
lahabana:madr-status-syncing
Open

docs(MADR): add MADR 101 for status syncing rules in multizone#16083
lahabana wants to merge 1 commit intokumahq:masterfrom
lahabana:madr-status-syncing

Conversation

@lahabana
Copy link
Copy Markdown
Contributor

@lahabana lahabana commented Apr 1, 2026

Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056, 072, and 096 into explicit rules for resource status ownership and syncing in multizone deployments.

Consolidate conventions scattered from MADRs 014, 039, 044, 051, 056,
072, and 096 into explicit rules for resource status ownership and
syncing in multizone deployments.

Signed-off-by: Charly Molter <charly.molter@konghq.com>
Copilot AI review requested due to automatic review settings April 1, 2026 12:38
@lahabana lahabana requested a review from a team as a code owner April 1, 2026 12:38
@lahabana lahabana requested review from Automaat and lukidzi April 1, 2026 12:38
@lahabana lahabana changed the title docs(madr): add MADR 101 for status syncing rules in multizone docs(MADR): add MADR 101 for status syncing rules in multizone Apr 1, 2026
@lahabana lahabana added the ci/skip-test PR: Don't run unit and e2e tests (maybe this is just a doc change) label Apr 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds MADR 101 to consolidate and make explicit the project’s multizone conventions around resource status ownership, computation, and KDS syncing, so future resource/API work follows a consistent single-writer model and avoids cross-zone status overwrites.

Changes:

  • Introduces a unified set of rules: status is zone-local, computed only on zone CPs, and not synced onward cross-zone.
  • Documents the existing implementation mechanics that enforce this (global→zone status stripping + zone-side ignore/preserve behavior).
  • Clarifies how to provide global-level visibility without storing/aggregating global-computed status (computed endpoints vs KDS reverse-unary RPC forwarding).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Reviewer Checklist

🔍 Each of these sections need to be checked by the reviewer of the PR 🔍:
If something doesn't apply please check the box and add a justification if the reason is non obvious.

  • Is the PR title satisfactory? Is this part of a larger feature and should be grouped using > Changelog?
  • PR description is clear and complete. It Links to relevant issue as well as docs and UI issues
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as an image registry)
  • IPv6 is taken into account (.e.g: no string concatenation of host port)
  • Tests (Unit test, E2E tests, manual test on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests
  • Does this contain a change that needs to be notified to users? In this case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (this GH action will add "backport" label based on these file globs, if you want to prevent it from adding the "backport" label use no-backport-autolabel label)

4. Computing status on global would be costly and scale in O(number of entities globally)
which we try to avoid.

Previous MADRs addressed this:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the list above and below is missing 014 and 072.

**independently** computing or aggregating status:

Global CP MUST NOT:
- Independently compute or populate status fields on resources
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Independently compute or populate status fields on resources
- Independently compute or populate status fields on resources (passively storing zone-synced data via KDS is fine)

dedicated API. Data that needs to travel across zones belongs in `spec`.

These rules consolidate and strengthen existing conventions from MADRs 014, 039,
044, 051, 056, 072, and 096.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does enforcement look like? The guards already exist in code but they're ad-hoc per component. Is code review sufficient or should there be a systematic check?

Comment on lines +17 to +19
1. Resources are authored on one control plane (zone or global) and synced to
others via KDS.
2. Status contains information that is inherently local to the zone where the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples (VIPs, hostnames, proxy counts, availability) seem MeshService-specific - what about the other resources with HasStatus: true (MeshIdentity, MeshOpenTelemetryBackend, Workload, MeshTrust, MeshTrace, MeshMetric, MeshAccessLog, HostnameGenerator)? Should the framing be broader to cover those cases or is this MADR only about Mesh*Service?

Comment on lines +9 to +13
Across multiple MADRs (039, 044, 051, 056, 096) we have established conventions
for how resource `status` is handled in multizone deployments. These conventions
are scattered and implicit. This MADR consolidates the rules into a single
reference and strengthens them with an explicit prohibition on computing status
on Global CP.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context says these conventions are "scattered and implicit" - are there implicit ones in the code that aren't captured by any of the referenced MADRs?

Looking at the codebase:

  1. Every status updater and VIP allocator is gated by rt.GetMode() == config_core.Zone (and pkg/ipam/components.go).
  2. IsLocalMeshService() filtering in the status updater

Is it worth mentioning these as existing enforcement mechanisms, or at least acknowledging them so future contributors know the guards already exist?

(`pkg/kds/v2/store/sync.go`) ensures that locally-computed status is preserved
and not overwritten by the empty status arriving from global.

**Implication: global-originated resources have no status on global.** Since
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to compute it for some resources? For MeshService-style status (VIPs, hostnames) the answer is clear - global lacks the zone-local inputs.

Is the same true for MeshOpenTelemetryBackend, MeshIdentity etc.?

Was this considered?

resource instance's status.

Note that computed spec fields like `Spec.Identities` and `Spec.State` (see
Rule 4) are also written by the local zone's status updater, not by the user.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit related: should we consider rename since it's no longer "status updater"? computed fields reconciler?

This separation eliminates race conditions between spec updates from the origin
and status updates on the destination (the scenario described in MADR 056).

### Rule 4: Data that must cross zone boundaries belongs in spec
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rule 4 formalizes "computed spec fields" as a concept but there's no structural way to distinguish them from user-authored spec fields. Spec.Identities and Spec.State sit next to user-authored fields. Should we come up with a naming pattern for this?


#### Choosing between approaches

| | Approach A (recompute on global) | Approach B (KDS RPC to zone) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm leaning towards A

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/skip-test PR: Don't run unit and e2e tests (maybe this is just a doc change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants