OCPBUGS-81761: Cloud ClusterHostedDNS Load Balancer IPs need to returned in a predictable order#5870
OCPBUGS-81761: Cloud ClusterHostedDNS Load Balancer IPs need to returned in a predictable order#5870sadasu wants to merge 2 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@sadasu: This pull request references Jira Issue OCPBUGS-81761, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughAdded deterministic sorting of load-balancer IP slices by cloning and ordering IP lists before return; introduced a unit test that verifies stable ordering across GCP, AWS, and Azure using varied input permutations. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/jira refresh |
|
@sadasu: This pull request references Jira Issue OCPBUGS-81761, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/controller/template/render.go (1)
823-829: Consider deduplicating the clone+sort logic into one helper.Lines 823-853 repeat the same transformation three times; a shared helper would reduce drift risk.
♻️ Suggested refactor
+func cloudPlatformSortedLoadBalancerIPs(cfg RenderConfig, lbType LoadBalancerType) (interface{}, error) { + ips, err := cloudPlatformLoadBalancerIPs(cfg, lbType) + if err != nil { + return nil, err + } + ipsClone := slices.Clone(ips.([]configv1.IP)) + slices.Sort(ipsClone) + return ipsClone, nil +} + func cloudPlatformAPIIntLoadBalancerIPs(cfg RenderConfig) (interface{}, error) { - ips, err := cloudPlatformLoadBalancerIPs(cfg, apiIntLB) - if err != nil { - return nil, err - } - ipsClone := slices.Clone(ips.([]configv1.IP)) - slices.Sort(ipsClone) - return ipsClone, nil + return cloudPlatformSortedLoadBalancerIPs(cfg, apiIntLB) } func cloudPlatformAPILoadBalancerIPs(cfg RenderConfig) (interface{}, error) { - ips, err := cloudPlatformLoadBalancerIPs(cfg, apiLB) - if err != nil { - return nil, err - } - ipsClone := slices.Clone(ips.([]configv1.IP)) - slices.Sort(ipsClone) - return ipsClone, nil + return cloudPlatformSortedLoadBalancerIPs(cfg, apiLB) } func cloudPlatformIngressLoadBalancerIPs(cfg RenderConfig) (interface{}, error) { - ips, err := cloudPlatformLoadBalancerIPs(cfg, ingressLB) - if err != nil { - return nil, err - } - ipsClone := slices.Clone(ips.([]configv1.IP)) - slices.Sort(ipsClone) - return ipsClone, nil + return cloudPlatformSortedLoadBalancerIPs(cfg, ingressLB) }Also applies to: 835-841, 847-853
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/template/render.go` around lines 823 - 829, The clone+sort pattern used after cloudPlatformLoadBalancerIPs (taking ips, err, doing slices.Clone then slices.Sort and returning) should be extracted into a small helper (e.g., normalizeIPs or cloneAndSortIPs) to avoid repeating the same logic in render.go; implement a helper that accepts the returned ips type ([]configv1.IP), performs the slice clone and sort, and returns the sorted []configv1.IP, then replace the three inline blocks (the code using ips, ipsClone and slices.Sort) to call that helper from where cloudPlatformLoadBalancerIPs is used.pkg/controller/template/cloudplatform_lb_test.go (1)
87-89: Harden the fixture against slice aliasing side effects.Lines 87-89 reuse the same backing slice for all three LB fields; cloning per field makes the test more robust against accidental in-place mutation in future changes.
🧪 Suggested test hardening
+import "slices" ... - lbConfig.ClusterHosted.APIIntLoadBalancerIPs = ips - lbConfig.ClusterHosted.APILoadBalancerIPs = ips - lbConfig.ClusterHosted.IngressLoadBalancerIPs = ips + lbConfig.ClusterHosted.APIIntLoadBalancerIPs = slices.Clone(ips) + lbConfig.ClusterHosted.APILoadBalancerIPs = slices.Clone(ips) + lbConfig.ClusterHosted.IngressLoadBalancerIPs = slices.Clone(ips)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/template/cloudplatform_lb_test.go` around lines 87 - 89, The test currently assigns the same backing slice to three fields (lbConfig.ClusterHosted.APIIntLoadBalancerIPs, APILoadBalancerIPs, IngressLoadBalancerIPs) which risks aliasing; change each assignment to use a cloned slice (e.g., append([]string(nil), ips...) or make+copy) so each field gets its own backing array, preventing in-place mutations of one field from affecting the others during the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/controller/template/cloudplatform_lb_test.go`:
- Around line 87-89: The test currently assigns the same backing slice to three
fields (lbConfig.ClusterHosted.APIIntLoadBalancerIPs, APILoadBalancerIPs,
IngressLoadBalancerIPs) which risks aliasing; change each assignment to use a
cloned slice (e.g., append([]string(nil), ips...) or make+copy) so each field
gets its own backing array, preventing in-place mutations of one field from
affecting the others during the tests.
In `@pkg/controller/template/render.go`:
- Around line 823-829: The clone+sort pattern used after
cloudPlatformLoadBalancerIPs (taking ips, err, doing slices.Clone then
slices.Sort and returning) should be extracted into a small helper (e.g.,
normalizeIPs or cloneAndSortIPs) to avoid repeating the same logic in render.go;
implement a helper that accepts the returned ips type ([]configv1.IP), performs
the slice clone and sort, and returns the sorted []configv1.IP, then replace the
three inline blocks (the code using ips, ipsClone and slices.Sort) to call that
helper from where cloudPlatformLoadBalancerIPs is used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: efa5ff8c-f5db-4ecb-9897-297ac10052d6
📒 Files selected for processing (2)
pkg/controller/template/cloudplatform_lb_test.gopkg/controller/template/render.go
Add a test that checks to see if there are multiple LB IPs, they are returned in a predictable order so that there is no machine-config controller churn.
ccf282d to
886031d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/controller/template/render.go (1)
820-825: Strengthen type-safety by removinginterface{}from IP sorting path.This helper currently depends on an unchecked assertion and can panic if the upstream return type ever drifts. Prefer typed
[]configv1.IPend-to-end.Proposed refactor
-func cloudPlatformLoadBalancerIPs(cfg RenderConfig, lbType LoadBalancerType) (interface{}, error) { +func cloudPlatformLoadBalancerIPs(cfg RenderConfig, lbType LoadBalancerType) ([]configv1.IP, error) { -func cloneAndSortIPs(ips interface{}) []configv1.IP { - ipsSorted := slices.Clone(ips.([]configv1.IP)) +func cloneAndSortIPs(ips []configv1.IP) []configv1.IP { + ipsSorted := slices.Clone(ips) slices.Sort(ipsSorted) return ipsSorted }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/template/render.go` around lines 820 - 825, The cloneAndSortIPs helper currently accepts an interface{} and does an unchecked type assertion which can panic; change its signature to accept a typed []configv1.IP (func cloneAndSortIPs(ips []configv1.IP) []configv1.IP), remove the assertion inside, perform slices.Clone and slices.Sort directly on the typed slice, and then update all callers to pass a []configv1.IP rather than interface{} (adjust call sites that currently pass generic values or forward upstream returns to ensure they return the concrete []configv1.IP type).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/controller/template/render.go`:
- Around line 820-825: The cloneAndSortIPs helper currently accepts an
interface{} and does an unchecked type assertion which can panic; change its
signature to accept a typed []configv1.IP (func cloneAndSortIPs(ips
[]configv1.IP) []configv1.IP), remove the assertion inside, perform slices.Clone
and slices.Sort directly on the typed slice, and then update all callers to pass
a []configv1.IP rather than interface{} (adjust call sites that currently pass
generic values or forward upstream returns to ensure they return the concrete
[]configv1.IP type).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 95704a49-5dfe-44e7-8182-508b79a0c65a
📒 Files selected for processing (2)
pkg/controller/template/cloudplatform_lb_test.gopkg/controller/template/render.go
886031d to
33aeb6e
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/controller/template/render.go (1)
830-855: Prefer typed helper returns over unchecked assertions.
ips.([]configv1.IP)can panic ifcloudPlatformLoadBalancerIPsever changes concrete return type. Returning[]configv1.IPdirectly from the helper makes this path compile-time safe.Proposed refactor
-func cloudPlatformLoadBalancerIPs(cfg RenderConfig, lbType LoadBalancerType) (interface{}, error) { +func cloudPlatformLoadBalancerIPs(cfg RenderConfig, lbType LoadBalancerType) ([]configv1.IP, error) { ... } func cloudPlatformAPIIntLoadBalancerIPs(cfg RenderConfig) (interface{}, error) { ips, err := cloudPlatformLoadBalancerIPs(cfg, apiIntLB) if err != nil { return nil, err } - return cloneAndSortIPs(ips.([]configv1.IP)), nil + return cloneAndSortIPs(ips), nil } func cloudPlatformAPILoadBalancerIPs(cfg RenderConfig) (interface{}, error) { ips, err := cloudPlatformLoadBalancerIPs(cfg, apiLB) if err != nil { return nil, err } - return cloneAndSortIPs(ips.([]configv1.IP)), nil + return cloneAndSortIPs(ips), nil } func cloudPlatformIngressLoadBalancerIPs(cfg RenderConfig) (interface{}, error) { ips, err := cloudPlatformLoadBalancerIPs(cfg, ingressLB) if err != nil { return nil, err } - return cloneAndSortIPs(ips.([]configv1.IP)), nil + return cloneAndSortIPs(ips), nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/template/render.go` around lines 830 - 855, The helper cloudPlatformLoadBalancerIPs should return a concrete type ([]configv1.IP, error) instead of interface{} so callers don't need unchecked assertions; change cloudPlatformLoadBalancerIPs signature and its internal returns to ([]configv1.IP, error), update callers cloudPlatformAPILoadBalancerIPs, cloudPlatformIngressLoadBalancerIPs and the other caller that passes apiIntLB/apiLB/ingressLB to accept that typed return, and remove the ips.([]configv1.IP) assertions—pass the typed slice directly into cloneAndSortIPs (which expects []configv1.IP) and propagate errors as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/controller/template/render.go`:
- Around line 830-855: The helper cloudPlatformLoadBalancerIPs should return a
concrete type ([]configv1.IP, error) instead of interface{} so callers don't
need unchecked assertions; change cloudPlatformLoadBalancerIPs signature and its
internal returns to ([]configv1.IP, error), update callers
cloudPlatformAPILoadBalancerIPs, cloudPlatformIngressLoadBalancerIPs and the
other caller that passes apiIntLB/apiLB/ingressLB to accept that typed return,
and remove the ips.([]configv1.IP) assertions—pass the typed slice directly into
cloneAndSortIPs (which expects []configv1.IP) and propagate errors as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 83597f20-3d4b-4705-ab5a-8cfcbbeeeb11
📒 Files selected for processing (1)
pkg/controller/template/render.go
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-custom-dns-techpreview |
|
@sadasu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b103b3c0-3e7d-11f1-8694-80806987c267-0 |
|
/test unit |
|
/test e2e-aws-ovn e2e-aws-ovn-upgrade e2e-gcp-op-part1 e2e-gcp-op-part2 e2e-gcp-op-single-node e2e-hypershift |
Return sorted list of Cloud Load Balancer IPs so that changes in order do not result in the CoreDNS pod yaml to be regenerated and cause machine config churn. Note that this error occurs only when the ClusterHostedDNS feature is enabled and not during regular AWS installs.
33aeb6e to
fb3a965
Compare
|
/test e2e-aws-ovn e2e-aws-ovn-upgrade e2e-gcp-op-part1 e2e-gcp-op-part2 e2e-gcp-op-single-node e2e-hypershift |
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sadasu, tthvo The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-custom-dns-techpreview |
|
@sadasu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4c40d2a0-3e92-11f1-8111-9fc7e68a182d-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-custom-dns-techpreview |
|
@sadasu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f261e3a0-3eaa-11f1-8fa0-6a065099d827-0 |
|
Install-config contained multiple zones: Infrastructure resource contained multiple LB IPs. ClusterHosted Load Balancer IPs: The master nodes were rebooted 5 times based on the boot IDs in the kubelet service logs:
And only one update to CoreDNS.yaml was made hence proving that sorting the IPs before rendering them ensured that IP addresses were not returned in random order. |
|
/verified by CI See #5870 (comment) and #5870 (comment) |
|
@sadasu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
|
/test e2e-hypershift |
|
@sadasu: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@isabella-janssen , @cheesesashimi Could you PTAL? |
|
|
||
| // cloneAndSortIPs clones and sorts IP addresses to ensure consistent ordering. | ||
| func cloneAndSortIPs(ips []configv1.IP) []configv1.IP { | ||
| ipsSorted := slices.Clone(ips) |
There was a problem hiding this comment.
Are we 100% sure about this change? I mean, order in DNS is important and discarding the original order and sort them numerically totally drops DNS NSs priority.
There was a problem hiding this comment.
I agree with this thought. In fact, the tests that I proposed in #5840 ensure that the MCO writes the configs while preserving the order in which the input was given.
Fixes: https://redhat.atlassian.net/browse/OCPBUGS-81761
With the current behavior, when there are multiple Cloud LB IPs for API, APIInt or Ingress, the IPs are returned in random order. This caused the CoreDNS pod template to be regenerated every time, causing machine config controller churn.
Added a fix to make sure the LB IPs are sorted before being used to generate the CoreDNS pod from its template. Also added an unit test to verify this behavior.
Summary by CodeRabbit
Bug Fixes
Tests