Refactor cloud provider builder into a dynamic registration pattern #9639

Open
Choraden wants to merge 1 commit into kubernetes:master from Choraden:register-pattern-for-providers

@Choraden
Contributor

@Choraden Choraden commented May 14, 2026

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This comprehensive refactoring transitions the Cluster Autoscaler's cloud provider initialization from a hardcoded, monolithic switch statement to a decoupled, dynamic registration pattern.

Motivation:
Previously, the core cloud provider builder maintained direct dependencies on every supported cloud provider implementation. This architectural coupling forced the Cluster Autoscaler to pull in a massive number of transitive dependencies for all providers (AWS, Azure, GCE, etc.) regardless of which one was actually used. This resulted in significant "dependency bloat," unnecessarily large binary sizes, and long build times. It was particularly burdensome for external forks or specialized deployments that only required a single provider.

Impact:
This change significantly improves the maintainability and extensibility of the Cluster Autoscaler. It paves the way for a more modular architecture where cloud providers can be treated as optional plugins, reducing the core's dependency footprint and making it easier for the community to contribute and maintain provider-specific logic.

One of the most important impacts of this refactor is on the health of Cluster Autoscaler forks. Previously, any project importing k8s.io/autoscaler/cluster-autoscaler was forced to inherit the massive transitive dependency tree of every single cloud provider (AWS, Azure, etc.) in its go.mod. With this change, we can finally import the core CA packages directly without that bloat. For example, in our GCE CA fork, we can now keep a clean go.mod that only contains the dependencies we actually use. This decoupling is a huge win for the maintainability of all external distributions and significantly reduces our exposure to upstream dependency issues.

I believe this invasive refactoring is essential for the health of the project and would benefit all maintainers of CA forks. It also reduces the attack surface and makes derivative projects more resilient, given the rising risk and prevalence of supply chain attacks.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Key Architectural Improvements:

  1. Centralized Registration: The cloud provider initialization logic is now centralized in the cloudprovider/builder package. Each cloud provider implementation is responsible for registering its own builder function via an init() block.

  2. True Decoupling: The core builder no longer has any direct knowledge or compile-time dependencies on specific provider implementations. It interacts solely with a registry of builder functions.

  3. Dynamic Provider Discovery: AvailableCloudProviders() is now a dynamic function that returns only the providers that have been registered in the current binary. This ensures that CLI help text (--help) and flag validation accurately reflect the capabilities of the specific build.

  4. Configurable Default Provider: The DefaultCloudProvider is no longer a hardcoded constant. It can now be set dynamically via the registry, allowing custom builds to define their own default provider without modifying core code.

  5. Modular Build Support: The set of supported providers in a binary is now entirely controlled by blank imports (e.g., in main.go). While standard builds continue to include all providers, this pattern enables the easy creation of optimized, provider-specific binaries.
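For readers who have not seen the pattern, the five points above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `RegisterCloudProviderBuilder`, the simplified `BuilderFunc` signature, and `fakeProvider` are hypothetical names chosen for the example (the real builders also take autoscaling options, discovery options and a resource limiter), and both providers are registered from one file only so the sketch runs on its own.

```go
package main

import (
	"fmt"
	"sort"
)

// CloudProvider stands in for the real cloudprovider.CloudProvider interface.
type CloudProvider interface {
	Name() string
}

// BuilderFunc is a hypothetical, simplified builder signature.
type BuilderFunc func() CloudProvider

var (
	builders        = map[string]BuilderFunc{}
	defaultProvider string
)

// RegisterCloudProviderBuilder is a hypothetical name for the hook that a
// provider package would call from its init() function.
func RegisterCloudProviderBuilder(name string, b BuilderFunc) {
	if _, dup := builders[name]; dup {
		panic("cloud provider registered twice: " + name)
	}
	builders[name] = b
}

// GetCloudProviderBuilder looks up a builder by provider name.
func GetCloudProviderBuilder(name string) (BuilderFunc, bool) {
	b, ok := builders[name]
	return b, ok
}

// AvailableCloudProviders returns the sorted names registered in this binary.
func AvailableCloudProviders() []string {
	names := make([]string, 0, len(builders))
	for n := range builders {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// SetDefaultCloudProvider sets the default cloud provider name.
func SetDefaultCloudProvider(name string) { defaultProvider = name }

// fakeProvider is a placeholder implementation used only in this sketch.
type fakeProvider struct{ name string }

func (f fakeProvider) Name() string { return f.name }

// In the real layout each registration would live in the provider's own
// package and run only when that package is (blank-)imported.
func init() {
	RegisterCloudProviderBuilder("gce", func() CloudProvider { return fakeProvider{"gce"} })
	RegisterCloudProviderBuilder("aws", func() CloudProvider { return fakeProvider{"aws"} })
	SetDefaultCloudProvider("gce")
}

func main() {
	fmt.Println(AvailableCloudProviders())
	if b, ok := GetCloudProviderBuilder(defaultProvider); ok {
		fmt.Println(b().Name())
	}
}
```

Because registration happens during package initialization, which Go runs in a single goroutine, the sketch uses a plain map without locking; a registry that is also mutated at runtime would need synchronization.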

Does this PR introduce a user-facing change?

Cluster Autoscaler has transitioned its cloud provider initialization to a dynamic registration pattern. This architectural shift decouples the core logic from specific provider implementations, addressing long-standing "dependency bloat" and improving overall maintainability.

No configuration changes are required for standard deployments. Developers creating custom Cluster Autoscaler binaries can now manage included providers via blank imports in their main package.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added labels May 14, 2026: release-note (Denotes a PR that will be considered when it comes time to generate release notes); kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt); needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one)
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If SIG Autoscaling contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added labels May 14, 2026: area/cluster-autoscaler (Issues or PRs related to the Cluster Autoscaler component); cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA); do-not-merge/needs-area (Indicates that a PR should not merge because it lacks an area label); area/provider/alicloud (Issues or PRs related to the AliCloud cloud provider implementation)
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Choraden
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added labels May 14, 2026: area/provider/aws (Issues or PRs related to aws provider); area/provider/azure (Issues or PRs related to azure provider); area/provider/cluster-api (Issues or PRs related to Cluster API provider); area/provider/coreweave; and removed label do-not-merge/needs-area (Indicates that a PR should not merge because it lacks an area label)
@k8s-ci-robot k8s-ci-robot added labels May 14, 2026: area/provider/digitalocean (Issues or PRs related to digitalocean provider); area/provider/equinixmetal (Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler); area/provider/exoscale; size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files); area/provider/externalgrpc (Issues or PRs related to the External gRPC provider); area/provider/gce; area/provider/hetzner (Issues or PRs related to Hetzner provider); area/provider/huaweicloud; area/provider/ionoscloud; area/provider/kwok (Issues or PRs related to the kwok cloud provider for Cluster Autoscaler); area/provider/linode (Issues or PRs related to linode provider); area/provider/magnum (Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler); area/provider/oci (Issues or PRs related to oci provider); area/provider/rancher
@k8s-ci-robot k8s-ci-robot added the area/provider/utho (Issues or PRs related to Utho provider) label May 14, 2026
@Choraden Choraden force-pushed the register-pattern-for-providers branch from 70fc12e to b34a32a Compare May 14, 2026 14:14
@mtrqq
Contributor

mtrqq commented May 14, 2026

I like the pattern; it's widely used by libraries that want to avoid importing implementation-specific packages, such as Go's database/sql. Considering that in the long term we'll be moving to a library form of the autoscaler repository, it makes sense to start taking such steps right now.

One thing which bothers me is the binary size inflation as we are removing support for build tags. Can you show the difference in binary sizes if we pick one provider (e.g. GCE) and compare it against a build from the current revision?
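As a point of reference for the database/sql comparison, the standard library exposes the same idea through `sql.Register` and `sql.Drivers`. The sketch below registers a do-nothing driver; `stubDriver` and `hasDriver` are hypothetical names used only for illustration, and a real driver would normally call `sql.Register` from the `init()` of its own package so that a blank import is enough to enable it.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// stubDriver is a hypothetical driver used only to illustrate registration.
type stubDriver struct{}

// Open satisfies driver.Driver; this stub never opens a real connection.
func (stubDriver) Open(name string) (driver.Conn, error) {
	return nil, errors.New("stub driver: connections are not supported")
}

// A real driver package would do this in its own init().
func init() {
	sql.Register("stub", stubDriver{})
}

// hasDriver reports whether a driver with the given name is registered
// in this binary.
func hasDriver(name string) bool {
	for _, d := range sql.Drivers() {
		if d == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasDriver("stub"))
}
```

The core package (`database/sql` here, the cloud provider builder in this PR) never imports any implementation; which implementations exist in a binary is decided entirely by what the main package imports.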

@Choraden
Contributor Author

Good catch. Tag-gated builds can easily be restored by mirroring the previous builder_providerX file structure, but in the main package. Each file would register a particular provider. For example:

//go:build gce

package main

// The blank import runs the GCE package's init(), which registers its builder.
import _ "k8s.io/autoscaler/cluster-autoscaler/cloudprovider/gce"

@mtrqq
Contributor

mtrqq commented May 15, 2026

That's reasonable, but it would require us to create a Go module per build tag in the root of cluster-autoscaler/, or to move main.go to cmd as per Go conventions, so let's refrain from doing that right now. If someone is still using this mechanism, it can easily be restored in a follow-up; I wouldn't consider it a breaking change.

Do you agree @jackfrancis @towca @BigDarkClown?

Contributor

@BigDarkClown BigDarkClown left a comment

I really like the pattern, it feels much better than the previous one.

When it comes to tags, I am okay with the current solution. Note that while the binary size will increase for provider-specific images, the main one will stay the same. If somebody already goes through the trouble of building a per-provider image, they are likely doing that in a fork already, and can adjust with minimal input.

return utho.BuildUtho(opts, do, rl)
case cloudprovider.CoreWeaveProviderName:
return coreweave.BuildCoreWeave(opts, do, rl)
if builder, ok := GetCloudProviderBuilder(opts.CloudProviderName); ok {
Collaborator

This function seems redundant after the changes, could we just roll its body inside NewCloudProvider()?

Contributor Author

Good idea. Done.
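For illustration, rolling the lookup into `NewCloudProvider()` might look roughly like the sketch below. Only `GetCloudProviderBuilder` and `NewCloudProvider` are names taken from this thread; the simplified signatures, the `staticProvider` type, the pre-populated `builders` map, and returning an error for an unknown provider are assumptions made so the example runs on its own, and the actual function in the PR may handle that case differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// CloudProvider stands in for the real cloudprovider.CloudProvider interface.
type CloudProvider interface{ Name() string }

// BuilderFunc is a hypothetical, simplified builder signature.
type BuilderFunc func() CloudProvider

// staticProvider is a placeholder implementation for this sketch.
type staticProvider string

func (s staticProvider) Name() string { return string(s) }

// builders is pre-populated here; in the registration pattern it would be
// filled by each provider package's init().
var builders = map[string]BuilderFunc{
	"gce": func() CloudProvider { return staticProvider("gce") },
}

// GetCloudProviderBuilder looks up a builder by provider name.
func GetCloudProviderBuilder(name string) (BuilderFunc, bool) {
	b, ok := builders[name]
	return b, ok
}

// NewCloudProvider contains the registry lookup directly, with no
// intermediate per-provider dispatch function.
func NewCloudProvider(name string) (CloudProvider, error) {
	if builder, ok := GetCloudProviderBuilder(name); ok {
		return builder(), nil
	}
	known := make([]string, 0, len(builders))
	for n := range builders {
		known = append(known, n)
	}
	sort.Strings(known)
	return nil, fmt.Errorf("unknown cloud provider %q; available: %s",
		name, strings.Join(known, ", "))
}

func main() {
	if p, err := NewCloudProvider("gce"); err == nil {
		fmt.Println(p.Name())
	}
	if _, err := NewCloudProvider("does-not-exist"); err != nil {
		fmt.Println(err)
	}
}
```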

Comment thread cluster-autoscaler/cloudprovider/kubemark/kubemark_other.go
Comment thread cluster-autoscaler/main.go Outdated
// The registration pattern allows for customizing the set of supported cloud providers
// by including or excluding these blank imports. This is particularly useful for
// external forks that want to avoid unnecessary dependencies.
_ "k8s.io/autoscaler/cluster-autoscaler/cloudprovider/alicloud"
Collaborator

It seems that we could preserve the tag-specific behavior pretty easily:

  • Create a new pkg with the same structure as the cloudprovider/builder pkg before these changes:
    • builder_<provider>.go file for each cloud provider, with the appropriate provider-specific build tag. This file would just contain the blank import for the appropriate provider.
    • builder_all.go file with the negative tags, only included if none of the provider-specific tags are used. This file would contain blank imports for all the providers, like here.
    • Maybe instead of having the blank imports which are pretty vague on their own, we actually put the whole init() functions in the new pkg instead of directly in cloudprovider/<provider>? IMO it'd make a lot of sense, we could name the new pkg something like cloudprovider/router?
  • Have main.go import this new pkg instead of directly doing the blank imports. This way we still have the benefits of NewCloudProvider/NewAutoscaler not depending on any cloud provider, but we also keep the ability to create a provider-specific binary without forking, using build tags.

Contributor Author

I added a router package that consolidates all the blank imports and allows for tag-based builds.
I couldn't put the init() funcs there, as those import the cloud providers themselves.
Forks are advised to import the desired cloudprovider package directly, bypassing the router.

@Choraden Choraden force-pushed the register-pattern-for-providers branch 2 times, most recently from 2edd8ab to 7f0872d Compare May 25, 2026 14:54
This comprehensive refactoring transitions the Cluster Autoscaler's cloud provider
initialization from a hardcoded, monolithic switch statement to a decoupled,
dynamic registration pattern.

Motivation:
Previously, the core cloud provider builder maintained direct dependencies on every
supported cloud provider implementation. This architectural coupling forced the
Cluster Autoscaler to pull in a massive number of transitive dependencies for
all providers (AWS, Azure, GCE, etc.) regardless of which one was actually used.
This resulted in significant "dependency bloat," unnecessarily large binary sizes,
and long build times. It was particularly burdensome for external forks or
specialized deployments that only required a single provider.

Key Architectural Improvements:

1. Centralized Registration: The cloud provider initialization logic is now
   centralized in the `cloudprovider/builder` package. Each cloud provider
   implementation is responsible for registering its own builder function via
   an `init()` block.

2. True Decoupling: The core builder no longer has any direct knowledge or
   compile-time dependencies on specific provider implementations. It interacts
   solely with a registry of builder functions.

3. Dynamic Provider Discovery: `AvailableCloudProviders()` is now a dynamic
   function that returns only the providers that have been registered in the
   current binary. This ensures that CLI help text (`--help`) and flag validation
   accurately reflect the capabilities of the specific build.

4. Configurable Default Provider: The `DefaultCloudProvider` is no longer a
   hardcoded constant. It can now be set dynamically via the registry, allowing
   custom builds to define their own default provider without modifying core code.

5. Modular Build Support: The set of supported providers in a binary is now
   entirely controlled by blank imports in cloudprovider/router. While standard builds
   continue to include all providers, this pattern enables the easy creation of
   optimized, provider-specific binaries.

Impact:
This change significantly improves the maintainability and extensibility of the
Cluster Autoscaler. It paves the way for a more modular architecture where
cloud providers can be treated as optional plugins, reducing the core's
dependency footprint and making it easier for the community to contribute and
maintain provider-specific logic.
@Choraden Choraden force-pushed the register-pattern-for-providers branch from 7f0872d to b97accd Compare May 25, 2026 15:00
@Choraden
Contributor Author

/retest

@BigDarkClown
Contributor

Looks good to me, @towca for approval.

limitations under the License.
*/

// Package router provides a centralized way to include cloud provider implementations
Collaborator

nit: IMO a README.md would be more discoverable than a go file buried in a directory of 30 go files. IDEs would also show it by default when you click the directory etc.

// Cloud providers must be explicitly imported to be registered in the builder.
// The registration pattern allows for customizing the set of supported cloud providers
// by including or excluding these blank imports. This is particularly useful for
// external forks that want to avoid unnecessary dependencies.
Collaborator

nit: I'd mention the provider-specific tags somehow in this comment.

}

// SetDefaultCloudProvider sets the default cloud provider name.
func SetDefaultCloudProvider(name string) {
Collaborator

It seems like we lost setting the default provider in the flow with provider-specific tags. It seems like we could restore it pretty easily:

  • In the init() functions in cloudprovider/<provider_name>, add a SetDefaultCloudProvider() call. This preserves the previous behavior of not having to pass the cloud provider flag if you build with a provider-specific tag.
  • In router_all.go, add an init() function that calls SetDefaultCloudProvider(gce) to preserve the GCE default when you build with no tags.

WDYT?
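One detail this proposal relies on: Go initializes imported packages before the package that imports them, so an init() in router_all.go would run after every provider's init() and its SetDefaultCloudProvider call would be the last one to take effect. The sketch below simulates that ordering in a single file. `providerInit` and `simulateBuild` are hypothetical helpers, real init() functions cannot be called explicitly like this, and the provider names are passed as plain strings purely for illustration.

```go
package main

import "fmt"

var defaultProvider string

// SetDefaultCloudProvider sets the default cloud provider name.
func SetDefaultCloudProvider(name string) { defaultProvider = name }

// providerInit mimics what a provider package's init() would do under the
// proposal: claim the default for itself.
func providerInit(name string) { SetDefaultCloudProvider(name) }

// simulateBuild replays package initialization for one binary: first the
// inits of the imported providers, then, for the untagged build only, the
// init of router_all.go, which runs last because it imports them.
func simulateBuild(importedProviders []string, routerAll bool) string {
	defaultProvider = ""
	for _, p := range importedProviders {
		providerInit(p)
	}
	if routerAll {
		SetDefaultCloudProvider("gce")
	}
	return defaultProvider
}

func main() {
	// Build with a provider-specific tag: only that provider is imported.
	fmt.Println(simulateBuild([]string{"aws"}, false))
	// Build with no tags: all providers imported, router_all.go sets GCE last.
	fmt.Println(simulateBuild([]string{"alicloud", "aws", "gce", "utho"}, true))
}
```

With last-writer-wins semantics, a single-provider build defaults to that provider and the untagged build keeps the GCE default, which matches the behavior described in the comment.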

Labels

  • area/cluster-autoscaler - Issues or PRs related to the Cluster Autoscaler component
  • area/provider/alicloud - Issues or PRs related to the AliCloud cloud provider implementation
  • area/provider/aws - Issues or PRs related to aws provider
  • area/provider/azure - Issues or PRs related to azure provider
  • area/provider/cluster-api - Issues or PRs related to Cluster API provider
  • area/provider/coreweave
  • area/provider/digitalocean - Issues or PRs related to digitalocean provider
  • area/provider/equinixmetal - Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler
  • area/provider/exoscale
  • area/provider/externalgrpc - Issues or PRs related to the External gRPC provider
  • area/provider/gce
  • area/provider/hetzner - Issues or PRs related to Hetzner provider
  • area/provider/huaweicloud
  • area/provider/ionoscloud
  • area/provider/kwok - Issues or PRs related to the kwok cloud provider for Cluster Autoscaler
  • area/provider/linode - Issues or PRs related to linode provider
  • area/provider/magnum - Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler
  • area/provider/oci - Issues or PRs related to oci provider
  • area/provider/rancher
  • area/provider/utho - Issues or PRs related to Utho provider
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/cleanup - Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.

6 participants