
Proposal: Lightweight cross-provider GPU support #4122

@hjames9

Description


GPU Support Proposal for kind

Context

This proposal addresses issue #3164 (Support GPUs) by suggesting a lightweight implementation that incorporates the lessons from previously rejected PRs.

History of GPU Support Attempts

Several PRs have attempted to add GPU support but were not merged.

Key Maintainer Concerns from Previous PRs

  1. Runtime implementation leakage: Exposing that kind uses docker run with specific flags
  2. Provider portability: Solutions must work across Docker, Podman, and nerdctl
  3. Dependency minimalism: Avoid adding heavy dependency trees
  4. CRI alignment: Preference for approaches aligned with Kubernetes CRI patterns

Proposed Solution

Add a simple, provider-agnostic gpus field to the node configuration that maps to provider-specific runtime flags internally.

User Configuration (Provider-Agnostic)

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: worker
    gpus: "all"  # Start with "all", extensible to specific GPUs later

Internal Provider Mapping

The gpus field would be translated to provider-specific flags:

  • Docker: --gpus all
  • Podman: --device nvidia.com/gpu=all (CDI syntax, no library dependency)
  • nerdctl: --gpus all
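The mapping above is simple enough to implement with plain string handling. A minimal sketch in Go, assuming hypothetical names (the Provider constants and gpuFlags function are illustrative, not kind's actual internals):

```go
package main

import "fmt"

// Provider identifies the container runtime backing the cluster.
type Provider string

const (
	ProviderDocker  Provider = "docker"
	ProviderPodman  Provider = "podman"
	ProviderNerdctl Provider = "nerdctl"
)

// gpuFlags translates a node's gpus setting into the
// provider-specific CLI arguments described in the proposal.
func gpuFlags(p Provider, gpus string) []string {
	if gpus == "" {
		// No GPU request: emit no flags.
		return nil
	}
	switch p {
	case ProviderPodman:
		// Podman uses CDI device syntax; no CDI library is needed
		// for simple string concatenation.
		return []string{"--device", "nvidia.com/gpu=" + gpus}
	default:
		// Docker and nerdctl share the --gpus flag.
		return []string{"--gpus", gpus}
	}
}

func main() {
	fmt.Println(gpuFlags(ProviderDocker, "all"))  // [--gpus all]
	fmt.Println(gpuFlags(ProviderPodman, "all"))  // [--device nvidia.com/gpu=all]
}
```

Because the field is an opaque string passed through to the runtime, extending it later (for example to device indices) only changes validation, not the mapping logic.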

Implementation Scope

  1. Add GPUs string field to Node struct in config API
  2. Validate that only "all" is supported initially
  3. Implement flag generation in all three providers (docker/podman/nerdctl)
  4. Run make generate for deepcopy methods
  5. Add unit tests for validation and flag generation
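Steps 1 and 2 amount to a one-field struct change plus a guard. A sketch, assuming hypothetical names (this Node struct and validateGPUs are illustrative stand-ins for kind's config API, not its actual types):

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a simplified stand-in for the node config type.
type Node struct {
	Role string `yaml:"role"`
	// GPUs requests GPU access for this node's container.
	// Only "all" is accepted initially; keeping it a string leaves
	// room for future values such as specific device indices.
	GPUs string `yaml:"gpus,omitempty"`
}

// validateGPUs enforces the initial "all"-only restriction.
func validateGPUs(n Node) error {
	if n.GPUs != "" && n.GPUs != "all" {
		return errors.New(`invalid gpus value: only "all" is supported`)
	}
	return nil
}

func main() {
	fmt.Println(validateGPUs(Node{Role: "worker", GPUs: "all"}))
	fmt.Println(validateGPUs(Node{Role: "worker", GPUs: "0,1"}))
}
```

Rejecting everything but "all" up front means the field can be loosened later without a breaking change, whereas accepting arbitrary strings now would lock in whatever the runtimes happen to do with them.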

Estimated changes: ~100 lines across 6 files, no new dependencies.

Why This Approach

Advantages

  • Cross-provider: Works on Docker, Podman, and nerdctl from day one
  • Zero dependencies: Simple string matching, no CDI library needed
  • Follows existing patterns: Mirrors extraMounts and extraPortMappings
  • Extensible: String field allows future support like "0,1" for specific GPUs
  • Minimal code: Small, maintainable change

Acknowledged Limitations

  • Still exposes runtime details: Like extraPortMappings, this reveals that kind drives the container runtime CLI
  • Not fully CRI-aligned: Kubernetes uses device plugins, not runtime flags
  • Runtime-specific behavior: Each provider may handle GPUs slightly differently

The Core Question for Maintainers

Is a cross-provider, zero-dependency GPU field acceptable despite exposing runtime implementation details?

The alternative (full CDI library integration) was discussed in PR #3290 but raised dependency concerns. Given that:

  • Issue #3164 (Support GPUs) shows GPU support is desired (milestone v0.20.0)
  • The "pure" approach (CDI library) adds unwanted dependencies
  • Users are currently blocked on GPU workloads in kind

Would you accept a pragmatic solution that follows the extraMounts/extraPortMappings pattern, or would you prefer to wait for a more architecturally pure approach even if it means heavier dependencies?

Open Questions

  1. Is the proposed API (gpus: "all") acceptable, or would you prefer a different structure?
  2. Should this be per-node or cluster-wide configuration?
  3. Are there concerns with the provider-specific flag mapping approach?
  4. Would you want runtime version detection to provide better error messages?

Alternative Approaches Considered

  1. Full CDI library integration (PR #3290, "Add API for CDI --devices flag in Docker and Podman for mapping GPUs"): Rejected due to dependency concerns
  2. Generic extraArgs field: Would expose too many runtime internals
  3. Cluster-wide GPU setting: Less flexible than per-node configuration
  4. DeviceRequests API: Would require runtime-specific API calls, not exec-based

I'd appreciate feedback on whether this approach addresses your concerns from the previous GPU PRs, or if there are architectural issues that would prevent merging regardless of implementation quality.

@BenTheElder @aojea
