GPU Support Proposal for kind
Context
This proposal addresses issue #3164 (Support GPUs) by suggesting a lightweight implementation approach that learns from previous rejected PRs.
History of GPU Support Attempts
Several PRs have attempted to add GPU support but were not merged:
PR #1886 ("add gpus support in docker provider", 2021): Added a Docker-specific --gpus flag. Rejected because it exposed runtime implementation details and wasn't portable across providers.
PR #3257 ("Add GPU support", 2023): Similar Docker-focused approach. Closed by the author after feedback recommending CDI instead.
PR #3290 (full CDI library integration): Not merged due to:
Heavy dependency tree (CDI library + transitive deps)
Concerns about maintaining kind's "minimal dependencies" philosophy
Merge conflicts after an extended discussion period
Key Maintainer Concerns from Previous PRs
Runtime implementation leakage: Exposing that kind uses docker run with specific flags
Provider portability: Solutions must work across Docker, Podman, and nerdctl
Dependency minimalism: Avoid adding heavy dependency trees
CRI alignment: Preference for approaches aligned with Kubernetes CRI patterns
Proposed Solution
Add a simple, provider-agnostic gpus field to the node configuration that maps to provider-specific runtime flags internally.
User Configuration (Provider-Agnostic)
```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: worker
  gpus: "all" # Start with "all", extensible to specific GPUs later
```
Internal Provider Mapping
The gpus field would be translated to provider-specific flags:
Docker: --gpus all
Podman: --device nvidia.com/gpu=all (CDI syntax, no library dependency)
nerdctl: --gpus all
Implementation Scope
Add a GPUs string field to the Node struct in the config API
Validate that only "all" is supported initially
Implement flag generation in all three providers (docker/podman/nerdctl)
Run make generate to regenerate deepcopy methods
Add unit tests for validation and flag generation
Estimated changes: ~100 lines across 6 files, no new dependencies.
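The config and validation steps above can be sketched as follows. The type and function names are illustrative assumptions, not kind's actual v1alpha4 types:

```go
package main

import "fmt"

// Node mirrors the proposed per-node config field.
// Hypothetical sketch of the shape, not kind's real struct.
type Node struct {
	Role string
	GPUs string
}

// validateGPUs enforces the initially supported values:
// empty (no GPUs requested) or "all".
func validateGPUs(n Node) error {
	switch n.GPUs {
	case "", "all":
		return nil
	default:
		return fmt.Errorf("invalid gpus value %q: only \"all\" is supported", n.GPUs)
	}
}

func main() {
	fmt.Println(validateGPUs(Node{Role: "worker", GPUs: "all"}))
	fmt.Println(validateGPUs(Node{Role: "worker", GPUs: "0,1"}))
}
```

Keeping the field a plain string means later relaxing validation to accept values like "0,1" would be backward compatible.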
Why This Approach
Advantages
Cross-provider: Works on Docker, Podman, and nerdctl from day one
Zero dependencies: Simple string matching, no CDI library needed
Follows existing patterns: Mirrors extraMounts and extraPortMappings
Extensible: String field allows future support like "0,1" for specific GPUs
Minimal code: Small, maintainable change
Acknowledged Limitations
Still exposes runtime details: Like extraPortMappings, this reveals that kind drives the container runtime CLI
Not fully CRI-aligned: Kubernetes uses device plugins, not runtime flags
Runtime-specific behavior: Each provider may handle GPUs slightly differently
The Core Question for Maintainers
Is a cross-provider, zero-dependency GPU field acceptable despite exposing runtime implementation details?
The alternative (full CDI library integration) was discussed in PR #3290 but raised dependency concerns. Given that:
Issue #3164 (Support GPUs) shows GPU support is desired (milestone v0.20.0)
The "pure" approach (CDI library) adds unwanted dependencies
Users are currently blocked on GPU workloads in kind
Would you accept a pragmatic solution that follows the extraMounts/extraPortMappings pattern, or would you prefer to wait for a more architecturally pure approach even if it means heavier dependencies?
Open Questions
Is the proposed API (gpus: "all") acceptable, or would you prefer a different structure?
Should this be per-node or cluster-wide configuration?
Are there concerns with the provider-specific flag mapping approach?
Would you want runtime version detection to provide better error messages?
Alternative Approaches Considered
Generic extraArgs field: Would expose too many runtime internals
Cluster-wide GPU setting: Less flexible than per-node configuration
DeviceRequests API: Would require runtime-specific API calls, not exec-based
I'd appreciate feedback on whether this approach addresses your concerns from the previous GPU PRs, or if there are architectural issues that would prevent merging regardless of implementation quality.
References:
@BenTheElder @aojea