vpa/admission-controller: limit request payload size to 5MB#9690
vpa/admission-controller: limit request payload size to 5MB#9690sophieliu15 wants to merge 1 commit into
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
This issue is currently awaiting triage. If SIG Autoscaling contributors determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sophieliu15 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @sophieliu15. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
daf6643 to
d8dd4c0
Compare
adrianmoisey
left a comment
There was a problem hiding this comment.
Heya,
Thanks for the PR.
I've got two small comments for you
| { | ||
| name: "Oversized payload exceeding 5MB limit", | ||
| isEndless: true, | ||
| expectedStatus: http.StatusOK, // Fails open, returning 200 OK with unmarshal error rather than crashing |
There was a problem hiding this comment.
Should we rather be expecting a 413 from the server here?
There was a problem hiding this comment.
Good question! I chose to let the read error "fail open" (proceeding with a 200 OK rather than a 413) for two reasons:
-
Preserving Webhook Behavior: This design represents a minimal change that strictly preserves the VPA webhook's existing, permissive "fail-open" philosophy for JSON parsing/unmarshaling errors (which defaults to Allowed: true). It guarantees the webhook is protected from OOM crashes without changing VPA's business logic or breaking backward compatibility.
-
Preventing Cluster Scheduling Outages Under DoS: Mutating webhooks are inline to all Pod creations/updates. If the webhook is under a heavy DoS flood and we return a 413 HTTP error:
- The API Server treats non-200 responses as webhook infrastructure failures.
- If the cluster operator has VPA configured with
failurePolicy: Fail, the API Server will immediately block and reject all legitimate user Pod scheduling requests. - By using
http.MaxBytesReadercombined with 200 OK, we discard oversized payloads with minimal resource consumption and do not report this as infrastructure failure to the API Server, ensuring minimal impact on cluster-wide scheduling under DoS.
There was a problem hiding this comment.
I'm not sure that's true. This changes the logic to respond with a 200, without the expected payload, see https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#response
api-server will treat this as a failure, no matter the HTTP code.
| var body []byte | ||
| if r.Body != nil { | ||
| if data, err := io.ReadAll(r.Body); err == nil { | ||
| if data, err := io.ReadAll(io.LimitReader(r.Body, maxAdmissionPayloadSize)); err == nil { |
There was a problem hiding this comment.
Does it make sense to rather use http.MaxBytesReader ? It seems to be purpose built for HTTP
There was a problem hiding this comment.
Great catch! I have updated the PR to use http.MaxBytesReader instead of io.LimitReader.
Add a defensive 5MB payload size cap using on the HTTP request body in the VPA Admission Controller webhook. This prevents arbitrary/maliciously large payload requests from exhausting memory resources and triggering Out-of-Memory (OOM) crashes (Denial of Service). The webhook continues to fail open on reading or unmarshaling errors to protect cluster scheduling availability.
d8dd4c0 to
273dd83
Compare
| if data, err := io.ReadAll(http.MaxBytesReader(w, r.Body, maxAdmissionPayloadSize)); err == nil { | ||
| body = data | ||
| } else { | ||
| klog.ErrorS(err, "Failed to read admission request body (payload may exceed 5MB limit)") |
There was a problem hiding this comment.
How can we be sure that the error here is because of that reason?
Also, printing out 5MB without a reference to the const value can cause errors.
I would expect something like this:
diff --git a/vertical-pod-autoscaler/pkg/admission-controller/logic/server.go b/vertical-pod-autoscaler/pkg/admission-controller/logic/server.go
index d4ee79e64..47c74d431 100644
--- a/vertical-pod-autoscaler/pkg/admission-controller/logic/server.go
+++ b/vertical-pod-autoscaler/pkg/admission-controller/logic/server.go
@@ -19,6 +19,7 @@ package logic
import (
"context"
"encoding/json"
+ "errors"
"fmt"
"io"
"net/http"
@@ -200,7 +201,12 @@ func (s *AdmissionServer) Serve(w http.ResponseWriter, r *http.Request) {
if data, err := io.ReadAll(http.MaxBytesReader(w, r.Body, maxAdmissionPayloadSize)); err == nil {
body = data
} else {
- klog.ErrorS(err, "Failed to read admission request body (payload may exceed 5MB limit)")
+ var maxBytesErr *http.MaxBytesError
+ if errors.As(err, &maxBytesErr) {
+ klog.ErrorS(err, "Admission request body exceeds size limit", "limit", maxAdmissionPayloadSize)
+ } else {
+ klog.ErrorS(err, "Failed to read admission request body")
+ }
}
}
| } | ||
|
|
||
| func TestServePayloadLimit(t *testing.T) { | ||
| tests := []struct { |
There was a problem hiding this comment.
Can we add more tests here for my above comment?
Add a defensive 5MB payload size cap using on the HTTP request body in the VPA Admission Controller webhook.
This prevents arbitrary/maliciously large payload requests from exhausting memory resources and triggering Out-of-Memory (OOM) crashes (Denial of Service). The webhook continues to fail open on reading or unmarshaling errors to protect cluster scheduling availability.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR adds a defensive 5MB payload size limit to the Vertical Pod Autoscaler (VPA) admission controller webhook server.
Prior to this change, the webhook handler used io.ReadAll(r.Body) to read the incoming HTTP request body into memory without bounds checking. A maliciously large payload (e.g. an endless stream of data) sent to the webhook endpoint could consume all memory resources and trigger an Out-of-Memory (OOM) crash. Since the admission controller is a critical component for scheduling, its crash could lead to a Denial of Service (DoS) blocking all new pod creation in the cluster if configured to fail closed.
By wrapping r.Body with io.LimitReader(r.Body, 5MB), we safely cap the maximum memory allocation during the read step. If the payload is truncated or fails parsing, the webhook gracefully respects its permissive "fail-open" contract to allow scheduling to proceed, while protecting its own server process from OOM.
Special notes for your reviewer:
The 5MB limit was selected to comfortably accommodate valid massive Kubernetes update requests (which contain both the object and oldObject payloads, each up to etcd's default 1.5MB limit plus metadata overhead) while remaining well below standard container memory limits to eliminate OOM risk.
Unit tests have been added to server_test.go to cover both normal and oversized payload scenarios.
Does this PR introduce a user-facing change?
NONE