Block server-side apply upsert of existing objects on create#4270
Block server-side apply upsert of existing objects on create#4270guineveresaenger wants to merge 12 commits intomasterfrom
Conversation
Add a pre-flight GET before server-side apply during resource creation. If the target object already exists in the cluster, return an error directing users to aliases (for renames), `pulumi import`, or Patch resources, instead of silently updating the existing object. This prevents the dangerous create-then-delete pattern where renaming a Pulumi resource with the same metadata.name causes silent data loss (#2948), and the confusing "stuck in state" scenario when targeting protected resources like kube-system (#2926). The check runs on both preview and up, and is skipped for Patch resources (which are designed for existing objects). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the API server forbids deletion (e.g. protected namespaces like kube-system) and server-side apply is enabled, fall back to relinquishing managed fields instead of failing. This cleans up fields Pulumi added while leaving the protected resource intact, and allows the resource to be removed from Pulumi state without manual state surgery. This is a safety net for cases where a protected resource ends up in state via `pulumi import`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For users who rely on the upsert behavior (e.g. bulk-adopting existing cluster resources into Pulumi), add a provider config flag upsertExistingObjects that skips the pre-create existence check. Can be set via: - Provider config: upsertExistingObjects: true - Environment variable: PULUMI_K8S_UPSERT_EXISTING_OBJECTS=true Defaults to false — the safe behavior where creating a resource that already exists in the cluster returns an error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests against a live cluster that creating the built-in "default" namespace fails with an "already exists" error, preventing the silent upsert behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Does the PR have any schema changes?Looking good! No breaking changes found. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4270 +/- ##
==========================================
+ Coverage 45.74% 45.77% +0.02%
==========================================
Files 93 93
Lines 10363 10404 +41
==========================================
+ Hits 4741 4762 +21
- Misses 5211 5229 +18
- Partials 411 413 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
The "already exists" error also fires during create-before-delete replacements of explicitly-named resources, where the same data loss pattern applies. Mention deleteBeforeReplace as a resolution option. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/review-again |
There was a problem hiding this comment.
Two blocking issues found in provider/pkg/await/await.go:
-
Silent bypass of the existence guard on non-NotFound GET errors (line 208-211): Any GET error that isn't
404(e.g. a403from a service account with write-but-not-read RBAC, a transient503) is logged atV(1)and ignored, allowing the SSA upsert to proceed. This directly undermines the PR's goal — a routine RBAC split where the SA haspatchbut notgetmeans the guard is bypassed on every create, reproducing the data-loss scenario of #2926. -
Relinquish fires on any 403, not just protected resources (line 984-989): The
IsForbidden && ServerSideApplycondition catches all403responses — including a service account that simply has nodeleteRBAC permission. Any SSA stack where delete is denied for any reason will silently have its managed fields relinquished and the resource removed from state rather than surfacing the permission error. This is a silent state-corruption risk that affects all SSA users, not just those usingupsertExistingObjects.
Reviewed by Internal PR Re-Review (Slash Command)
To install this agentic workflow, run
gh aw add pulumi-labs/gh-aw-internal/.github/workflows/gh-aw-pr-rereview.md@734ef41746387a6818fd8ac3e619c9fd81ac6957
provider/pkg/await/await.go
Outdated
| if !apierrors.IsNotFound(getErr) { | ||
| logger.V(1).Infof("pre-create GET for %s/%s failed: %v", | ||
| c.Inputs.GetNamespace(), name, getErr) | ||
| } |
There was a problem hiding this comment.
Blocking: Non-404 GET errors silently bypass the existence guard
When the pre-create GET returns an error that is not NotFound (e.g. a 403 Forbidden because the service account has patch but not get, a 503, or a transient timeout), the code logs at V(1) and falls through — allowing the SSA Patch to proceed as if the object does not exist.
This directly undermines the PR's stated goal: a common RBAC configuration where the service account has write-but-not-read permission means every create silently bypasses the guard. If the cluster object actually exists in that scenario, the upsert happens undetected, and the subsequent delete causes the exact data loss described in #2926.
Consider returning the GET error (or a wrapped error surfacing that the check could not be completed) instead of silently proceeding:
if !apierrors.IsNotFound(getErr) {
return fmt.Errorf("pre-create existence check for %s/%s failed: %w — "+
"cannot confirm object does not exist; use upsertExistingObjects=true to skip this check",
c.Inputs.GetNamespace(), name, getErr)
}There was a problem hiding this comment.
Addressed.
The Relinquish fallback fired on any 403 Forbidden, not just protected resources — including RBAC denials and admission webhook rejections. This could silently corrupt state by removing resources the user intended to keep. Users who need to manage fields on protected resources (like kube-system) should use Patch resources, which already use Relinquish on delete. Users who opt into upsertExistingObjects accept the risks of delete failures on protected resources. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d355ea1 to
c8cfe1b
Compare
|
/review-again |
- Return error on non-NotFound GET failures instead of silently bypassing the existence guard. A 403 or 503 on GET now surfaces a clear error suggesting upsertExistingObjects as a workaround. - Add deleteBeforeReplace to TestPreviewReplacements Job, since create-before-delete with an explicit name hits the existence check. - Add upsertExistingObjects to TestServerSideApply provider, since the test explicitly exercises upsert behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
c8cfe1b to
e24bc47
Compare
|
/review-again |
7ec4304 to
74e8efb
Compare
|
/review-again |
|
/review-again |
| @@ -0,0 +1,11 @@ | |||
| name: existing-object | |||
| return enabled == trueStr | ||
| } | ||
| if enabled, exists := os.LookupEnv("PULUMI_K8S_UPSERT_EXISTING_OBJECTS"); exists { | ||
| return enabled == trueStr |
There was a problem hiding this comment.
(nitpick) If I understand this correctly, would using in bashexport PULUMI_K8S_UPSERT_EXISTING_OBJECTS=1 would fail this check since we are comparing exclusively for == true. Is that desirable? I would expect =1 to be a valid way to set it to true. What do you think?
There was a problem hiding this comment.
It's the way that the other environment variables are set up. :/ I think to change it we should change it everywhere.
There was a problem hiding this comment.
Code Review Summary
I reviewed the full diff. Below are the definite issues found, ordered by severity.
note This is not true, server-side apply defaults to true. The check hits the code path just fine.
Issue 1 — The new E2E test does not exercise the new SSA code path
File: tests/sdk/yaml/existing_object_test.go + tests/sdk/yaml/testdata/existing-object/Pulumi.yaml
Problem: The TestExistingObjectBlocksCreate E2E test is intended to verify that the new SSA blocking behavior works end-to-end. However, testdata/existing-object/Pulumi.yaml does not configure enableServerSideApply: true on the provider. Because serverSideApplyMode defaults to false, the test exercises the old non-SSA code path (client.Create() → API server returns AlreadyExists), not the new pre-create GET check added in await.go.
The gate condition in await.go is:
if c.ServerSideApply && !c.Preview && !c.UpsertExistingObjects && !kinds.IsPatchResource(...) {Without enableServerSideApply: true, c.ServerSideApply is false and the new check is never reached. The test passes, but it would continue passing even if the new SSA guard were completely removed. There is no E2E test that would catch a regression in the new code path.
Why it's definitely a real problem: The E2E test was added specifically to validate the new blocking behavior, but it silently validates existing (unrelated) behavior instead. Any regression in the new SSA guard would go undetected by E2E tests.
Fix: Add config: kubernetes:enableServerSideApply: "true" (or a provider resource with enableServerSideApply: true) to testdata/existing-object/Pulumi.yaml.
Issue 2 — Existing SSA E2E tests (step1/2/3) now mask regressions in normal SSA creates
Files:
tests/sdk/nodejs/server-side-apply/step1/index.tsline 27tests/sdk/nodejs/server-side-apply/step2/index.tsline 27tests/sdk/nodejs/server-side-apply/step3/index.tsline 27
Problem: All three steps now use upsertExistingObjects: true. Step 1's comment at line 3 explicitly states: "3. Upsert a Deployment resource that already exists." — the nginx-upsert resource (lines 130–147) intentionally creates a second Pulumi resource pointing at the same named cluster object as nginx.
With upsertExistingObjects: true, the pre-create existence check is bypassed for all SSA creates in these tests, including creates that are not intentional upserts. This means if the new existence check logic were accidentally inverted (blocking creates of non-existing objects, or incorrectly firing on auto-named resources), these tests would not catch it because the bypass is always active.
Why it's definitely a real problem: The three step tests are the primary integration tests for SSA create behavior. With the bypass on globally, they no longer cover the default (and new default-safe) code path. A logic error in the existence check that causes false positives for new objects would be invisible to these tests.
The correct fix would be to split the nginx-upsert scenario out into its own test that explicitly uses upsertExistingObjects: true, and restore the remaining tests to upsertExistingObjects: false (or omit the flag entirely since false is the default). This would mean the main step1/2/3 tests continue to exercise SSA creates against non-existing objects under the new default behavior.
Issue 3 — Existence check is inside the retry loop and runs on every retry
File: provider/pkg/await/await.go, lines 190–217
Problem: The pre-create GET check runs inside the SleepingRetry closure. On the very first iteration (i == 0), the client variable is nil and gets initialized. On the next retry (if one somehow occurs), client is non-nil so the check runs again. This is benign for the "already exists" case (the error is not IsNotFound or IsNoMatchError, so the retrier stops after the first failure and the check runs exactly once).
However, for the non-404, non-NoMatch error branch (lines 208–215), the error returned IS a plain fmt.Errorf — not IsNotFound, not IsNoMatchError — so the retrier also stops immediately. This is correct behavior.
The placement is safe for all currently covered cases, but it does mean that if client.Get fails transiently (e.g., a momentary network issue), the entire create is hard-failed with a message telling the user to set upsertExistingObjects=true. This is a definite behavioral regression: a transient GET failure is not the same as "cannot confirm the object does not exist," but the error message frames it that way and blocks the create.
Why it's definitely a real problem: A transient API server blip during the GET — which is not retried — permanently blocks the create with a misleading error message directing users toward a flag that bypasses the safety check. The GET should either be retried or its transient-failure error should be clearly distinguished from a permanent one.
go.sum — No suspicious entries
All additions to sdk/go.sum are /go.mod-only indirect dependency hashes or well-known packages (Azure SDK, AWS SDK, Google Cloud, HashiCorp Vault, etc.) consistent with a Pulumi SDK dependency update. No suspicious entries were found.
Reviewed by Internal PR Re-Review (Slash Command)
To install this agentic workflow, run
gh aw add pulumi-labs/gh-aw-internal/.github/workflows/gh-aw-pr-rereview.md@734ef41746387a6818fd8ac3e619c9fd81ac6957
74e8efb to
d37ae0a
Compare
During preview, the engine plans operations but doesn't execute them. A planned delete-before-create for a replacement won't have run yet, so the object is still in the cluster — making the existence check a false positive. Skip the GET when c.Preview is true. The check still runs on the actual up (and up --skip-preview), since Creation() is called for both phases. Tradeoff: preview no longer fails fast on the rename/upsert case. Users get the error on apply instead. Also revert the deleteBeforeReplace addition to TestPreviewReplacements testdata, since it's no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d37ae0a to
e554ce8
Compare
Use a separate provider with upsertExistingObjects for the nginx-upsert resource that intentionally targets an existing object. The main provider uses the default behavior so that other SSA creates in the test exercise the new existence check code path. Also remove debug panic from await.go. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/review-again |
There was a problem hiding this comment.
Review Summary
Three bugs found. One is blocking; two are non-blocking.
1. Blocking — GET errors silently bypass the existence check (provider/pkg/await/await.go, line 196)
The guard only acts on getErr == nil. Any error other than "object found" — including 403 Forbidden, network failures, and 503 ServiceUnavailable — is silently swallowed and the create proceeds as if the object does not exist.
A user whose RBAC allows create/patch but not get (a common least-privilege setup) will completely bypass this guard on every run. The original #2926 silent-upsert-then-delete bug is still fully reproducible for those users after this PR merges.
Fix: treat any non-NotFound GET error as a hard failure rather than a silent pass-through.
2. Non-blocking — PULUMI_K8S_UPSERT_EXISTING_OBJECTS=1 is silently ignored (provider/pkg/provider/provider.go, lines 633–641)
The env var is compared only to the literal string "true". Setting it to 1, TRUE, or yes (all standard Unix truthy values) leaves the flag disabled with no warning. This is load-bearing for an escape hatch: a user who sets the env var in a normal CI/shell style and sees the new blocking error will have no idea the flag is being ignored.
Fix: use strconv.ParseBool, or at minimum document in the schema description and release notes that only the exact string "true" is accepted.
3. Non-blocking — generateName resources silently skip the existence check (provider/pkg/await/await.go, line 195)
The if name := c.Inputs.GetName(); name != "" guard skips the check for any resource without an explicit name. For purely generated names this is harmless. However, there is no assertion that generateName is actually set in the empty-name case, and the code path is undocumented. A future refactor or an unusual resource shape could reach this branch unintentionally and silently re-introduce the #2926 bug.
Fix: add a comment explaining the deliberate skip and, optionally, assert that GetGenerateName() != "" when the name is empty.
Other focus areas (no issues found)
- Flag propagation:
UpsertExistingObjectsis correctly set onCreateConfig(line 1740),UpdateConfig(line 2249), andDeleteConfig(line 2417). - Schema type:
upsertExistingObjectsis typed asbooleanand the description accurately describes the opt-in nature and the #2926 scenario. TheDefaultInfo.Environmententry correctly points toPULUMI_K8S_UPSERT_EXISTING_OBJECTS. No schema inconsistencies found. - Security (resource existence exposure): The GET is performed with the provider's own credentials, so it does not expose resource existence to end-users beyond what their RBAC already grants. No additional attack surface introduced.
Reviewed by Internal PR Re-Review (Slash Command)
To install this agentic workflow, run
gh aw add pulumi-labs/gh-aw-internal/.github/workflows/gh-aw-pr-rereview.md@734ef41746387a6818fd8ac3e619c9fd81ac6957
| _, getErr := client.Get(c.Context, name, metav1.GetOptions{}) | ||
| if getErr == nil { | ||
| msg := fmt.Sprintf( | ||
| "resource %s/%s (%s) already exists in the cluster. "+ | ||
| "To resolve this, you can:\n"+ | ||
| " 1. Use aliases if renaming a Pulumi resource.\n"+ | ||
| " 2. Use deleteBeforeReplace if replacing a resource with an explicit name.\n"+ | ||
| " 3. Use a Patch resource to manage specific fields on existing objects.\n"+ | ||
| " 4. Set upsertExistingObjects on the provider to update existing objects.", | ||
| c.Inputs.GetNamespace(), name, c.Inputs.GetKind(), | ||
| ) | ||
| if c.Preview { | ||
| _ = c.Host.Log(c.Context, diag.Warning, c.URN, msg) | ||
| } else { | ||
| return fmt.Errorf("%s", msg) | ||
| } | ||
| } |
There was a problem hiding this comment.
Bug (blocking): GET errors other than "not found" silently bypass the existence check
The existence check only acts when getErr == nil (object found). Any other error — Forbidden, network timeout, ServiceUnavailable, etc. — is silently ignored and the create proceeds as if the object does not exist.
_, getErr := client.Get(c.Context, name, metav1.GetOptions{})
if getErr == nil {
// block/warn
}
// getErr != nil → falls through, create proceedsA user whose RBAC permits create/patch but not get (a common least-privilege pattern) will bypass this guard entirely on every run. The original #2926 bug — silent upsert followed by destructive delete — will still occur for those users even after this PR.
The fix is to propagate non-NotFound errors rather than silently proceeding:
_, getErr := client.Get(c.Context, name, metav1.GetOptions{})
if getErr == nil {
// existing block/warn logic
} else if !apierrors.IsNotFound(getErr) {
return fmt.Errorf("failed to check whether resource %s/%s (%s) already exists: %w",
c.Inputs.GetNamespace(), name, c.Inputs.GetKind(), getErr)
}| upsertExistingObjects := func() bool { | ||
| if enabled, exists := vars["kubernetes:config:upsertExistingObjects"]; exists { | ||
| return enabled == trueStr | ||
| } | ||
| if enabled, exists := os.LookupEnv("PULUMI_K8S_UPSERT_EXISTING_OBJECTS"); exists { | ||
| return enabled == trueStr |
There was a problem hiding this comment.
Bug (non-blocking): PULUMI_K8S_UPSERT_EXISTING_OBJECTS=1 does not enable the flag
The env var is compared only against the string literal "true":
upsertExistingObjects := func() bool {
if enabled, exists := vars["kubernetes:config:upsertExistingObjects"]; exists {
return enabled == trueStr // trueStr = "true"
}
if enabled, exists := os.LookupEnv("PULUMI_K8S_UPSERT_EXISTING_OBJECTS"); exists {
return enabled == trueStr // only "true" accepted
}
return false
}Standard Unix convention (and the Pulumi docs for other env vars) treats 1, yes, TRUE, and true as truthy. A user who sets PULUMI_K8S_UPSERT_EXISTING_OBJECTS=1 (following normal shell/CI conventions) gets the new blocking behavior with no indication why the flag appears to be ignored.
This is consistent with how all other PULUMI_K8S_* env vars are handled in this file, so it is a pre-existing pattern — but it is newly load-bearing here because the flag's only purpose is to opt out of a breaking change. Consider using strconv.ParseBool so that 1, TRUE, true, yes all work, or at minimum document in the schema description that only "true" is accepted.
| // prevents silent upsert of existing objects, which can lead | ||
| // to data loss when the engine later deletes a resource that | ||
| // points to the same cluster object (see #2926, #2948). | ||
| if name := c.Inputs.GetName(); name != "" { |
There was a problem hiding this comment.
Bug (non-blocking): Resources using generateName silently skip the existence check
The guard if name := c.Inputs.GetName(); name != "" means any resource that uses generateName (no explicit metadata.name) bypasses the check entirely. This is intentional and correct for truly new generated names, but the comment does not document this assumption, and the condition does not distinguish between "name is empty because generateName is set" and "name is empty for some other reason."
More importantly: a Pulumi resource that is renamed in the program may have been tracked under an explicit name in state but arrive here without a name in c.Inputs if the name was computed. In that edge case the existence guard silently does nothing and the original #2926 silent-upsert-then-delete bug resurfaces.
This warrants a comment explaining the deliberate skip, and possibly an assertion that generateName is set when GetName() is empty.
| _, getErr := client.Get(c.Context, name, metav1.GetOptions{}) | ||
| if getErr == nil { | ||
| msg := fmt.Sprintf( | ||
| "resource %s/%s (%s) already exists in the cluster. "+ | ||
| "To resolve this, you can:\n"+ | ||
| " 1. Use aliases if renaming a Pulumi resource.\n"+ | ||
| " 2. Use deleteBeforeReplace if replacing a resource with an explicit name.\n"+ | ||
| " 3. Use a Patch resource to manage specific fields on existing objects.\n"+ | ||
| " 4. Set upsertExistingObjects on the provider to update existing objects.", | ||
| c.Inputs.GetNamespace(), name, c.Inputs.GetKind(), | ||
| ) | ||
| if c.Preview { | ||
| _ = c.Host.Log(c.Context, diag.Warning, c.URN, msg) | ||
| } else { | ||
| return fmt.Errorf("%s", msg) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Blocking: non-404 GET errors silently bypass the existence guard
The check at line 197 only acts when getErr == nil (confirmed present). Any non-nil error — including 403 Forbidden (operator lacks get but has create/patch), 503 Service Unavailable, or a transient network failure — falls through and proceeds to the SSA Patch with no error or log.
This permanently disables the guard for any least-privilege setup that allows create/patch but not get on the resource — the exact cluster posture this PR is meant to protect.
Suggested fix: only treat a confirmed IsNotFound as safe-to-create; return an error on anything else:
_, getErr := client.Get(c.Context, name, metav1.GetOptions{})
if getErr == nil {
// object exists — warn or fail
...
} else if !apierrors.IsNotFound(getErr) {
return fmt.Errorf("failed to check whether %s/%s already exists: %w",
c.Inputs.GetNamespace(), name, getErr)
}
// only reach here if IsNotFound(getErr) — safe to createContext:
pulumi-kubernetes/provider/pkg/await/await.go
Lines 195 to 213 in 029b0b2
Note: this was raised in a previous review thread (marked "Addressed" by the author) but the if getErr == nil form is still present in the current code at HEAD 029b0b2d5. The unit tests in TestCreationSSAExistingObject do not include a case where the GET itself returns a non-404 error, so the regression is not caught by the test suite.
This pull request addresses the issues laid out in #2926.
Specifically, here we are approaching #2926 as a bug. When a user attempts to delete an object that is in Pulumi state, such a delete should succeed. Because Create defaults to using server-side apply, we can wind up with undeleteable objects in Pulumi state. I would argue that this is a bug.
Therefore, this PR does the following:
Add a GET call before each Create to assert that an object already exists. This is perhaps the most controversial choice since it adds a full ; however it allows us to keep using Server-Side Apply on Create, rather than reverting to client-side apply, which I understand has some migration issues wrt field managers.
Fail Create when an object already exists. The user is suggested three options:
DeleteBeforeReplaceresource option for renamesWe don't tell the user to explicitly put the object into state via Import, because that will result in Deletes hitting the Forbidden erorr.
Relinquish on Forbidden delete: When the API server forbids deletion (e.g. protected namespaces likeThis was too broad. Users should be directed towards patch resources, which can be safely deleted.kube-system), fall back to relinquishing managed fields instead of failing. This is for situations where a user explicitly imported a protected object.Also adds
upsertExistingObjectsprovider config: Users have reported they like this behavior for bulk import purposes. Therefore, we also provide an opt-in flag for those users.Fixes #2926
Refs #2948
Test plan
upsertExistingObjectsskips check)defaultnamespace fails with "already exists" error🤖 Generated with Claude Code