PB-1916 Add opt-in promised failure declaration to bktec run #539
Draft
123sarahj123 wants to merge 1 commit into
When `BUILDKITE_TEST_ENGINE_PROMISE_FAILURE=true`, bktec declares an
early ("promised") failure to the Buildkite Agent API once retries are
exhausted and hard (non-muted) failures remain. This lets a build cascade
to `failing` before the job actually exits, enabling fast feedback and
sibling-job cancellation (Job Early Failure Detection, PB-1916).
- Add internal/agent package with PromiseFailure, which PUTs to
{BUILDKITE_AGENT_ENDPOINT}/jobs/{BUILDKITE_JOB_ID}/promise_failure
authenticated with the agent access token. Kept separate from the
Test Engine api.Client, which targets a different service and token.
- Add config + flags: BUILDKITE_TEST_ENGINE_PROMISE_FAILURE (opt-in,
default off), plus BUILDKITE_AGENT_ENDPOINT / BUILDKITE_AGENT_ACCESS_TOKEN.
- Call it from command.Run after retries are exhausted, gated on
FailedTests() so muted-only runs never promise. Best-effort: a promise
error is logged and never changes the run's real exit status.
Verified end-to-end against a real test org: the promise is recorded
("Declared Early Failure") and the build cascades to failing.
Add opt-in promised failure declaration to bktec
What
When `BUILDKITE_TEST_ENGINE_PROMISE_FAILURE=true`, bktec declares an early
("promised") failure to the Buildkite Agent API once it has exhausted all
configured retries and still has hard (non-muted) test failures. This lets
a running job tell Buildkite "I'm going to fail" before it actually exits,
enabling the build to cascade to `failing` early.
Implements PB-1916 (the bktec dogfooding path for Job Early Failure Detection).
Why bktec
A first failed test execution ≠ a failed job. RSpec under bktec retries flaky
tests in-job (`BUILDKITE_TEST_ENGINE_RETRY_COUNT`), and muted-test failures
don't fail the job. Any signal based on the raw first failure would produce
false promises. bktec is the only layer that knows both that retries are
exhausted and which failures are muted, so it's the correct place to
declare the promise.
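The gating rule argued for here can be sketched in a few lines of Go. This is only an illustration of the reasoning, not the code in this PR: `testOutcome`, `runResult`, `shouldPromise`, and `promiseReason` are hypothetical stand-ins, and the reason format is simply modelled on the string seen in the dogfood timeline.

```go
package main

import "fmt"

// testOutcome is a hypothetical record of one test's final state after all retries.
type testOutcome struct {
	Name   string
	Failed bool
	Muted  bool
}

// runResult is a hypothetical stand-in for bktec's run result type.
type runResult struct {
	tests []testOutcome
}

// FailedTests returns hard failures: tests that failed and are not muted.
func (r runResult) FailedTests() []testOutcome {
	var out []testOutcome
	for _, t := range r.tests {
		if t.Failed && !t.Muted {
			out = append(out, t)
		}
	}
	return out
}

// FailedMutedTests returns failures that are muted and so do not fail the job.
func (r runResult) FailedMutedTests() []testOutcome {
	var out []testOutcome
	for _, t := range r.tests {
		if t.Failed && t.Muted {
			out = append(out, t)
		}
	}
	return out
}

// shouldPromise reports whether an early failure may be declared: the feature
// must be opted in, retries must be exhausted, and at least one hard failure
// must remain. Muted-only runs never qualify.
func shouldPromise(enabled, retriesExhausted bool, r runResult) bool {
	return enabled && retriesExhausted && len(r.FailedTests()) > 0
}

// promiseReason formats a reason string in the style shown in the job timeline
// (assumed format).
func promiseReason(hardFailures int) string {
	return fmt.Sprintf("test_failure (%d failed after retries)", hardFailures)
}
```

A first-attempt failure alone never satisfies `shouldPromise`, because `retriesExhausted` is false until the retry loop has finished.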
How
- Adds an `internal/agent` package with `PromiseFailure(...)`, which PUTs to
`{BUILDKITE_AGENT_ENDPOINT}/jobs/{BUILDKITE_JOB_ID}/promise_failure` with
`Authorization: Token {BUILDKITE_AGENT_ACCESS_TOKEN}` and body
`{"exit_status":1,"reason":"..."}`.
- `promiseFailureIfNeeded` is called from `Run()` after the retry loop returns
and after the signal-abort check. It only fires on `runResult.FailedTests()`
(hard failures), never on `FailedMutedTests()`.
- Opt-in via the `BUILDKITE_TEST_ENGINE_PROMISE_FAILURE` flag (default off).
- Best-effort: a promise error is logged, and exit-status semantics are unchanged.
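For readers who want a concrete picture of the request described above, here is a minimal sketch in Go. It is not the code in this PR: the signature, the `newPromiseRequest` helper, the `promiseBody` type, the trailing-slash trimming, and the `Content-Type` header are all assumptions made for illustration. Only the method, path, `Authorization` scheme, and body fields come from the description.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// promiseBody mirrors the JSON body described above (hypothetical type name).
type promiseBody struct {
	ExitStatus int    `json:"exit_status"`
	Reason     string `json:"reason"`
}

// newPromiseRequest builds the PUT request without sending it.
// Hypothetical helper; splitting it out keeps the request shape easy to test.
func newPromiseRequest(endpoint, jobID, token string, exitStatus int, reason string) (*http.Request, error) {
	if endpoint == "" || jobID == "" || token == "" {
		return nil, fmt.Errorf("endpoint, job ID, and token are all required")
	}
	payload, err := json.Marshal(promiseBody{ExitStatus: exitStatus, Reason: reason})
	if err != nil {
		return nil, err
	}
	// Assumption: tolerate a trailing slash on the configured endpoint.
	url := strings.TrimRight(endpoint, "/") + "/jobs/" + jobID + "/promise_failure"
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Token "+token)
	req.Header.Set("Content-Type", "application/json") // assumed, not stated in the PR
	return req, nil
}

// PromiseFailure sends the request and treats any non-2xx response as an error.
// The caller is expected to log the error and carry on (best-effort).
func PromiseFailure(ctx context.Context, client *http.Client, endpoint, jobID, token string, exitStatus int, reason string) error {
	req, err := newPromiseRequest(endpoint, jobID, token, exitStatus, reason)
	if err != nil {
		return err
	}
	resp, err := client.Do(req.WithContext(ctx))
	if err != nil {
		return fmt.Errorf("promise failure request: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("promise failure: unexpected status %s", resp.Status)
	}
	return nil
}
```

Because the call is best-effort, a caller along these lines would log a non-nil error and leave the run's exit status untouched.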
Uses the Agent API (not the existing `internal/api.Client`, which targets the
Test Engine API with a different token). Calls the HTTP endpoint directly
because the `buildkite-agent` CLI promise command isn't shipped yet.
Testing
- `internal/agent/agent_test.go`: request shape, non-2xx error, arg validation.
- `internal/command/promise_test.go`: flag on/off, no-failures, hard-failures
promise, muted-only no-promise, and best-effort swallow on agent error.
Dogfood proof (end-to-end, real org)
Ran against `sarahs-test-org` with a local agent running the locally-built
bktec binary, suite `job-early-failure-declaration-dogfood`, using the
`buildkite/rspec-junit-example` repo (2 passing / 2 failing specs). The job
timeline showed "Declared Early Failure" (promised exit status `1`, reason
`test_failure (2 failed after retries)`), and the build cascaded to `failing`,
cancelling the sibling job (`cancel_on_build_failing: true`).
Caveat: the example repo's old RSpec setup crashes on the retry pass
(malformed retry JSON). This is a demo-repo quirk and does not affect the
promise path: the promise still fired correctly from the remaining hard
failures.