PB-1916 Add opt-in promised failure declaration to bktec run #539

Draft
123sarahj123 wants to merge 1 commit into main from pb-1916-implement-bktec-dogfooding-path

123sarahj123 commented Jun 10, 2026

Add opt-in promised failure declaration to bktec

What

When BUILDKITE_TEST_ENGINE_PROMISE_FAILURE=true, bktec declares an early
("promised") failure to the Buildkite Agent API once it has exhausted all
configured retries and still has hard (non-muted) test failures. This lets
a running job tell Buildkite "I'm going to fail" before it actually exits,
so the build can cascade to the failing state early.

Implements PB-1916 (the bktec dogfooding path for Job Early Failure Detection).

Why bktec

A first failed test execution ≠ a failed job. RSpec under bktec retries flaky
tests in-job (BUILDKITE_TEST_ENGINE_RETRY_COUNT), and muted-test failures
don't fail the job. Any signal based on the raw first failure would produce
false promises. bktec is the only layer that knows both that retries are
exhausted and which failures are muted, so it's the correct place to
declare the promise.

How

  • New internal/agent package with PromiseFailure(...), which PUTs to
    {BUILDKITE_AGENT_ENDPOINT}/jobs/{BUILDKITE_JOB_ID}/promise_failure with
    Authorization: Token {BUILDKITE_AGENT_ACCESS_TOKEN} and body
    {"exit_status":1,"reason":"..."}.
  • promiseFailureIfNeeded is called from Run() after the retry loop returns
    and after the signal-abort check. It only fires on runResult.FailedTests()
    (hard failures), never on FailedMutedTests().
  • Gated behind the opt-in BUILDKITE_TEST_ENGINE_PROMISE_FAILURE flag
    (default off).
  • Best-effort: if the promise call fails, we log the error and continue;
    bktec's real exit-status semantics are unchanged.

Uses the Agent API (not the existing internal/api.Client, which targets the
Test Engine API with a different token). Calls the HTTP endpoint directly
because the buildkite-agent CLI promise command isn't shipped yet.
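For reviewers who want to picture the request shape, here is a minimal sketch of how the PUT described above could be assembled. It is not the code in this PR: `newPromiseRequest`, `promiseBody`, the `Content-Type` header, and the example endpoint URL are all assumptions made for illustration. Only the path, the `Authorization: Token ...` header, and the JSON field names come from the description.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// promiseBody mirrors the JSON body quoted in the description:
// {"exit_status":1,"reason":"..."}.
type promiseBody struct {
	ExitStatus int    `json:"exit_status"`
	Reason     string `json:"reason"`
}

// newPromiseRequest is a hypothetical helper (not the PR's PromiseFailure).
// It builds the PUT request without sending it, so the shape is easy to inspect.
func newPromiseRequest(endpoint, jobID, token, reason string) (*http.Request, error) {
	// Argument validation: all three values are needed to address and authenticate the call.
	if endpoint == "" || jobID == "" || token == "" {
		return nil, fmt.Errorf("endpoint, job ID and access token are all required")
	}
	payload, err := json.Marshal(promiseBody{ExitStatus: 1, Reason: reason})
	if err != nil {
		return nil, err
	}
	// {BUILDKITE_AGENT_ENDPOINT}/jobs/{BUILDKITE_JOB_ID}/promise_failure
	url := strings.TrimRight(endpoint, "/") + "/jobs/" + jobID + "/promise_failure"
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Token "+token)
	// Assumption: the endpoint expects a JSON content type.
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The endpoint, job ID, and token below are placeholders, not real values.
	req, err := newPromiseRequest("https://agent.example.test/v3", "job-123", "secret", "test_failure")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Building the request separately from sending it keeps the shape testable without a live agent endpoint; a caller would pass the result to an `http.Client` and treat any non-2xx status as an error.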

Testing

  • internal/agent/agent_test.go: request shape, non-2xx error, arg validation.
  • internal/command/promise_test.go: flag on/off, no-failures, hard-failures
    promise, muted-only no-promise, and best-effort swallow on agent error.

Dogfood proof (end-to-end, real org)

Ran against sarahs-test-org with a local agent running the locally-built
bktec binary, suite job-early-failure-declaration-dogfood, using the
buildkite/rspec-junit-example repo (2 passing / 2 failing specs). The job
timeline showed Declared Early Failure — promised exit status 1, reason
test_failure (2 failed after retries) — and the build cascaded to failing,
cancelling the sibling job (cancel_on_build_failing: true).

Caveat: the example repo's old RSpec setup crashes on the retry pass
(malformed retry JSON). This is a demo-repo quirk that does not affect the
promise path: the promise still fired correctly from the remaining hard
failures.

When `BUILDKITE_TEST_ENGINE_PROMISE_FAILURE=true`, bktec declares an
early ("promised") failure to the Buildkite Agent API once retries are
exhausted and hard (non-muted) failures remain. This lets a build cascade
to `failing` before the job actually exits, enabling fast feedback and
sibling-job cancellation (Job Early Failure Detection, PB-1916).

- Add internal/agent package with PromiseFailure, which PUTs to
  {BUILDKITE_AGENT_ENDPOINT}/jobs/{BUILDKITE_JOB_ID}/promise_failure
  authenticated with the agent access token. Kept separate from the
  Test Engine api.Client, which targets a different service and token.
- Add config + flags: BUILDKITE_TEST_ENGINE_PROMISE_FAILURE (opt-in,
  default off), plus BUILDKITE_AGENT_ENDPOINT / BUILDKITE_AGENT_ACCESS_TOKEN.
- Call it from command.Run after retries are exhausted, gated on
  FailedTests() so muted-only runs never promise. Best-effort: a promise
  error is logged and never changes the run's real exit status.

Verified end-to-end against a real test org: the promise is recorded
("Declared Early Failure") and the build cascades to failing.

123sarahj123 changed the title from "Add opt-in promised failure declaration to bktec run" to "PB-1916 Add opt-in promised failure declaration to bktec run" on Jun 10, 2026