Skip to content

fix: retry for database transaction errors. Fixes #16101#16102

Open
isubasinghe wants to merge 1 commit into
argoproj:mainfrom
pipekit:fix-db-sync-retry
Open

fix: retry for database transaction errors. Fixes #16101#16102
isubasinghe wants to merge 1 commit into
argoproj:mainfrom
pipekit:fix-db-sync-retry

Conversation

@isubasinghe
Copy link
Copy Markdown
Member

@isubasinghe isubasinghe commented May 15, 2026

Fixes #16101

Motivation

Workflows using database-backed mutexes or semaphores fail permanently on PostgreSQL SQLSTATE 40001 during lock acquisition. 40001 is a normal serializable-isolation
conflict; the retry loop in TryAcquire is built for it but never fires.

databaseSemaphore.acquire drops the SQL error and returns plain false. tryAcquireImpl wraps the bare failure as bug: failed to acquire something that should have been checked: (empty payload). The retry detector substring-matches serialization/dependencies/deadlock/rollback against that string, finds nothing, and gives
up after one attempt.

Modifications

  • acquire returns (bool, error); tryAcquire returns (bool, string, error). SQL error propagates.
  • tryAcquireImpl returns the underlying error directly. The bug: wrap stays for its original case: (false, nil) after checkAcquire approved.
  • Retry loop rewritten with retry.OnError and wait.Backoff{Duration: 10ms, Factor: 2, Steps: 5, Jitter: 0.5, Cap: 600ms}. Old loop had zero delay; jitter avoids
    replica lockstep.
  • Substring check extracted into isRetryableSyncError.
  • prioritySemaphore and the semaphore interface updated to match. In-memory acquire returns nil error. Initialize callers still drop the error — pre-existing,
    separate change.

Verification

Test pass

Documentation

None needed.

AI

Claude Code (Claude Opus 4.7) used for investigation, tests, fix, and this description. Reviewed by hand; Cap raised from 200ms to 600ms during review.

Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe marked this pull request as ready for review May 15, 2026 06:40
Comment thread workflow/sync/semaphore.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Database sync: transient SQLSTATE 40001 misclassified as "bug"

1 participant