
Commit 8723547

Clara Rull authored and meta-codesync[bot] committed
Handle ZCONNECTIONLOSS with exponential backoff
Summary:
Zeus ZCONNECTIONLOSS errors are used for loadshedding (see https://fb.workplace.com/groups/zeus.users/permalink/31336632699291946/). Currently these fall through to `InternalError::Other`, which uses quadratic backoff starting at 100ms, so the worker crashes after 5 consecutive failures in ~3 seconds. This is too aggressive and doesn't give Zeus enough breathing room to recover.

This diff adds a dedicated `TransientZeusError` variant to `InternalError` that matches `ZCONNECTIONLOSS` and applies exponential backoff starting at 500ms (500ms, 1s, 2s, 4s, then crash). This extends the total retry window from ~3s to ~7.5s before the worker crashes.

This is a mitigation while we request more Zelos capacity.

Alert: https://fburl.com/onedetection/yrvpdgcc

Reviewed By: YousefSalama

Differential Revision: D99992330

fbshipit-source-id: 28193c9f6744579b89d38e38c59c961c9050513e
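The retry windows quoted in the summary can be checked with a short sketch. The retry code itself is not part of this diff, so the function names and the exact formulas below are assumptions: quadratic backoff is taken to be `base * attempt^2` and exponential backoff `base * 2^(attempt - 1)`, with four sleeps between five consecutive failures. Under those assumptions the totals come out to the ~3s and ~7.5s figures stated above.

```rust
use std::time::Duration;

/// Hypothetical quadratic backoff: the delay before retry `attempt` (1-based)
/// is `base * attempt^2`. This formula is an assumption, not taken from the diff.
fn quadratic_backoff(base: Duration, attempt: u32) -> Duration {
    base * attempt.pow(2)
}

/// Hypothetical exponential backoff: the delay before retry `attempt` (1-based)
/// is `base * 2^(attempt - 1)`, i.e. it doubles on every retry.
fn exponential_backoff(base: Duration, attempt: u32) -> Duration {
    base * 2u32.pow(attempt.saturating_sub(1))
}

/// Total time spent sleeping across `retries` retries for a given delay policy.
fn total_window(retries: u32, delay: impl Fn(u32) -> Duration) -> Duration {
    (1..=retries).map(delay).sum()
}
```

For example, `total_window(4, |n| exponential_backoff(Duration::from_millis(500), n))` sums the 500ms, 1s, 2s, and 4s delays listed in the summary.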
1 parent 5b4eb52 commit 8723547

1 file changed, with 6 additions and 0 deletions.


eden/mononoke/repo_attributes/repo_derivation_queues/src/errors.rs

Lines changed: 6 additions & 0 deletions
@@ -31,6 +31,8 @@ pub enum InternalError {
     ItemDeleted(String),
     #[error("Attepmt to create Derivation Item with dependency on itself {0:#?}")]
     CircularDependency(DagItemId),
+    #[error("Transient Zeus connection error: {0}")]
+    TransientZeusError(String),
     #[error(transparent)]
     Other(#[from] anyhow::Error),
 }
@@ -55,6 +57,10 @@ impl From<zeus_client::ZeusError> for InternalError {
                 message: msg,
                 exception_type: ZelosExceptionType::ZNONODE,
             } => InternalError::ItemDeleted(msg),
+            zeus_client::ZeusError::RuntimeError {
+                message: msg,
+                exception_type: ZelosExceptionType::ZCONNECTIONLOSS,
+            } => InternalError::TransientZeusError(msg),
             _ => InternalError::Other(e.into()),
         }
     }
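
The mapping added in this hunk can be tried in isolation with a minimal sketch. The types below are simplified, hypothetical stand-ins for `zeus_client::ZeusError`, `ZelosExceptionType`, and `InternalError` (the real definitions are not shown in this diff, and the `Unrecognized` variant and `String`-based `Other` are invented for illustration). It only demonstrates that a `ZCONNECTIONLOSS` runtime error now gets its own variant instead of falling through to the catch-all arm.

```rust
// Hypothetical stand-in for ZelosExceptionType; only two names come from the diff.
#[derive(Debug, PartialEq)]
enum ExceptionType {
    ZNONODE,
    ZCONNECTIONLOSS,
    Unrecognized, // invented placeholder for every other exception type
}

// Hypothetical stand-in for zeus_client::ZeusError.
#[derive(Debug)]
enum ZeusError {
    RuntimeError {
        message: String,
        exception_type: ExceptionType,
    },
}

// Simplified stand-in for InternalError; `Other` holds a String here
// instead of an anyhow::Error to keep the sketch dependency-free.
#[derive(Debug, PartialEq)]
enum InternalError {
    ItemDeleted(String),
    TransientZeusError(String),
    Other(String),
}

impl From<ZeusError> for InternalError {
    fn from(e: ZeusError) -> Self {
        match e {
            ZeusError::RuntimeError {
                message,
                exception_type: ExceptionType::ZNONODE,
            } => InternalError::ItemDeleted(message),
            // The new arm: connection loss is classified as transient.
            ZeusError::RuntimeError {
                message,
                exception_type: ExceptionType::ZCONNECTIONLOSS,
            } => InternalError::TransientZeusError(message),
            // Everything else still falls through to the catch-all.
            other => InternalError::Other(format!("{:?}", other)),
        }
    }
}
```

A caller that picks a backoff policy per variant could then match on `InternalError::TransientZeusError` separately from `InternalError::Other`.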
