Route SP check jobs to workers by SP region (multi-region routing) #479

@SgtPooki

Summary

Once dealbot is deployed multi-region (per FilOzone/infra#69 — workers in nbg1, ash, sin), all worker pods will pull from the same pg-boss queue. Without job routing by SP region, an APAC SP's job can still be picked up by an EU worker, so the multi-region deployment alone does not fix APAC reachability failures.

Background

Investigation in #478 showed APAC SPs (e.g. ruka-main, superusey-calib) fail dealbot checks from the current single-region (nbg1) egress due to long-haul TCP/TLS issues, not SP-side bugs. Tippy confirmed those SPs are healthy from APAC clients.

FilOzone/infra#69 adds worker pods in 3 regions and a `region` metric label, but does not introduce any routing layer. Job-to-worker assignment remains effectively random, because every worker pulls from the same shared queue.

Acceptance criteria

  • Provider config carries a region tag (e.g. `spRegion: 'EU' | 'NA' | 'APAC'`) — manual tagging is acceptable initially
  • Jobs for an SP are dispatched to a queue scoped to that SP's region
  • Each worker pod subscribes only to the queue matching its own deployment region (env var driven)
  • Jobs for SPs without a region tag fall back to a default queue / region (probably EU)
  • `dataSetCreationStatus` and other check metrics, broken down by the `region` label, show APAC SPs succeeding when handled by the sin worker
  • Documented runbook for adding new regions or moving SPs between regions
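The region tag and the default-region fallback from the criteria above could look roughly like the sketch below. Only `spRegion` and its three values come from this issue; the `ProviderConfig` shape, the `address` field, and the `resolveRegion` helper are hypothetical names used for illustration.

```typescript
// Region tags as suggested in this issue.
type SpRegion = 'EU' | 'NA' | 'APAC';

// Fallback for untagged SPs; EU is the "probably EU" default from the criteria.
const DEFAULT_REGION: SpRegion = 'EU';

// Hypothetical provider config shape; the real model may differ.
interface ProviderConfig {
  address: string;
  spRegion?: SpRegion; // manually tagged, optional at first
}

// Returns the SP's region, or the default when no tag is set.
function resolveRegion(provider: ProviderConfig): SpRegion {
  return provider.spRegion ?? DEFAULT_REGION;
}
```

Keeping the fallback in one helper means that changing the default region later is a one-line change.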

Suggested approach (lightest)

  1. Add `spRegion` to provider model / config.
  2. In `apps/backend/src/jobs/jobs.service.ts`, when enqueuing a job for an SP, derive queue name from SP region (e.g. `sp-jobs-eu`, `sp-jobs-na`, `sp-jobs-apac`).
  3. Each worker subscribes to a single queue selected via the `DEALBOT_REGION` env var injected by the deployment (already varies per region in the infra PR).
  4. Default-route untagged SPs to the EU queue to preserve current behavior.
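A minimal sketch of steps 2 to 4, under some assumptions: the queue names are the examples from step 2, `DEALBOT_REGION` is assumed to hold one of the region tags (if the infra PR injects datacenter codes such as `nbg1`, a mapping step would be needed), and `QueueClient` is a hypothetical stand-in for the real queue client, not the actual pg-boss API.

```typescript
type SpRegion = 'EU' | 'NA' | 'APAC';

// Queue names taken from the examples in step 2.
const QUEUE_BY_REGION: Record<SpRegion, string> = {
  EU: 'sp-jobs-eu',
  NA: 'sp-jobs-na',
  APAC: 'sp-jobs-apac',
};

function isSpRegion(value: string): value is SpRegion {
  return Object.prototype.hasOwnProperty.call(QUEUE_BY_REGION, value);
}

// Enqueue side: untagged SPs go to the EU queue (step 4).
function queueForSp(spRegion?: SpRegion): string {
  return QUEUE_BY_REGION[spRegion ?? 'EU'];
}

// Worker side: fail fast on a missing or unknown region instead of
// silently subscribing to the wrong queue.
function parseWorkerRegion(raw: string | undefined): SpRegion {
  const value = (raw ?? '').trim().toUpperCase();
  if (isSpRegion(value)) return value;
  throw new Error(`DEALBOT_REGION must be one of EU, NA, APAC; got "${raw ?? ''}"`);
}

// Hypothetical minimal queue client interface (assumption, for illustration).
interface QueueClient {
  work(queue: string, handler: (job: { data: unknown }) => Promise<void>): Promise<unknown>;
}

// Subscribes the worker to exactly one regional queue and returns its name.
async function startRegionalWorker(
  client: QueueClient,
  env: Record<string, string | undefined>,
  handler: (job: { data: unknown }) => Promise<void>,
): Promise<string> {
  const queue = QUEUE_BY_REGION[parseWorkerRegion(env.DEALBOT_REGION)];
  await client.work(queue, handler);
  return queue;
}
```

Throwing on an invalid `DEALBOT_REGION` is a design choice in this sketch: a misconfigured pod crashes at startup rather than quietly processing another region's jobs.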

Out of scope

  • Auto-detecting SP region from chain-registered metadata (could be future work — start manual)
  • Queue rebalancing across regions when one is down (start with strict region routing; revisit if it becomes a reliability problem)

Related
