Skip to content

Fix false-positive success in remediateSupplier#243

Merged
jorgecuesta merged 2 commits intostagingfrom
fix/remediate-supplier-error-handling
Mar 26, 2026
Merged

Fix false-positive success in remediateSupplier#243
jorgecuesta merged 2 commits intostagingfrom
fix/remediate-supplier-error-handling

Conversation

@Alann27
Copy link
Copy Markdown
Collaborator

@Alann27 Alann27 commented Mar 24, 2026

The remediateSupplier activity was returning { success: false, ... } objects on failure instead of throwing errors. Because Temporal only retries activities that throw, these failures were silently treated as successful completions — no retries, no workflow-level error tracking.

This PR replaces all { success: false } returns with appropriate ApplicationFailure throws:

  • Non-retryable (ApplicationFailure.nonRetryable) for data/config issues that won't resolve on retry:

    • Key not found
    • Address group not found
    • Empty services config
    • Stake transaction failed (already persisted as RemediationFailed in DB)
  • Retryable (ApplicationFailure.retryable) for transient errors that may succeed on retry:

    • Supplier not found (timing issue, may appear on next attempt)
    • DB update failures (clearing OwnerInitialStake or post-stake update)

Error messages now include the supplier address, key state, and transaction result details for observability.

@Alann27 Alann27 requested a review from jorgecuesta March 24, 2026 16:43
@Alann27 Alann27 self-assigned this Mar 24, 2026
@Alann27 Alann27 added the enhancement New feature or request label Mar 24, 2026
@jorgecuesta jorgecuesta force-pushed the fix/remediate-supplier-error-handling branch from 033ddeb to 529fc69 Compare March 24, 2026 21:11
@Alann27 Alann27 force-pushed the fix/remediate-supplier-error-handling branch from 529fc69 to 0b1aa85 Compare March 24, 2026 21:20
@jorgecuesta jorgecuesta changed the base branch from main to staging March 26, 2026 14:45
@jorgecuesta jorgecuesta force-pushed the fix/remediate-supplier-error-handling branch from 0b1aa85 to cbaed3f Compare March 26, 2026 15:06
@jorgecuesta jorgecuesta merged commit 6e5cd06 into staging Mar 26, 2026
6 checks passed
@jorgecuesta jorgecuesta deleted the fix/remediate-supplier-error-handling branch March 26, 2026 15:17
jorgecuesta pushed a commit that referenced this pull request Mar 26, 2026
* Update `remediateSupplier` to use `ApplicationFailure` exceptions for improved error handling and retry logic

* Update `upsertSupplierStatus` to return detailed result with state, remediation reasons, and supplier context
jorgecuesta added a commit that referenced this pull request Mar 26, 2026
* Add branch-gate check: only release branch can target main (#250)

* Improve bootstrap UX, workflow schedule management, and setup wizard (#251)

* Fix workflow bootstrap to evaluate and update existing schedules

- Bootstrap now compares args and interval of existing schedules against
  desired config. If different, logs warning and updates via Temporal API.
- Previously, existing schedules were always skipped regardless of config drift.
- Add ENV variable overrides for schedule intervals (SCHEDULE_<TYPE>_INTERVAL)
- Unify config: args, interval, and ENV var name in single record per workflow
- Fix dev overlay RPC URL: shannon-testnet-grove-rpc → sauron-rpc.beta

* Wait for app bootstrap before creating Temporal schedules

Workers now poll the database for isBootstrapped=true before proceeding
with schedule creation and starting the Temporal worker. This prevents
creating schedules for apps that haven't been set up through the web UI.

- Add isBootstrapped() method to both DALs (clean query, no noisy warnings)
- Poll interval configurable via BOOTSTRAP_POLL_INTERVAL_MS (default: 5000ms)

* Improve bootstrap UX, workflow schedule management, and setup wizard

Bootstrap workflows:
- Bootstrap now compares args and interval of existing schedules and updates if different
- Add ENV variable overrides for schedule intervals (SCHEDULE_<TYPE>_INTERVAL)
- Workers wait for app isBootstrapped=true before creating Temporal schedules
- Fix dev overlay RPC URL to use sauron-api.beta

Setup wizard (both apps):
- Fix RPC/Indexer URL validation: clear errors on change, no focus loss during validation
- Reset height when switching networks to prevent false height regression errors
- Add .prettierignore for proto/generated files
- Fix provider setup layout: remove sidebar, add scroll support
- Compact step navigation (prev ← current → next) for provider's 8 steps
- Add skeleton loading states instead of spinner
- Add SetupHelpBar component with links to docs for every step
- Add quick-fill buttons (Mainnet/Beta) for Node API + Indexer URLs
- Provider settings form: real-time validation with debounce, chain-id mismatch as error, height regression as warning
- Fix form label/input alignment in shared UI components

Tables and dialogs:
- Redesign table component: bordered container, header separation, row hover
- Fix DataTable: align cell padding with headers, always show action buttons
- All dialogs: max-h-[90vh] with scroll, consistent styling
- RelayMiner dialog: auto-generate identity from name, auto-select single region
- Region dialog: auto-generate urlValue from displayName
- Service dialog: load all services from chain, local search with dropdown
- Address group dialog: real buttons for Add Share/Add Address, Trash2Icon for remove
- Switch component: larger size, better unchecked contrast
- Back buttons use outline variant across all steps

Bootstrap complete step:
- Summary with settings details and entity counts
- Overview page: graceful handling of empty state with summary cards

* Split remediation schedules, UI controls, auto-bootstrap, and tests (#253)

* Split remediation schedules, add UI controls, auto-bootstrap, and tests

Remediation schedule split:
- Separate SupplierRemediation (1001 ServiceMismatch, pausable) from
  SupplierInitialStake (1003 OwnerInitialStake, always active)
- Add workflowType field to decouple schedule key from Temporal workflow
- Bootstrap auto-updates existing schedule args on deploy

Provider UI controls:
- Auto Stake toggle: pause/unpause remediation schedule from keys page
- Request Remediation: two-step evaluate→confirm flow for manual trigger
- Rename "Mark for Remediation" to "Clear Remediation" for clarity
- New server actions: Schedules.ts (get/toggle/trigger), Remediation.ts
  (evaluate needs, request remediation with Zod validation)

Workflow fixes:
- Owner sync: backfill empty ownerAddress from on-chain data for legacy keys
- Early return in SupplierStatus/SupplierRemediation when no keys exist
- Fix Temporal client defaults (middleman→provider namespace)

Auto-bootstrap for dev:
- Bootstrap seed scripts for provider and middleman (standalone, idempotent)
- Fetch minimumStake and height from Pocket API at seed time
- Fetch delegators/providers from governance CDN (single source of truth)
- Tilt conditionally injects init container + ConfigMap from local JSON
- Example configs with pocket-lego-testnet defaults

Tests:
- Fixture file with 7 realistic supplier/key scenarios
- 8 new tests for getExpectedServicesFromKey and getSupplierActiveServices
- 6 new comparison handler tests covering edge cases from re-staking bug

Docs and UI consistency:
- Unify "Node API URL"/"Shannon API URL" to "Pocket API URL" across both apps
- Update DEVELOP.md with auto-bootstrap setup instructions
- Update key-management docs with new button names and actions

* Fix jest moduleNameMapper for enums and simplify Tilt ignore patterns

- Map @igniter/db/provider/enums to correct schema/enums path in jest
- Revert fixture import to @igniter/db/provider/enums (valid for tsc)
- Simplify Tilt ignore across all 4 apps: **/dist, **/*.test.ts, **/__fixtures__
- Prevents local test/build from triggering Tilt rebuild cycles

* Fix false-positive success in remediateSupplier (#243)

* Update `remediateSupplier` to use `ApplicationFailure` exceptions for improved error handling and retry logic

* Update `upsertSupplierStatus` to return detailed result with state, remediation reasons, and supplier context

* Normalize domains to second-level structure in activity handler (#252)

* Fix text contrast in gradient-border status pills (#254)

White text on all gradient-border pills (green, orange, purple, slate)
for readable contrast. Affects node detail, transaction detail, stake,
unstake, and import success views in the middleman app.

* Run docker builds only after quality checks pass (#255)

* Address PR #253 review feedback and fix service comparison divergence (#256)

Fix service comparison divergence (root cause of supplier re-staking loop):
- Filter 0% revShare entries in getExpectedServicesFromKey to match
  BuildSupplierServiceConfigHandler behavior
- Normalize rpcType to numeric via RPCTypeMap (prevents string vs
  number false mismatch)
- Use stakeOwner (from chain) as authoritative owner address, falling
  back to ownerAddress for backwards compatibility
- Update tests to verify new correct behavior

* chore(deploy): update staging image to a0e0afb

---------

Co-authored-by: Jorge S. Cuesta <jorge.s.cuesta@gmail.com>
Co-authored-by: Alan Rojas <alan2rm7@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants