feat(feedback): only update gateway scores for valid PSP-side errors, not client/merchant errors #237

@tinu-hareesswar

Description

Context

Gateway scoring in the decision engine currently runs whenever update-gateway-score is called with a failed TxnStatus. The scoring-type decision lives in get_gateway_scoring_type in src/feedback/gateway_scoring_service.rs (around L293) and is effectively:

if is_success {
    GatewayScoringType::Reward
} else if is_failure {
    GatewayScoringType::PenaliseSrv3
} else if time_difference < threshold {
    GatewayScoringType::Penalise
} else {
    GatewayScoringType::PenaliseSrv3
}

There is no inspection of why the transaction failed. Every failure — whether the gateway was down, the customer entered a wrong CVV, the merchant sent an invalid request, or the card issuer declined for insufficient funds — penalises the gateway's SR / elimination / latency scores identically.

This is incorrect: scoring should reflect PSP health, and client-side / merchant-side / issuer-side failures are not signals that the PSP is unhealthy. Penalising on these errors causes:

  • Healthy gateways to be demoted in rankings after a spike of merchant-side errors (e.g. a buggy integration sending malformed requests).
  • Routing to oscillate away from a perfectly good gateway during a customer-behavior spike (e.g. a marketing push causing lots of declined cards).
  • Scoring noise that makes SR-based routing less predictable.

Motivation

Gateway scores should be a trustworthy signal of gateway health. A gateway must only be penalised when it was genuinely at fault.

Proposal

This builds on the Gateway Status Mapping (GSM) work (see the related issue adding a GSM table). With GSM in place, each (gateway, error_code, error_message) tuple has a canonical classification. We should use that classification to filter scoring updates.

  1. Classify errors into at least these buckets in the GSM (or via a new error_category column):

    • PSP_FAULT — connector / gateway failure, timeout, 5xx, acquirer down. Score update: yes.
    • ISSUER_DECLINE — card declined, insufficient funds, do-not-honor. Score update: no (or separate signal).
    • CLIENT_ERROR — bad CVV, wrong OTP, user cancelled, expired card. Score update: no.
    • MERCHANT_ERROR — invalid request, bad signature, misconfigured MID. Score update: no.
    • UNKNOWN — no mapping found. Score update: configurable default (recommend: yes, but log loudly so mappings can be added).
  2. Plumb error_code / error_message through the feedback path from UpdateScorePayload into get_gateway_scoring_type and update_gateway_score in src/feedback/gateway_scoring_service.rs.

  3. Short-circuit scoring in check_and_update_gateway_score_ when the classified category is not PSP_FAULT (or UNKNOWN, when the default policy treats unmapped codes as PSP faults). Return a success response so callers still get their retry decision, but skip the Redis score mutations.

  4. Metrics: add a counter broken down by category so operators can see how many feedback calls were filtered out and why.

  5. Config knob: a merchant-level (and global-default) setting to choose strict mode (only PSP_FAULT updates scores) vs lenient mode (current behavior, for backwards compatibility during rollout).
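
Steps 1, 3 and 5 above could be sketched roughly as follows. This is a minimal illustration only, not the decision engine's code: ErrorCategory, FilterMode, GsmTable and should_update_score are hypothetical names, the in-memory map stands in for the GSM table from the related issue, and the gateway and error strings in the usage note are made up.

```rust
use std::collections::HashMap;

/// Hypothetical error buckets mirroring the categories proposed in step 1.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ErrorCategory {
    PspFault,
    IssuerDecline,
    ClientError,
    MerchantError,
    Unknown,
}

/// Hypothetical rollout mode from step 5.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FilterMode {
    /// Only PSP faults (and optionally unknown codes) update scores.
    Strict,
    /// Current behavior: every failure updates scores.
    Lenient,
}

/// In-memory stand-in for the GSM table, keyed by
/// (gateway, error_code, error_message). The real table is assumed
/// to live in a database; this map only illustrates the lookup.
#[derive(Default)]
pub struct GsmTable {
    entries: HashMap<(String, String, String), ErrorCategory>,
}

impl GsmTable {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn insert(&mut self, gateway: &str, code: &str, message: &str, category: ErrorCategory) {
        let key = (gateway.to_string(), code.to_string(), message.to_string());
        self.entries.insert(key, category);
    }

    /// Returns the mapped category, or Unknown when no mapping exists.
    pub fn classify(&self, gateway: &str, code: &str, message: &str) -> ErrorCategory {
        let key = (gateway.to_string(), code.to_string(), message.to_string());
        self.entries.get(&key).copied().unwrap_or(ErrorCategory::Unknown)
    }
}

/// Decides whether a failed transaction should mutate gateway scores.
/// `unknown_updates_score` is the configurable default for unmapped codes.
pub fn should_update_score(
    category: ErrorCategory,
    mode: FilterMode,
    unknown_updates_score: bool,
) -> bool {
    match mode {
        FilterMode::Lenient => true,
        FilterMode::Strict => match category {
            ErrorCategory::PspFault => true,
            ErrorCategory::Unknown => unknown_updates_score,
            ErrorCategory::IssuerDecline
            | ErrorCategory::ClientError
            | ErrorCategory::MerchantError => false,
        },
    }
}
```

With a mapping inserted for ("example_gateway", "E_TIMEOUT", "upstream timeout") as PspFault, classify returns PspFault for that tuple and Unknown for anything unmapped. In Strict mode, should_update_score then returns true for the mapped timeout and false for a ClientError, while Lenient mode keeps today's behavior of always updating.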

Acceptance Criteria

  • get_gateway_scoring_type (or a new wrapper) takes the error classification into account and returns a no-op variant when the failure is not the PSP's fault.
  • Gateway SR / elimination / latency scores are unchanged when a failure is classified as CLIENT_ERROR, MERCHANT_ERROR, or ISSUER_DECLINE.
  • PSP_FAULT failures continue to penalise the gateway exactly as today.
  • A new Prometheus counter exposes filtered-vs-applied scoring updates by category.
  • A merchant-level config flag (with a global default) can opt in to or out of strict filtering, with the default chosen for a safe rollout.
  • Unit tests cover each category → expected scoring action.
  • Docs updated: updateGatewayScore.mdx explains that classification is driven by the GSM table and that unmapped codes behave per the default policy.
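
One possible shape for the "new wrapper" and its category → action tests is sketched below. It is an assumption-laden illustration: GatewayScoringType here is a local stand-in carrying only the three variants quoted in the Context section, and ScoringAction, ErrorCategory and scoring_action are hypothetical names rather than existing items in gateway_scoring_service.rs.

```rust
/// Local stand-in for the existing enum; only the variants quoted in
/// the Context section are reproduced here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GatewayScoringType {
    Reward,
    Penalise,
    PenaliseSrv3,
}

/// Hypothetical classification produced by the GSM lookup.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    PspFault,
    IssuerDecline,
    ClientError,
    MerchantError,
    Unknown,
}

/// Hypothetical wrapper result: either apply the existing scoring type
/// or skip the score mutation, recording why (useful as a metric label).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScoringAction {
    Apply(GatewayScoringType),
    Skip(ErrorCategory),
}

/// Wraps the scoring type computed by the existing logic (`base`) with
/// the error classification. `category` is None when no error details
/// were supplied, which is treated like Unknown.
pub fn scoring_action(
    base: GatewayScoringType,
    category: Option<ErrorCategory>,
    strict: bool,
    unknown_updates_score: bool,
) -> ScoringAction {
    // Successes are never filtered, and lenient mode keeps current behavior.
    if base == GatewayScoringType::Reward || !strict {
        return ScoringAction::Apply(base);
    }
    match category.unwrap_or(ErrorCategory::Unknown) {
        ErrorCategory::PspFault => ScoringAction::Apply(base),
        ErrorCategory::Unknown if unknown_updates_score => ScoringAction::Apply(base),
        other => ScoringAction::Skip(other),
    }
}
```

Returning the skipped category alongside the no-op keeps the existing scoring logic untouched and gives the metrics counter its label for free. A unit test per category would then assert, for example, that a PenaliseSrv3 base with ClientError in strict mode yields Skip(ClientError), while the same base with PspFault yields Apply(PenaliseSrv3).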

Dependencies

  • Builds on the GSM table work — this issue should land after (or alongside) the GSM issue so that error classification has a canonical source.
