Traffic unexpectedly shifts to stable during mid-rollout setCanaryScale step with Istio #4642

@myeongseok-rpls

Description

Describe the bug

When executing a setCanaryScale step mid-rollout (after traffic has already been split between stable and canary via setWeight), traffic momentarily shifts entirely to the stable ReplicaSet (100:0) before the next setWeight step corrects it.

This occurs even when:

  • All canary pods are running and ready
  • Endpoints are healthy and unchanged
  • Pod count is sufficient

Expected Behavior

When setCanaryScale: weight: 100 is executed after setWeight: 50, the traffic distribution should remain at 50:50 (stable:canary) as defined by the last setWeight step. setCanaryScale should only affect pod scaling, not traffic routing.

Actual Behavior

Traffic shifts to approximately 100:0 (all traffic to stable) during the setCanaryScale step execution, then corrects to the expected ratio at the next setWeight step.

The graph below shows the traffic spike to stable (green line surges to ~2000, yellow/canary drops to ~0) during the setCanaryScale step:

[Image: traffic graph for stable (green) and canary (yellow) during the setCanaryScale step]

Steps to Reproduce

  1. Configure a canary Rollout with Istio traffic routing
  2. Use the following canary steps:
steps:
  - setCanaryScale:
      weight: 50
  - setWeight: 10
  - analysis: ...
  - setWeight: 25
  - pause:
      duration: 30s
  - setWeight: 50
  - setCanaryScale:
      weight: 100    # <-- traffic shifts to stable here
  - setWeight: 75
  - pause:
      duration: 30s
  - setWeight: 100
  3. Deploy a new version and observe the traffic distribution at step 7 (setCanaryScale: weight: 100)

Hypothesis

The setCanaryScale step is a non-blocking step that completes immediately. During the reconciliation cycle when setCanaryScale is processed, the controller may be inadvertently reconciling the Istio VirtualService weights, causing a momentary traffic shift before the next setWeight step is reached.

The GetCurrentSetWeight function walks backward from the current step index to find the last setWeight value. If the step index advances past setCanaryScale before the traffic routing reconciler uses the correct weight, the VirtualService may temporarily receive an incorrect weight value.
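
The backward walk described above can be sketched as follows. This is a simplified model for discussion, not the Argo Rollouts source: the `step` type, `currentSetWeight`, and `exampleSteps` are hypothetical names, and returning 100 once the index is past the last step is an assumption of this sketch.

```go
package main

import "fmt"

// step is a hypothetical, simplified stand-in for one canary step.
// At most one of the pointer fields is set; `other` names steps that
// carry no weight (analysis, pause).
type step struct {
	setWeight      *int32
	setCanaryScale *int32
	other          string
}

// w returns a pointer to v so steps can be written inline.
func w(v int32) *int32 { return &v }

// exampleSteps mirrors the steps from "Steps to Reproduce" (0-based indices).
var exampleSteps = []step{
	{setCanaryScale: w(50)},  // 0
	{setWeight: w(10)},       // 1
	{other: "analysis"},      // 2
	{setWeight: w(25)},       // 3
	{other: "pause"},         // 4
	{setWeight: w(50)},       // 5
	{setCanaryScale: w(100)}, // 6 <-- traffic shifts to stable here
	{setWeight: w(75)},       // 7
	{other: "pause"},         // 8
	{setWeight: w(100)},      // 9
}

// currentSetWeight walks backward from index and returns the most recent
// setWeight value, or 0 if no setWeight step has been reached yet.
func currentSetWeight(steps []step, index int) int32 {
	if index >= len(steps) {
		return 100 // assumption: past the last step means fully promoted
	}
	for i := index; i >= 0; i-- {
		if steps[i].setWeight != nil {
			return *steps[i].setWeight
		}
	}
	return 0
}

func main() {
	for i := range exampleSteps {
		fmt.Printf("step index %d -> canary weight %d\n", i, currentSetWeight(exampleSteps, i))
	}
}
```

Under this model, the index of the setCanaryScale: weight: 100 step (6, 0-based) resolves to a canary weight of 50, which matches the expected behavior. If the real function behaves the same way, the observed 100:0 split would have to come from somewhere other than the backward walk itself.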

Workaround

Replace setCanaryScale: weight: 100 with setCanaryScale: matchTrafficWeight: true, which ties pod scaling to setWeight and avoids the issue:

steps:
  - setCanaryScale:
      weight: 50
  - setWeight: 10
  - analysis: ...
  - setWeight: 25
  - pause:
      duration: 30s
  - setWeight: 50
  - setCanaryScale:
      matchTrafficWeight: true
  - setWeight: 75
  - pause:
      duration: 30s
  - setWeight: 100

Note: This workaround sacrifices the ability to pre-scale canary pods before increasing traffic.
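
To make the trade-off in the note concrete, here is a rough sketch of how the two settings would size the canary ReplicaSet at the setCanaryScale step. It assumes the canary replica count is the ceiling of spec.replicas × weight / 100 and a Rollout with 10 replicas. Both are assumptions for illustration, and `canaryReplicas` is a hypothetical helper, not a controller function.

```go
package main

import (
	"fmt"
	"math"
)

// canaryReplicas returns the canary pod count under the assumed rule
// ceil(specReplicas * weight / 100). With matchTrafficWeight the weight is
// the last setWeight value; otherwise the explicit setCanaryScale weight
// is used when one is given.
func canaryReplicas(specReplicas, lastSetWeight int32, scaleWeight *int32, matchTrafficWeight bool) int32 {
	weight := lastSetWeight
	if !matchTrafficWeight && scaleWeight != nil {
		weight = *scaleWeight
	}
	return int32(math.Ceil(float64(specReplicas) * float64(weight) / 100))
}

func main() {
	full := int32(100)

	// Original steps: setCanaryScale: weight: 100 after setWeight: 50.
	fmt.Println("pre-scaled canary pods:", canaryReplicas(10, 50, &full, false))

	// Workaround: setCanaryScale: matchTrafficWeight: true after setWeight: 50.
	fmt.Println("traffic-matched canary pods:", canaryReplicas(10, 50, nil, true))
}
```

With these assumed numbers, the original steps bring the canary to full size before traffic moves to 75%, while the workaround leaves it at half size until the next setWeight step raises it.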

Environment

  • Argo Rollouts version: v1.8.2
  • Kubernetes version: 1.34
  • Traffic Router: Istio
