Describe the bug
When executing a setCanaryScale step mid-rollout (after traffic has already been split between stable and canary via setWeight), traffic momentarily shifts entirely to the stable ReplicaSet (100:0) before the next setWeight step corrects it.
This occurs even when:
- All canary pods are running and ready
- Endpoints are healthy and unchanged
- Pod count is sufficient
Expected Behavior
When setCanaryScale: weight: 100 is executed after setWeight: 50, the traffic distribution should remain at 50:50 (stable:canary) as defined by the last setWeight step. setCanaryScale should only affect pod scaling, not traffic routing.
Actual Behavior
Traffic shifts to approximately 100:0 (all traffic to stable) during the setCanaryScale step execution, then corrects to the expected ratio at the next setWeight step.
The graph below shows the traffic spike to stable (green line surges to ~2000, yellow/canary drops to ~0) during the setCanaryScale step:
Steps to Reproduce
- Configure a canary Rollout with Istio traffic routing
- Use the following canary steps:
    steps:
      - setCanaryScale:
          weight: 50
      - setWeight: 10
      - analysis: ...
      - setWeight: 25
      - pause:
          duration: 30s
      - setWeight: 50
      - setCanaryScale:
          weight: 100 # <-- traffic shifts to stable here
      - setWeight: 75
      - pause:
          duration: 30s
      - setWeight: 100
- Deploy a new version and observe traffic distribution at step 7 (setCanaryScale: weight: 100)
Hypothesis
The setCanaryScale step is a non-blocking step that completes immediately. During the reconciliation cycle when setCanaryScale is processed, the controller may be inadvertently reconciling the Istio VirtualService weights, causing a momentary traffic shift before the next setWeight step is reached.
The GetCurrentSetWeight function walks backward from the current step index to find the last setWeight value. If the step index advances past setCanaryScale before the traffic routing reconciler uses the correct weight, the VirtualService may temporarily receive an incorrect weight value.
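To make the expected result of that backward walk concrete, here is a minimal Python sketch of the lookup applied to the steps from this report. It is an illustration only, not the controller's Go code: the `STEPS` list and the `last_set_weight` helper are hypothetical names, and returning 0 when no earlier setWeight exists is an assumption.

```python
# Simplified stand-in for the canary steps in this report.
# Each step is a dict with a single key; this is NOT the controller's data model.
STEPS = [
    {"setCanaryScale": {"weight": 50}},
    {"setWeight": 10},
    {"analysis": {}},
    {"setWeight": 25},
    {"pause": {"duration": "30s"}},
    {"setWeight": 50},
    {"setCanaryScale": {"weight": 100}},  # step 7 (index 6), where the shift is observed
    {"setWeight": 75},
    {"pause": {"duration": "30s"}},
    {"setWeight": 100},
]


def last_set_weight(steps, current_index):
    """Walk backward from current_index to the most recent setWeight step.

    Hypothetical helper: returns 0 if no setWeight step is found
    (an assumption made for this sketch).
    """
    for i in range(current_index, -1, -1):
        if "setWeight" in steps[i]:
            return steps[i]["setWeight"]
    return 0


# While the second setCanaryScale (index 6) is the current step,
# the backward walk should still resolve to the preceding setWeight: 50.
print(last_set_weight(STEPS, 6))
```

If the lookup behaves like this sketch, the canary weight at step 7 should stay at 50, so the observed 100:0 split suggests the VirtualService is being written with a weight that comes from somewhere other than this walk, or from a different step index.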
Workaround
Replace setCanaryScale: weight: 100 with setCanaryScale: matchTrafficWeight: true, which ties pod scaling to setWeight and avoids the issue:
    steps:
      - setCanaryScale:
          weight: 50
      - setWeight: 10
      - analysis: ...
      - setWeight: 25
      - pause:
          duration: 30s
      - setWeight: 50
      - setCanaryScale:
          matchTrafficWeight: true
      - setWeight: 75
      - pause:
          duration: 30s
      - setWeight: 100
Note: This workaround sacrifices the ability to pre-scale canary pods before increasing traffic.
Environment
- Argo Rollouts version: v1.8.2
- Kubernetes version: 1.34
- Traffic Router: Istio