Summary
Problem definition
We have a setup with Argo Rollouts, the AWS ALB controller, and a canary deployment with trafficRouting enabled. During releases we observe 503 errors from the AWS load balancers, which indicate that the target group has no healthy targets. This problem is essentially the same as the one documented for blue-green deployments: https://github.com/argoproj/argo-rollouts/pull/4259/files
It happens after the canary steps finish and the stable service selector is switched to the new replica set.
We are aware that the ping-pong deployment strategy might fix this issue, but it is not an optimal option for us since we want to keep a fixed preview URL.
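For reference, a minimal Rollout spec matching this kind of setup might look like the following sketch (service, ingress, image names, and ports are illustrative, not taken from our actual manifests):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example/service:latest
          ports:
            - containerPort: 8080
  strategy:
    canary:
      stableService: example-service-active   # fixed stable endpoint
      canaryService: example-service-preview  # fixed preview URL for manual testing
      trafficRouting:
        alb:
          ingress: example-service-ingress
          servicePort: 8080
      steps:
        - setWeight: 20
        - pause: {}        # manual verification via the preview endpoint
        - setWeight: 100
```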
Attaching Argo Rollouts logs for a release where 503s occurred.
The points of interest are:
- Stable service selector is switched: "Aug 29, 2025 @ 06:41:45.605","Switched selector for service 'example-service-active' from '56b5dc5bc' to '7fdb8fff6b'"
- Target weights are updated to send 100% to stable: "Aug 29, 2025 @ 06:41:45.628","New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:example-service-preview,PodTemplateHash:7fdb8fff6b,},Stable:WeightDestination{Weight:100,ServiceName:example-service-active,PodTemplateHash:7fdb8fff6b,},Additional:[]WeightDestination{},Verified:nil,}"
- Argo confirms that targets are registered in the stable target group: "Aug 29, 2025 @ 06:41:55.903","Service example-service-active (TargetGroup active_target_group(replaced_real_arn)) verified: 2 endpoints registered"

The 503 errors were observed between 06:41:45 and 06:41:55.
Proposal
On top of switching the stable service selector (which triggers target registration in AWS), the canary deployment also shifts all traffic back to the stable service once the canary steps are complete. The issue is that the traffic shift happens faster than target registration.
Argo Rollouts already implements logic capable of verifying that pods from a new RS are registered in an AWS target group.
The proposal is simple: use that functionality to verify that the stable target group is healthy before switching all traffic back to it. The general flow would look like this:
- Canary steps are finished (100% of traffic is on the `canary` service)
- Service selector for `stable` is switched to the new RS
- Argo verifies that pods from the new RS are registered in the `stable` target group
- Argo switches 100% of traffic to the `stable` selector
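The flow above could be sketched as the following ordering inside a reconcile loop. This is only an illustration of the proposed sequencing, assuming hypothetical function names; it is not the real Argo Rollouts API:

```go
package main

import "fmt"

// rollout is a stand-in for the controller's view of one Rollout.
type rollout struct {
	stableVerified bool
}

// switchStableSelector points the stable Service at the new ReplicaSet hash,
// which triggers target (de)registration in the AWS target group.
func (r *rollout) switchStableSelector(newHash string) {
	fmt.Printf("switched stable selector to %s\n", newHash)
	r.stableVerified = false // registration is in flight, not yet confirmed
}

// verifyStableTargetGroup stands in for the existing endpoint-verification
// logic (the "verified: 2 endpoints registered" log line above).
func (r *rollout) verifyStableTargetGroup() bool {
	// The real controller would query ALB target group health here.
	r.stableVerified = true
	return r.stableVerified
}

// shiftTrafficToStable sets the stable weight to 100, but only after the
// target group has been verified — the key change in this proposal.
func (r *rollout) shiftTrafficToStable() error {
	if !r.stableVerified {
		return fmt.Errorf("stable target group not verified yet; requeue")
	}
	fmt.Println("set weights: stable=100, canary=0")
	return nil
}

func main() {
	r := &rollout{}
	r.switchStableSelector("7fdb8fff6b")
	// Proposed behaviour: the weight change is deferred (requeued)
	// instead of racing the target registration.
	if err := r.shiftTrafficToStable(); err != nil {
		fmt.Println(err)
	}
	if r.verifyStableTargetGroup() {
		_ = r.shiftTrafficToStable()
	}
}
```

In other words, the weight update becomes conditional on the verification step that Argo already performs, rather than running concurrently with it.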
Would it be technically possible to achieve such behaviour?
Use Cases
We would like to use this functionality to achieve truly zero-downtime deployments while still having a dedicated preview/canary endpoint for manual testing during a release.
Message from the maintainers:
Need this enhancement? Give it a 👍. We prioritize the issues with the most 👍.