Summary
Problem definition
We have a setup with Argo Rollouts, the AWS ALB controller, and a canary deployment with trafficRouting enabled. During releases we observe 503 errors from the AWS load balancers, which indicate that the target group has no healthy targets. This problem is essentially the same as the one documented for blue-green deployments: https://github.com/argoproj/argo-rollouts/pull/4259/files
It happens after the canary steps finish and the stable service selector is switched to the new replica set.
We are aware that the ping-pong deployment strategy might fix this issue, but it is not an optimal option for us since we want to keep a fixed preview URL.
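For reference, a minimal Rollout spec matching this kind of setup might look like the following sketch (service, ingress, image names, and ports are illustrative, not taken from our actual manifests):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example/service:latest
          ports:
            - containerPort: 8080
  strategy:
    canary:
      stableService: example-service-active   # fixed stable endpoint
      canaryService: example-service-preview  # fixed preview URL for manual testing
      trafficRouting:
        alb:
          ingress: example-service-ingress
          servicePort: 8080
      steps:
        - setWeight: 20
        - pause: {}        # manual verification via the preview endpoint
        - setWeight: 100
```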
Attaching Argo Rollouts logs for a release where 503s occurred.
The points of interest are:
- Stable service selector is switched: "Aug 29, 2025 @ 06:41:45.605","Switched selector for service 'example-service-active' from '56b5dc5bc' to '7fdb8fff6b'"
- Target weights are updated to send 100% to stable: "Aug 29, 2025 @ 06:41:45.628","New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:example-service-preview,PodTemplateHash:7fdb8fff6b,},Stable:WeightDestination{Weight:100,ServiceName:example-service-active,PodTemplateHash:7fdb8fff6b,},Additional:[]WeightDestination{},Verified:nil,}"
- Argo confirms that targets are registered in the stable target group: "Aug 29, 2025 @ 06:41:55.903","Service example-service-active (TargetGroup active_target_group(replaced_real_arn)) verified: 2 endpoints registered"

The 503 errors were observed between 06:41:45 and 06:41:55.
Proposal
On top of switching the stable service selector (which triggers target registration in AWS), the canary deployment also shifts all traffic back to the stable service once the canary steps are complete. The issue is that the traffic shift happens faster than target registration.
Argo Rollouts already implements logic capable of verifying that pods from a new RS are registered in an AWS target group.
The proposal is simple: use that functionality to verify that the stable target group is healthy before switching all traffic back to it. The general flow would look like this:
- Canary steps are finished (100% of traffic is on the `canary` service)
- Service selector for `stable` is switched to the new RS
- Argo verifies that pods from the new RS are registered in the `stable` target group
- Argo switches 100% of traffic to the `stable` selector
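The flow above could be sketched as the following ordering inside a reconcile loop. This is only an illustration of the proposed sequencing, assuming hypothetical function names; it is not the real Argo Rollouts API:

```go
package main

import "fmt"

// rollout is a stand-in for the controller's view of one Rollout.
type rollout struct {
	stableVerified bool
}

// switchStableSelector points the stable Service at the new ReplicaSet hash,
// which triggers target (de)registration in the AWS target group.
func (r *rollout) switchStableSelector(newHash string) {
	fmt.Printf("switched stable selector to %s\n", newHash)
	r.stableVerified = false // registration is in flight, not yet confirmed
}

// verifyStableTargetGroup stands in for the existing endpoint-verification
// logic (the "verified: 2 endpoints registered" log line above).
func (r *rollout) verifyStableTargetGroup() bool {
	// The real controller would query ALB target group health here.
	r.stableVerified = true
	return r.stableVerified
}

// shiftTrafficToStable sets the stable weight to 100, but only after the
// target group has been verified — the key change in this proposal.
func (r *rollout) shiftTrafficToStable() error {
	if !r.stableVerified {
		return fmt.Errorf("stable target group not verified yet; requeue")
	}
	fmt.Println("set weights: stable=100, canary=0")
	return nil
}

func main() {
	r := &rollout{}
	r.switchStableSelector("7fdb8fff6b")
	// Proposed behaviour: the weight change is deferred (requeued)
	// instead of racing the target registration.
	if err := r.shiftTrafficToStable(); err != nil {
		fmt.Println(err)
	}
	if r.verifyStableTargetGroup() {
		_ = r.shiftTrafficToStable()
	}
}
```

In other words, the weight update becomes conditional on the verification step that Argo already performs, rather than running concurrently with it.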
Would it be technically possible to achieve such behaviour?
Use Cases
We would like to use this functionality to achieve truly zero-downtime deployments while still having a dedicated preview/canary endpoint for manual testing during a release.
Message from the maintainers:
Need this enhancement? Give it a 👍. We prioritize the issues with the most 👍.