Prevent 503s during final canary step with AWS ALB #4433

@ajax-koval-d

Summary

Problem definition

We run Argo Rollouts with the AWS ALB controller and a canary strategy with trafficRouting enabled. During releases we observe 503 errors generated by the AWS load balancers, which indicate that the target group has no healthy targets. This is essentially the same problem as the one documented for blue-green deployments: https://github.com/argoproj/argo-rollouts/pull/4259/files

It happens after the canary steps have finished and the stable service selector is switched to the new ReplicaSet.

We are aware that the ping-pong deployment strategy might fix this issue, but it is not a good option for us because we want a fixed preview URL.

Attaching Argo Rollouts logs for a release where 503s occurred.

canary503.txt

The points of interest are:

Switching stable service selector - "Aug 29, 2025 @ 06:41:45.605","Switched selector for service 'example-service-active' from '56b5dc5bc' to '7fdb8fff6b'"
Updating target weights to set 100% to stable - "Aug 29, 2025 @ 06:41:45.628","New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:example-service-preview,PodTemplateHash:7fdb8fff6b,},Stable:WeightDestination{Weight:100,ServiceName:example-service-active,PodTemplateHash:7fdb8fff6b,},Additional:[]WeightDestination{},Verified:nil,}"
Argo confirms that targets are registered in stable target group - "Aug 29, 2025 @ 06:41:55.903","Service example-service-active (TargetGroup active_target_group(replaced_real_arn)) verified: 2 endpoints registered"

503 errors were observed between 06:41:45 and 06:41:55.
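To make the timing explicit, the three timestamps quoted above can be compared directly. This is a minimal sketch: the event names in the dictionary are labels chosen here for illustration and do not come from the Argo Rollouts logs. Only the timestamps are copied from the log excerpts.

```python
from datetime import datetime

# Timestamp layout used in the log excerpts quoted in this issue.
LOG_TIME_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"

# Keys are illustrative labels; values are the timestamps from the logs above.
events = {
    "selector_switched": "Aug 29, 2025 @ 06:41:45.605",
    "weights_set_to_stable": "Aug 29, 2025 @ 06:41:45.628",
    "targets_verified": "Aug 29, 2025 @ 06:41:55.903",
}


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two named events."""
    started = datetime.strptime(events[start], LOG_TIME_FORMAT)
    ended = datetime.strptime(events[end], LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# Time between the selector switch and the weight update.
selector_to_weights = seconds_between("selector_switched", "weights_set_to_stable")

# Time between sending all traffic to stable and Argo confirming registration.
weights_to_verified = seconds_between("weights_set_to_stable", "targets_verified")

print(f"selector switch -> weight update: {selector_to_weights:.3f}s")
print(f"weight update -> targets verified: {weights_to_verified:.3f}s")
```

The weight update follows the selector switch by roughly 23 ms, while the registration check succeeds about 10.3 s later, which lines up with the window in which the 503s were seen. Note that the verification timestamp is only an upper bound on when the targets actually became usable, since it reflects when the check ran.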

Proposal

In addition to switching the stable service selector (which triggers target registration in AWS), the canary strategy also shifts all traffic back to the stable service once the canary steps are complete. The issue is that the traffic shift happens faster than target registration, so for a short time the stable target group receives 100% of the traffic before it has any registered targets.

Argo Rollouts already implements logic that can verify that pods from the new ReplicaSet are registered in an AWS target group.

The proposal is simple: use that functionality to verify that the stable target group is healthy before switching all traffic back to it. The general flow would look like this:

  1. Canary steps are finished (100% of traffic is on the canary service)
  2. The stable service selector is switched to the new ReplicaSet
  3. Argo verifies that pods from the new ReplicaSet are registered in the stable target group
  4. Argo switches 100% of traffic to the stable service
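The ordering in the flow above could be sketched roughly as follows. This is not Argo Rollouts code: `finish_canary` and the three callbacks are hypothetical names introduced only to illustrate the proposed sequence, and the timeout and poll interval defaults are arbitrary assumptions.

```python
import time
from typing import Callable


def finish_canary(
    switch_stable_selector: Callable[[], None],
    stable_targets_registered: Callable[[], bool],
    shift_all_traffic_to_stable: Callable[[], None],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical sketch: shift traffic only after registration is verified.

    Returns True if traffic was shifted, False if verification timed out
    (in which case traffic stays on the canary service).
    """
    # Step 2: point the stable service at the new ReplicaSet.
    switch_stable_selector()

    # Step 3: wait until the new pods are registered in the stable target group.
    deadline = time.monotonic() + timeout_s
    while not stable_targets_registered():
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)

    # Step 4: only now send 100% of traffic to the stable service.
    shift_all_traffic_to_stable()
    return True


# Simulated run: registration succeeds on the third poll.
calls: list[str] = []
poll_count = {"n": 0}


def fake_switch() -> None:
    calls.append("switch_selector")


def fake_registered() -> bool:
    poll_count["n"] += 1
    calls.append("verify")
    return poll_count["n"] >= 3


def fake_shift() -> None:
    calls.append("shift_traffic")


shifted = finish_canary(fake_switch, fake_registered, fake_shift, sleep=lambda _: None)
print(shifted, calls)
```

The point of the sketch is that the weight change is gated on the registration check instead of being issued immediately after the selector switch. If verification times out, traffic remains on the canary service, which is still serving the new version.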

Would it be technically possible to achieve this behaviour?

Use Cases

We would like to use this functionality to achieve true zero-downtime deployments while still having a dedicated preview/canary endpoint for manual testing during a release.


Labels: enhancement