@andrewjamesbrown
Contributor

@andrewjamesbrown andrewjamesbrown commented Dec 12, 2025

Fixes #4390 and #4534

The isReplicaSetReferenced function checks:

  • Static references in rollout status (StableRS, CurrentPodHash, Weights.Canary.PodTemplateHash, Weights.Stable.PodTemplateHash)
  • Service selectors for Canary/BlueGreen services

However, with Istio subset-level traffic splitting (DestinationRules), the function does not check whether the ReplicaSet's pod template hash is still referenced by a DestinationRule subset.
As a result, a ReplicaSet can be scaled down while Istio is still routing traffic to its subset.
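
To make the gap concrete, here is a minimal sketch of a reference check that also looks at DestinationRule subsets. It is not the code from this PR: the `rolloutStatus` and `subset` types and all function names are hypothetical stand-ins, the Service selector check is left out for brevity, and it assumes each subset carries the pod template hash under the `rollouts-pod-template-hash` label.

```go
package main

import "fmt"

// hashLabel is assumed to be the label that carries the pod template hash.
const hashLabel = "rollouts-pod-template-hash"

// rolloutStatus is a hypothetical, flattened stand-in for the status fields
// listed above.
type rolloutStatus struct {
	StableRS       string
	CurrentPodHash string
	CanaryHash     string // Weights.Canary.PodTemplateHash
	StableHash     string // Weights.Stable.PodTemplateHash
}

// subset is a hypothetical stand-in for one Istio DestinationRule subset.
type subset struct {
	Name   string
	Labels map[string]string
}

// referencedInStatus reports whether hash appears in any of the status fields.
func referencedInStatus(hash string, s rolloutStatus) bool {
	if hash == "" {
		return false
	}
	for _, h := range []string{s.StableRS, s.CurrentPodHash, s.CanaryHash, s.StableHash} {
		if h == hash {
			return true
		}
	}
	return false
}

// referencedInSubsets reports whether any subset still selects pods with hash.
func referencedInSubsets(hash string, subsets []subset) bool {
	if hash == "" {
		return false
	}
	for _, ss := range subsets {
		if ss.Labels[hashLabel] == hash {
			return true
		}
	}
	return false
}

// isReferenced combines the status check with the subset check.
func isReferenced(hash string, s rolloutStatus, subsets []subset) bool {
	return referencedInStatus(hash, s) || referencedInSubsets(hash, subsets)
}

func main() {
	// The status already points at "new", but the canary subset still selects "old".
	status := rolloutStatus{StableRS: "stable", CurrentPodHash: "new", CanaryHash: "new", StableHash: "stable"}
	subsets := []subset{
		{Name: "stable", Labels: map[string]string{hashLabel: "stable"}},
		{Name: "canary", Labels: map[string]string{hashLabel: "old"}},
	}
	fmt.Println("status only:", referencedInStatus("old", status))
	fmt.Println("status + subsets:", isReferenced("old", status, subsets))
}
```

With these inputs the status-only check reports the old hash as unreferenced, while the combined check still sees it in the canary subset.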

We see these logs:

Skip scale down of older RS 'service-5f68597bd5': still referenced
Patched: {"status":{"availableReplicas":367,"canary":{"weights":{"canary":{"podTemplateHash":"fb447bcd9"}}},"readyReplicas":367}}
delaying destination rule switch: ReplicaSet service-88dd777fc not fully available
scaling down intermediate RS 'service-5f68597bd5'
Scaled down ReplicaSet service-5f68597bd5 (revision 1823) from 139 to 0

At that point, Istio is still sending traffic to the subset backed by the scaled-down ReplicaSet and starts returning HTTP 503 with a UH (no healthy upstream) error.

The issue is that the rollout metadata is updated, but because we do not use Services (only Istio subsets), trafficrouting cannot detect that the "intermediate RS" is still in use.


When using Istio with DestinationRule subsets (without explicit canary/stable services), the isReplicaSetReferenced function doesn't check the DestinationRule to see if the pod template hash is still referenced in a subset.

The problem is:

  • The rollout status (e.g., Weights.Canary.PodTemplateHash) gets updated to the new hash
  • But the Istio DestinationRule subsets still have the old hash because UpdateHash hasn't been called yet (or returns early due to checkReplicasAvailable)
  • The ReplicaSet with the old hash gets scaled down because isReplicaSetReferenced returns false
  • Traffic is still being routed to that subset, causing 503s
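
The sequence above can be replayed in a small simulation. This is a sketch under stated assumptions, not the controller's code: `state`, `updateSubsetHash`, and `safeToScaleDown` are hypothetical names, and the `fullyAvailable` flag stands in for whatever availability check delays the DestinationRule switch.

```go
package main

import "fmt"

// state is a hypothetical snapshot of what the rollout status and the
// DestinationRule canary subset each point at.
type state struct {
	statusCanaryHash string // Weights.Canary.PodTemplateHash in the rollout status
	subsetCanaryHash string // hash label on the DestinationRule canary subset
}

// updateSubsetHash mimics a DestinationRule update that returns early while
// the new ReplicaSet is not fully available. It reports whether it switched.
func updateSubsetHash(s *state, newHash string, fullyAvailable bool) bool {
	if !fullyAvailable {
		return false // delayed: the subset keeps selecting the old hash
	}
	s.subsetCanaryHash = newHash
	return true
}

// safeToScaleDown reports whether the ReplicaSet with the given hash may be
// scaled down. With checkSubsets=false it models the status-only check.
func safeToScaleDown(hash string, s state, checkSubsets bool) bool {
	if s.statusCanaryHash == hash {
		return false
	}
	if checkSubsets && s.subsetCanaryHash == hash {
		return false
	}
	return true
}

func main() {
	s := state{statusCanaryHash: "old", subsetCanaryHash: "old"}

	// Step 1: the rollout status moves to the new hash.
	s.statusCanaryHash = "new"

	// Step 2: the subset update is delayed because the new ReplicaSet is not ready.
	switched := updateSubsetHash(&s, "new", false)
	fmt.Println("subset switched:", switched)

	// Step 3: the scale-down decision for the old ReplicaSet.
	fmt.Println("status-only check allows scale-down:", safeToScaleDown("old", s, false))
	fmt.Println("subset-aware check allows scale-down:", safeToScaleDown("old", s, true))
}
```

In this replay the status-only check allows the old ReplicaSet to be scaled down while the subset still selects it, which is the window in which the 503s appear. The subset-aware check blocks the scale-down until the subset has switched.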

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@andrewjamesbrown andrewjamesbrown changed the title from "fix(replicaset): ReplicaSetReferenced does not check istio DestinationRules" to "fix(trafficrouting): ReplicaSetReferenced does not check istio DestinationRules" on Dec 12, 2025
@andrewjamesbrown andrewjamesbrown force-pushed the ajb/replicset_istio_dr branch 4 times, most recently from a9e6e72 to c31245a on December 12, 2025 21:50
@github-actions
Contributor

github-actions bot commented Dec 12, 2025

Published E2E Test Results

  4 files    4 suites   3h 24m 52s ⏱️
117 tests 108 ✅  7 💤 2 ❌
470 runs  440 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit 977a414.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Dec 12, 2025

Published Unit Test Results

2 384 tests   2 384 ✅  3m 8s ⏱️
  129 suites      0 💤
    1 files        0 ❌

Results for commit 977a414.

♻️ This comment has been updated with latest results.

@codecov

codecov bot commented Dec 12, 2025

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.42%. Comparing base (c6dfed2) to head (977a414).

Files with missing lines | Patch % | Lines
rollout/replicaset.go | 94.28% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4560      +/-   ##
==========================================
+ Coverage   84.40%   84.42%   +0.01%     
==========================================
  Files         164      164              
  Lines       18849    18884      +35     
==========================================
+ Hits        15909    15942      +33     
- Misses       2077     2079       +2     
  Partials      863      863              

@andrewjamesbrown
Contributor Author

@zachaller I'm not sure about the release/review process for this repo. Any thoughts on how to get this closer to being merged? We've been running it on an internal build for the past ~month and it has resolved our issue, so I'm keen to get it into a release and go back to upstream instead of our fork. Thanks!

@andrewjamesbrown
Contributor Author

@zachaller friendly ping on this one too :-)

@andrewjamesbrown andrewjamesbrown force-pushed the ajb/replicset_istio_dr branch 2 times, most recently from b37aeca to d8e4b15 on January 26, 2026 17:20
Signed-off-by: Andrew Brown <[email protected]>
Development

Successfully merging this pull request may close these issues.

If stable is not ready a new canary can cause traffic to be still on old RS while its being scaled down
