OCPBUGS-65901: Retry incomplete cluster registration in ABI #8429
base: release-4.20
A failure to do a ListClusters call is different from successfully doing the call and finding that there are no clusters registered. Handle the two cases separately.
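The distinction can be sketched as follows. This is a minimal illustration, not the PR's code: the `Cluster` type, the `clusterLister` interface, `findExistingCluster`, and `fakeLister` are hypothetical names, and the real client's `ListClusters` call has a different signature.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical stand-in for the service's cluster model.
type Cluster struct {
	ID string
}

// clusterLister is a hypothetical interface, not the real client API.
type clusterLister interface {
	ListClusters() ([]Cluster, error)
}

// errUnavailable simulates a failed ListClusters call.
var errUnavailable = errors.New("service unavailable")

// findExistingCluster returns an error when the call itself fails, and
// (nil, nil) when the call succeeds but no cluster is registered.
func findExistingCluster(l clusterLister) (*Cluster, error) {
	clusters, err := l.ListClusters()
	if err != nil {
		return nil, fmt.Errorf("listing clusters: %w", err)
	}
	if len(clusters) == 0 {
		return nil, nil
	}
	return &clusters[0], nil
}

// fakeLister is a test double that returns canned results.
type fakeLister struct {
	clusters []Cluster
	err      error
}

func (f fakeLister) ListClusters() ([]Cluster, error) { return f.clusters, f.err }

func main() {
	for _, l := range []fakeLister{
		{err: errUnavailable},
		{},
		{clusters: []Cluster{{ID: "abc"}}},
	} {
		c, err := findExistingCluster(l)
		switch {
		case err != nil:
			fmt.Println("retry later:", err)
		case c == nil:
			fmt.Println("no cluster registered, create one")
		default:
			fmt.Println("reuse cluster", c.ID)
		}
	}
}
```

Only the second case should lead to creating a new cluster; treating the first case the same way risks registering a duplicate.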
Extract the installConfig override application logic into a separate, reusable function ApplyInstallConfigOverrides(). This preserves existing behavior where overrides are applied within RegisterCluster(), but makes the logic testable and reusable.
Includes comprehensive unit tests covering:
- Applying overrides to cluster without overrides
- Idempotent behavior when overrides already applied
- Re-applying when overrides differ from manifest
- Error handling for API failures
- Handling clusters without override annotations
- Validation of manifest file errors
- Normalization of JSON with different whitespace
- Normalization of JSON with different key ordering
- Handling of empty strings
- Error handling for invalid JSON in new overrides
- Recovery from invalid JSON in existing cluster overrides
- Consistency of normalization output
Assisted-by: Claude Code
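One way to compare overrides independently of whitespace and key order is to round-trip them through `encoding/json`, which emits map keys in sorted order. The sketch below is an assumption about how such a comparison could look, not the PR's implementation; `normalizeJSON` and `overridesNeedUpdate` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// normalizeJSON re-encodes a JSON document so that two documents differing
// only in whitespace or object key order produce the same string.
// An empty (or all-whitespace) input normalizes to the empty string.
func normalizeJSON(s string) (string, error) {
	if strings.TrimSpace(s) == "" {
		return "", nil
	}
	var v interface{}
	if err := json.Unmarshal([]byte(s), &v); err != nil {
		return "", fmt.Errorf("invalid JSON: %w", err)
	}
	out, err := json.Marshal(v) // map keys are written in sorted order
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// overridesNeedUpdate reports whether the desired overrides must be applied.
// Invalid desired overrides are an error; invalid existing overrides are
// treated as "needs update" so the caller can recover by re-applying.
func overridesNeedUpdate(existing, desired string) (bool, error) {
	want, err := normalizeJSON(desired)
	if err != nil {
		return false, err
	}
	have, err := normalizeJSON(existing)
	if err != nil {
		return true, nil
	}
	return have != want, nil
}

func main() {
	existing := `{"a":1,"b":{"c":2}}`
	desired := "{ \"b\": {\"c\": 2},\n  \"a\": 1 }"
	update, err := overridesNeedUpdate(existing, desired)
	fmt.Println(update, err)
}
```

Skipping the update when the normalized forms match is what makes a repeated call a no-op.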
Make RegisterExtraManifests idempotent by checking for existing manifests before attempting to create them. This prevents failures when the registration process is retried (e.g., after a service restart).
Add comprehensive unit tests that verify:
- Creating new manifests when none exist
- Skipping manifests with identical content
- Returning error when content differs
- Full idempotency across multiple calls
- Proper error handling for API failures
This ensures safe retry of the registration process.
Assisted-by: Claude Code
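The check-before-create pattern described above might look roughly like this. It is a simplified sketch under stated assumptions: `manifestStore`, `memStore`, and the lower-case `registerExtraManifests` are hypothetical and only model the behavior (create if missing, skip if identical, fail if different), not the real API.

```go
package main

import (
	"fmt"
	"sort"
)

// manifestStore is a hypothetical abstraction over the manifests API.
type manifestStore interface {
	List() (map[string]string, error) // file name -> content
	Create(name, content string) error
}

// registerExtraManifests creates only the manifests that do not exist yet.
// Identical existing manifests are skipped; differing content is an error.
func registerExtraManifests(store manifestStore, desired map[string]string) error {
	existing, err := store.List()
	if err != nil {
		return fmt.Errorf("listing manifests: %w", err)
	}
	names := make([]string, 0, len(desired))
	for name := range desired {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order across retries
	for _, name := range names {
		content := desired[name]
		if have, ok := existing[name]; ok {
			if have == content {
				continue // already registered, nothing to do
			}
			return fmt.Errorf("manifest %q already exists with different content", name)
		}
		if err := store.Create(name, content); err != nil {
			return fmt.Errorf("creating manifest %q: %w", name, err)
		}
	}
	return nil
}

// memStore is an in-memory store that counts Create calls.
type memStore struct {
	files   map[string]string
	creates int
}

func (m *memStore) List() (map[string]string, error) {
	out := make(map[string]string, len(m.files))
	for k, v := range m.files {
		out[k] = v
	}
	return out, nil
}

func (m *memStore) Create(name, content string) error {
	m.files[name] = content
	m.creates++
	return nil
}

func main() {
	store := &memStore{files: map[string]string{}}
	desired := map[string]string{"a.yaml": "kind: A", "b.yaml": "kind: B"}
	fmt.Println(registerExtraManifests(store, desired), store.creates)
	fmt.Println(registerExtraManifests(store, desired), store.creates)
}
```

Returning an error on differing content, rather than overwriting, follows the behavior listed in the commit message.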
Fix the bug where installConfig overrides and extra manifests are not applied when the service restarts after finding an existing cluster.
Previously, the registerCluster() function would immediately return if a cluster already existed, skipping the steps to apply installConfig overrides and register extra manifests. This meant that if the service crashed or was restarted after cluster registration but before these steps completed, the configuration would be incomplete.
Now, registerCluster() unconditionally calls both ApplyInstallConfigOverrides() and RegisterExtraManifests() after obtaining the cluster (whether newly created or existing). Since both functions are idempotent, this is safe to retry and ensures all configuration steps complete successfully.
Add subsystem tests to verify:
- Retry of installConfig overrides on restart (idempotent)
- Application of missing overrides to existing cluster
- Retry of extra manifest registration (idempotent)
Assisted-by: Claude Code
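The control-flow change can be illustrated with a toy model of a crash and restart. Everything here is hypothetical (the `service` struct, its fields, and the injected failure); it only shows why running the later steps unconditionally lets a restarted client finish a partially completed registration.

```go
package main

import (
	"errors"
	"fmt"
)

// errInjected simulates a crash or API failure between steps.
var errInjected = errors.New("injected failure")

// service is a toy model of the remote state, not the real API.
type service struct {
	clusterID           string
	overridesApplied    bool
	manifestsRegistered bool
	failOverridesOnce   bool // when true, the next applyOverrides call fails
}

// applyOverrides is idempotent: calling it again leaves the state unchanged.
func (s *service) applyOverrides() error {
	if s.failOverridesOnce {
		s.failOverridesOnce = false
		return errInjected
	}
	s.overridesApplied = true
	return nil
}

// registerManifests is idempotent as well.
func (s *service) registerManifests() error {
	s.manifestsRegistered = true
	return nil
}

// registerCluster reuses an existing cluster or creates one, then always
// runs the remaining steps instead of returning early.
func registerCluster(s *service) error {
	if s.clusterID == "" {
		s.clusterID = "cluster-1" // step 1: create the cluster
	}
	if err := s.applyOverrides(); err != nil { // step 2
		return fmt.Errorf("applying overrides: %w", err)
	}
	if err := s.registerManifests(); err != nil { // step 3
		return fmt.Errorf("registering manifests: %w", err)
	}
	return nil
}

func main() {
	s := &service{failOverridesOnce: true}
	fmt.Println("first run:", registerCluster(s))
	// A restart calls registerCluster again against the same remote state.
	fmt.Println("second run:", registerCluster(s))
	fmt.Println(s.overridesApplied, s.manifestsRegistered)
}
```

With an early return on an existing cluster, the second run in this model would stop after step 1 and leave the overrides and manifests unapplied.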
|
/jira cherrypick OCPBUGS-56913 |
|
@zaneb: Jira Issue OCPBUGS-56913 has been cloned as Jira Issue OCPBUGS-65901. Will retitle bug to link to clone. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@zaneb: This pull request references Jira Issue OCPBUGS-65901, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
|
/cherry-pick release-4.19 |
|
@zaneb: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest |
|
/retest-required |
|
@zaneb: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/label backport-risk-assessed |
|
/jira refresh |
|
@zaneb: This pull request references Jira Issue OCPBUGS-65901, which is invalid:
Comment DetailsIn response to this:
|
/jira refresh |
|
@zaneb: This pull request references Jira Issue OCPBUGS-65901, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gamli75, zaneb The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cc @zniu1011 |
Fix a problem where a partial failure of creating the Cluster object in the agent-based installer client could result in an inconsistent cluster config.
Because the client exits on failure and relies on systemd to restart it, it effectively operates like a distributed system. Cluster creation has three steps: creating the Cluster object, applying the install-config overrides, and adding each additional manifest. If any of these steps fails, we must retry idempotently. This was not happening previously: after any failure following the first step, the remaining steps were never retried, because the new instance of the client would see that the Cluster exists and skip the other operations. This could result in us progressing to install a cluster with only part of the user-supplied configuration applied.
This change fixes that so that we always either eventually apply the full config as provided or never progress.
List all the issues related to this PR
OCPBUGS-56913