Skip to content

Conversation

@zaneb
Copy link
Member

@zaneb zaneb commented Nov 23, 2025

Fix a problem where a partial failure of creating the Cluster object in the agent-based installer client could result in an inconsistent cluster config.

Because the client exits on failure and relies on systemd to restart it, it effectively operates like a distributed system. Since the cluster creation has 3 steps - creating the Cluster object, applying the install-config overrides, and adding each additional manifest - we must retry idempotently if any of these steps fail. This was not happening previously: any failure after the first step would result in no retries, as the new instance of the client would see that the Cluster exists and not continue with the other operations. This could result in us progressing to install a cluster with only part of the configuration supplied by the user applied.

This change fixes that so that we always either eventually apply the full config as provided or never progress.

List all the issues related to this PR

  • OCPBUGS-56913

  • New Feature

  • Enhancement

  • Bug fix

  • Tests

  • Documentation

  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

/jira cherrypick OCPBUGS-56913

A failure to do a ListClusters call is different from successfully doing
the call and finding that there are no clusters registered. Handle the
two cases separately.
Extract the installConfig override application logic into a separate,
reusable function ApplyInstallConfigOverrides(). This preserves existing
behavior where overrides are applied within RegisterCluster(), but makes
the logic testable and reusable.

Includes comprehensive unit tests covering:
- Applying overrides to cluster without overrides
- Idempotent behavior when overrides already applied
- Re-applying when overrides differ from manifest
- Error handling for API failures
- Handling clusters without override annotations
- Validation of manifest file errors
- Normalization of JSON with different whitespace
- Normalization of JSON with different key ordering
- Handling of empty strings
- Error handling for invalid JSON in new overrides
- Recovery from invalid JSON in existing cluster overrides
- Consistency of normalization output

Assisted-by: Claude Code
Make RegisterExtraManifests idempotent by checking for existing
manifests before attempting to create them. This prevents failures when
the registration process is retried (e.g., after a service restart).

Add comprehensive unit tests that verify:
- Creating new manifests when none exist
- Skipping manifests with identical content
- Returning error when content differs
- Full idempotency across multiple calls
- Proper error handling for API failures

This ensures safe retry of the registration process.

Assisted-by: Claude Code
Fix the bug where installConfig overrides and extra manifests are not
applied when the service restarts after finding an existing cluster.

Previously, the registerCluster() function would immediately return if
a cluster already existed, skipping the steps to apply installConfig
overrides and register extra manifests. This meant that if the service
crashed or was restarted after cluster registration but before these
steps completed, the configuration would be incomplete.

Now, registerCluster() unconditionally calls both
ApplyInstallConfigOverrides() and RegisterExtraManifests() after
obtaining the cluster (whether newly created or existing). Since both
functions are idempotent, this is safe to retry and ensures all
configuration steps complete successfully.

Add subsystem tests to verify:
- Retry of installConfig overrides on restart (idempotent)
- Application of missing overrides to existing cluster
- Retry of extra manifest registration (idempotent)

Assisted-by: Claude Code
@coderabbitai
Copy link

coderabbitai bot commented Nov 23, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 23, 2025
@openshift-ci openshift-ci bot requested review from bfournie and gamli75 November 23, 2025 22:15
@zaneb
Copy link
Member Author

zaneb commented Nov 23, 2025

/jira cherrypick OCPBUGS-56913

@openshift-ci-robot
Copy link

@zaneb: Jira Issue OCPBUGS-56913 has been cloned as Jira Issue OCPBUGS-65901. Will retitle bug to link to clone.
/retitle OCPBUGS-65901: Retry incomplete cluster registration in ABI

Details

In response to this:

/jira cherrypick OCPBUGS-56913

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot changed the title Retry incomplete cluster registration in ABI OCPBUGS-65901: Retry incomplete cluster registration in ABI Nov 23, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 23, 2025
@openshift-ci-robot
Copy link

@zaneb: This pull request references Jira Issue OCPBUGS-65901, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected dependent Jira Issue OCPBUGS-56913 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Fix a problem where a partial failure of creating the Cluster object in the agent-based installer client could result in an inconsistent cluster config.

Because the client exits on failure and relies on systemd to restart it, it effectively operates like a distributed system. Since the cluster creation has 3 steps - creating the Cluster object, applying the install-config overrides, and adding each additional manifest - we must retry idempotently if any of these steps fail. This was not happening previously: any failure after the first step would result in no retries, as the new instance of the client would see that the Cluster exists and not continue with the other operations. This could result in us progressing to install a cluster with only part of the configuration supplied by the user applied.

This change fixes that so that we always either eventually apply the full config as provided or never progress.

List all the issues related to this PR

  • OCPBUGS-56913

  • New Feature

  • Enhancement

  • Bug fix

  • Tests

  • Documentation

  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

/jira cherrypick OCPBUGS-56913

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zaneb
Copy link
Member Author

zaneb commented Nov 23, 2025

/cherry-pick release-4.19

@openshift-cherrypick-robot

@zaneb: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zaneb
Copy link
Member Author

zaneb commented Nov 24, 2025

/retest

@gamli75
Copy link
Contributor

gamli75 commented Nov 24, 2025

/retest-required

@openshift-ci
Copy link

openshift-ci bot commented Nov 24, 2025

@zaneb: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bfournie
Copy link
Contributor

bfournie commented Dec 2, 2025

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Dec 2, 2025
@zaneb
Copy link
Member Author

zaneb commented Dec 8, 2025

/jira refresh

@openshift-ci-robot
Copy link

@zaneb: This pull request references Jira Issue OCPBUGS-65901, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected dependent Jira Issue OCPBUGS-56913 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zaneb
Copy link
Member Author

zaneb commented Dec 15, 2025

/jira refresh

@openshift-ci-robot
Copy link

@zaneb: This pull request references Jira Issue OCPBUGS-65901, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zaneb
Copy link
Member Author

zaneb commented Dec 15, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 15, 2025
@openshift-ci-robot
Copy link

@zaneb: This pull request references Jira Issue OCPBUGS-65901, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.z) matches configured target version for branch (4.20.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-56913 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-56913 targets the "4.21.0" version, which is one of the valid target versions: 4.21.0
  • bug has dependents

Requesting review from QA contact:
/cc @mhanss

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from mhanss December 15, 2025 23:14
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2025
@openshift-ci
Copy link

openshift-ci bot commented Dec 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gamli75, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2025
@zaneb
Copy link
Member Author

zaneb commented Dec 16, 2025

/cc @zniu1011

@openshift-ci openshift-ci bot requested a review from zniu1011 December 16, 2025 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants