Fix OpenStackVersion minor update workflow hanging on OVN dataplane check #1823
stuggi wants to merge 2 commits into openstack-k8s-operators:main
…heck

During a minor update, the OpenStackVersion controller was getting stuck with incorrect condition states even after deployments completed:

- MinorUpdateOVNDataplane: "in progress" (stuck)
- MinorUpdateControlplane: "not started" (never executed)

Root Cause:
The DataplaneNodesetsOVNControllerImagesMatch function checked nodeset.IsReady(), which failed when subsequent deployments (e.g., edpm-update) started running. This caused the function to return false even though the OVN update deployment (edpm-ovn-update) had already completed successfully.

The nodeset's overall Ready status was False because edpm-update was running, blocking the minor update workflow from progressing to the next steps (RabbitMQ, MariaDB, controlplane services, etc.).

Solution:
Remove the nodeset.IsReady() check from DataplaneNodesetsOVNControllerImagesMatch. The nodeset's Status.ContainerImages["OvnControllerImage"] is only updated when a deployment completes successfully (openstackdataplanenodeset_controller.go:598-600). Therefore, if the OVN image matches the target version, the OVN update deployment has already completed, regardless of the nodeset's overall Ready status.

Why we can't check deployment-specific conditions:
The nodeset stores deployment conditions in Status.DeploymentStatuses map, keyed by deployment name (e.g., "edpm-ovn-update"). However, deployment names are dynamic and not known at this point in the code, making it impossible to check specific deployment conditions directly.

Note: The final DataplaneNodesetsDeployed check still uses nodeset.IsReady() because it validates the completion of the entire minor update workflow, where we do want to ensure the nodeset is fully ready.

Jira: OSPRH-25860

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
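The effect of dropping the readiness check can be illustrated with a minimal Go sketch. It is not the operator's actual code: `NodeSet`, `ovnImagesMatchOld`, `ovnImagesMatchNew`, and the image string `ovn:target` are hypothetical, simplified stand-ins, and the sketch assumes every nodeset records an OVN controller image.

```go
package main

import "fmt"

// ovnImageKey is the Status.ContainerImages key named in the commit message.
const ovnImageKey = "OvnControllerImage"

// NodeSet is a hypothetical, simplified stand-in for the dataplane nodeset;
// it is NOT the real OpenStackDataPlaneNodeSet type.
type NodeSet struct {
	Ready           bool              // overall Ready condition of the nodeset
	ContainerImages map[string]string // images recorded after a successful deployment
}

// IsReady mimics nodeset.IsReady() for the purposes of this sketch.
func (n NodeSet) IsReady() bool { return n.Ready }

// ovnImagesMatchOld models the previous behaviour: any nodeset that is not
// Ready fails the check, even if its OVN image already matches the target.
func ovnImagesMatchOld(nodesets []NodeSet, target string) bool {
	for _, ns := range nodesets {
		if !ns.IsReady() {
			return false
		}
		if ns.ContainerImages[ovnImageKey] != target {
			return false
		}
	}
	return true
}

// ovnImagesMatchNew models the fix: only the recorded OVN image is compared,
// because it is only written once a deployment has completed successfully.
func ovnImagesMatchNew(nodesets []NodeSet, target string) bool {
	for _, ns := range nodesets {
		if ns.ContainerImages[ovnImageKey] != target {
			return false
		}
	}
	return true
}

func main() {
	// edpm-ovn-update has finished (image recorded) while edpm-update is
	// still running, so the nodeset's overall Ready status is false.
	nodesets := []NodeSet{{
		Ready:           false,
		ContainerImages: map[string]string{ovnImageKey: "ovn:target"},
	}}
	fmt.Println("old check:", ovnImagesMatchOld(nodesets, "ovn:target")) // old check: false
	fmt.Println("new check:", ovnImagesMatchNew(nodesets, "ovn:target")) // new check: true
}
```

In this model the old check flips back to false as soon as a later deployment makes the nodeset not Ready, whereas the new check depends only on the recorded image.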
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stuggi
abays left a comment
+1 from me, but will defer to EDPM team for final approval
For reference, this was seen in https://softwarefactory-project.io/zuul/t/rdoproject.org/build/114bae0a80b24333897877431db93b5e, where the OpenStackVersion conditions are shown in https://logserver.rdoproject.org/114/rdoproject.org/114bae0a80b24333897877431db93b5e/ci-framework-data/logs/openstack-must-gather/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-1663223401da21aaf5f4d8d59101f0d65b9026b7fe7df6e028cefcd367dd2d15/namespaces/openstack/crs/openstackversions.core.openstack.org/controlplane.yaml, while the EDPM OVN and control plane updates had finished and the general edpm-update deployment was stuck because of:
@rabi if you have some time, could you please review this? Also, should/can we run the openstack-baremetal-operator-edpm-baremetal-minor-update job on the openstack-operator?
Yeah, we can add the update job in openstack-operator (it's now in the edpm-ansible and openstack-baremetal-operator repos). However, I'm wondering whether this is hiding some other issue. We mark the nodeset ready after the ovn-update deployment completes and don't run ovn-update until OpenStackVersion is updated. Is further reconciliation of OpenStackVersion causing this issue?
The issue with the current code in [1] is that when the later deployment runs to update the remaining services on the EDPM node and the nodeset is no longer ready, the OVN update condition also returns to not ready because of [1], and so reflects a wrong state. I'll test adding the job as part of this PR to see it running.
Did something change recently? We've had the update job running and working fine for a long time: https://logserver.rdoproject.org/2d2/rdoproject.org/2d2e39ca37534070a299ee91193de620/
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/05f00347ab5248479af3e12d456fb725 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 53m 49s
Not that I am aware of. This PR does not fix a functional bug: once the edpm-update deployment finishes, the conditions reflect the state correctly. It's just that while edpm-update is running, the conditions do not reflect that the earlier EDPM OVN update deployment has already completed.
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/08f217cc8de14aada95df0a76f569521 ✔️ openstack-k8s-operators-content-provider SUCCESS in 29m 08s
recheck
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a09cacd938f64805b3ba1e14d6ecbe21 ✔️ openstack-k8s-operators-content-provider SUCCESS in 28m 32s
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
12c5ce1 to 88e83d0
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1efc4f32f27c44e0903d1073c2a6d2b1 ✔️ openstack-k8s-operators-content-provider SUCCESS in 28m 06s
@stuggi: The following test failed, say