Fix OpenStackVersion minor update workflow hanging on OVN dataplane check #1823

Open
stuggi wants to merge 2 commits into openstack-k8s-operators:main from stuggi:fix_minor_update

Conversation

@stuggi

@stuggi stuggi commented Feb 24, 2026

During a minor update, the OpenStackVersion controller was getting stuck with incorrect condition states even after deployments completed:

  • MinorUpdateOVNDataplane: "in progress" (stuck)
  • MinorUpdateControlplane: "not started" (never executed)

Root Cause:
The DataplaneNodesetsOVNControllerImagesMatch function checked nodeset.IsReady(), which failed when subsequent deployments (e.g., edpm-update) started running. This caused the function to return false even though the OVN update deployment (edpm-ovn-update) had already completed successfully.

The nodeset's overall Ready status was False because edpm-update was running, blocking the minor update workflow from progressing to the next steps (RabbitMQ, MariaDB, controlplane services, etc.).
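The blocking behaviour described above can be illustrated with a minimal sketch. It assumes the workflow evaluates its steps in order and stops at the first one whose check fails; the `step` type, the `firstBlocked` helper, and the shortened step list are hypothetical and are not taken from the operator code.

```go
package main

import "fmt"

// step is a hypothetical workflow stage paired with its completion check.
type step struct {
	name string
	done func() bool
}

// firstBlocked returns the name of the first step whose check fails.
// Steps after it are never started. An empty string means all steps passed.
func firstBlocked(steps []step) string {
	for _, s := range steps {
		if !s.done() {
			return s.name
		}
	}
	return ""
}

func main() {
	nodesetReady := false   // edpm-update is still running
	ovnImageMatches := true // edpm-ovn-update already finished

	steps := []step{
		// The old check combined both signals, so it fails here.
		{"MinorUpdateOVNDataplane", func() bool { return nodesetReady && ovnImageMatches }},
		{"MinorUpdateRabbitMQ", func() bool { return true }},
		{"MinorUpdateControlplane", func() bool { return true }},
	}
	fmt.Println(firstBlocked(steps))
}
```

With these inputs the OVN dataplane step is reported as the blocker even though its own deployment is done, which matches the "in progress" / "not started" pair of conditions listed earlier.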

Solution:
Remove the nodeset.IsReady() check from DataplaneNodesetsOVNControllerImagesMatch. The nodeset's Status.ContainerImages["OvnControllerImage"] is only updated when a deployment completes successfully (openstackdataplanenodeset_controller.go:598-600). Therefore, if the OVN image matches the target version, the OVN update deployment has already completed, regardless of the nodeset's overall Ready status.
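As a rough sketch of the idea (not the actual diff), the check after this change only compares the recorded OVN controller image with the target. The `NodeSet` struct and the `ovnImagesMatch` function below are simplified, hypothetical stand-ins for the real nodeset type and for DataplaneNodesetsOVNControllerImagesMatch; the image names are made up.

```go
package main

import "fmt"

// NodeSet is a hypothetical, trimmed-down stand-in for the dataplane nodeset.
// Only the fields needed for the illustration are kept.
type NodeSet struct {
	Name            string
	Ready           bool              // overall Ready condition of the nodeset
	ContainerImages map[string]string // mirrors Status.ContainerImages
}

// ovnImagesMatch reports whether every nodeset already records the target
// OVN controller image. It intentionally does not look at the Ready field.
func ovnImagesMatch(nodesets []NodeSet, target string) bool {
	for _, ns := range nodesets {
		if ns.ContainerImages["OvnControllerImage"] != target {
			return false
		}
	}
	return true
}

func main() {
	nodesets := []NodeSet{{
		Name:  "edpm-compute",
		Ready: false, // a later deployment (edpm-update) is still running
		ContainerImages: map[string]string{
			"OvnControllerImage": "example.test/ovn-controller:v2",
		},
	}}
	fmt.Println(ovnImagesMatch(nodesets, "example.test/ovn-controller:v2"))
}
```

The sketch relies on the property stated above: the recorded image only changes after a deployment completes successfully, so a match implies the OVN update has finished.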

Why we can't check deployment-specific conditions: The nodeset stores deployment conditions in the Status.DeploymentStatuses map, keyed by deployment name (e.g., "edpm-ovn-update"). However, deployment names are dynamic and not known at this point in the code, so the conditions of the specific OVN update deployment cannot be looked up directly.
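To illustrate the lookup problem, here is a small sketch that assumes a map keyed by deployment name. The map value is simplified to a bool (the real map holds conditions), and `deploymentReady` and the deployment names are hypothetical.

```go
package main

import "fmt"

// deploymentReady looks up a deployment by name in a status map.
// It returns the recorded readiness and whether the name was present at all.
func deploymentReady(statuses map[string]bool, name string) (ready, found bool) {
	ready, found = statuses[name]
	return ready, found
}

func main() {
	// The user chose their own name for the OVN update deployment.
	statuses := map[string]bool{
		"my-ovn-rollout": true,
		"edpm-update":    false,
	}

	// Guessing a conventional name finds nothing, so the caller cannot
	// tell whether the OVN update ran under some other name.
	_, found := deploymentReady(statuses, "edpm-ovn-update")
	fmt.Println(found)
}
```

Because the key is not known to the caller, comparing the recorded image is the only signal that does not depend on a naming convention.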

Note: The final DataplaneNodesetsDeployed check still uses nodeset.IsReady() because it validates the completion of the entire minor update workflow, where we do want to ensure the nodeset is fully ready.

Jira: OSPRH-25860


Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
@openshift-ci openshift-ci bot requested review from dprince and rabi February 24, 2026 11:16
@openshift-ci

openshift-ci bot commented Feb 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stuggi


@stuggi stuggi requested review from abays and rabi and removed request for rabi February 24, 2026 11:16

@abays abays left a comment


+1 from me, but will defer to EDPM team for final approval

@stuggi

stuggi commented Feb 24, 2026

For reference, this was seen in https://softwarefactory-project.io/zuul/t/rdoproject.org/build/114bae0a80b24333897877431db93b5e, where the OpenStackVersion conditions (from https://logserver.rdoproject.org/114/rdoproject.org/114bae0a80b24333897877431db93b5e/ci-framework-data/logs/openstack-must-gather/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-1663223401da21aaf5f4d8d59101f0d65b9026b7fe7df6e028cefcd367dd2d15/namespaces/openstack/crs/openstackversions.core.openstack.org/controlplane.yaml) show:

  conditions:
  - lastTransitionTime: '2026-02-23T15:54:28Z'
    message: in progress
    reason: Requested
    severity: Info
    status: 'False'
    type: Ready
  - lastTransitionTime: '2026-02-23T14:59:51Z'
    message: completed
    reason: Ready
    status: 'True'
    type: Initialized
  - lastTransitionTime: '2026-02-23T16:03:50Z'
    message: not started
    reason: Init
    status: Unknown
    type: MinorUpdateControlplane
  - lastTransitionTime: '2026-02-23T16:03:50Z'
    message: not started
    reason: Init
    status: Unknown
    type: MinorUpdateDataplane
  - lastTransitionTime: '2026-02-23T16:03:50Z'
    message: not started
    reason: Init
    status: Unknown
    type: MinorUpdateKeystone
  - lastTransitionTime: '2026-02-23T16:03:50Z'
    message: not started
    reason: Init
    status: Unknown
    type: MinorUpdateMariaDB
  - lastTransitionTime: '2026-02-23T16:03:50Z'
    message: not started
    reason: Init
    status: Unknown
    type: MinorUpdateMemcached
  - lastTransitionTime: '2026-02-23T16:02:23Z'
    message: completed
    reason: Ready
    status: 'True'
    type: MinorUpdateOVNControlplane
  - lastTransitionTime: '2026-02-23T16:03:50Z'
    message: in progress
    reason: Requested
    severity: Info
    status: 'False'
    type: MinorUpdateOVNDataplane
  - lastTransitionTime: '2026-02-23T16:03:50Z'
    message: not started
    reason: Init
    status: Unknown
    type: MinorUpdateRabbitMQ

This is while the edpm OVN and ctlplane updates had finished, and the general edpm-update deployment is stuck because of:
https://logserver.rdoproject.org/114/rdoproject.org/114bae0a80b24333897877431db93b5e/ci-framework-data/logs/openstack-must-gather/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-1663223401da21aaf5f4d8d59101f0d65b9026b7fe7df6e028cefcd367dd2d15/namespaces/openstack/crs/openstackdataplanedeployments.dataplane.openstack.org/edpm-update.yaml

Error: unable to copy from source docker://quay.io/podified-antelope-centos9/openstack-nova-compute:current-podified: writing blob: adding layer with blob "sha256:61b29e0d1cfb6b8761e5f1d50a30ba2bb59179d0aa065cf747773fd690df34dd"/""/"sha256:1f1e90f8b2058c74071fe0298f6d20f4d1edbde3bdd940d26fcd35c036f677a8": unpacking failed (error: exit status 1; output: write /var/lib/rpm/rpmdb.sqlite: no space left on device)

@stuggi

stuggi commented Feb 24, 2026

@rabi if you have some time, could you please review this. Also should/can we run the openstack-baremetal-operator-edpm-baremetal-minor-update on the openstack-operator?

@rabi

rabi commented Feb 24, 2026

@rabi if you have some time, could you please review this. Also should/can we run the openstack-baremetal-operator-edpm-baremetal-minor-update on the openstack-operator?

Yeah, we can add the update job in openstack-operator (it's now in the edpm-ansible and openstack-baremetal-operator repos). However, I'm wondering whether this is hiding some other issue. We mark the nodeset ready after the ovn-update deployment completes and don't run ovn-update until OpenStackVersion is updated. Is further reconciliation of OpenStackVersion causing this issue?

@stuggi

stuggi commented Feb 24, 2026

the issue with the current code in [1] is that when the later deployment runs to update the remaining services on the EDPM node, the nodeset is no longer ready. Because of the check in [1], the OVN update condition then also returns to not ready and reflects a wrong state.

I'll test adding the job as part of this PR to see it running.

[1] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/internal/openstack/dataplane.go#L48

@rabi

rabi commented Feb 24, 2026

the issue with the current code in [1]

Did something change recently? We've had the update job running and working fine for a long time: https://logserver.rdoproject.org/2d2/rdoproject.org/2d2e39ca37534070a299ee91193de620/

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/05f00347ab5248479af3e12d456fb725

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 53m 49s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 11m 56s
cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 14m 41s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 40m 22s

@stuggi

stuggi commented Feb 24, 2026

the issue with the current code in [1]

Did something change recently? We've had the update job running and working fine for a long time: https://logserver.rdoproject.org/2d2/rdoproject.org/2d2e39ca37534070a299ee91193de620/

Not that I am aware of. What this PR fixes is not a functional bug. If the edpm-update deployment finishes, the conditions reflect that correctly. It's just that while edpm-update is running, the conditions do not reflect that the earlier edpm OVN update deployment has already completed.

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/08f217cc8de14aada95df0a76f569521

✔️ openstack-k8s-operators-content-provider SUCCESS in 29m 08s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 11m 17s
cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 14m 17s
openstack-operator-tempest-multinode RETRY_LIMIT in 12m 58s
openstack-operator-edpm-baremetal-minor-update RETRY_LIMIT in 14m 42s

@stuggi

stuggi commented Feb 24, 2026

recheck

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a09cacd938f64805b3ba1e14d6ecbe21

✔️ openstack-k8s-operators-content-provider SUCCESS in 28m 32s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 11m 32s
cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 14m 28s
openstack-operator-tempest-multinode RETRY_LIMIT in 12m 50s
openstack-operator-edpm-baremetal-minor-update RETRY_LIMIT in 14m 12s

Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1efc4f32f27c44e0903d1073c2a6d2b1

✔️ openstack-k8s-operators-content-provider SUCCESS in 28m 06s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 12m 27s
cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 14m 22s
openstack-operator-tempest-multinode RETRY_LIMIT in 13m 02s
openstack-operator-edpm-baremetal-minor-update RETRY_LIMIT in 14m 33s

@openshift-ci

openshift-ci bot commented Feb 24, 2026

@stuggi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/openstack-operator-build-deploy-kuttl-4-18 | 88e83d0 | link | true | /test openstack-operator-build-deploy-kuttl-4-18 |

