
OTA-1864: Extend update recommend precheck alerts#2210

Open
wking wants to merge 5 commits into openshift:main from wking:extend-update-recommend-precheck-alerts

Conversation

@wking
Member

@wking wking commented Feb 23, 2026

Details in the commit messages :).

Summary by CodeRabbit

  • New Features
    • Upgrade recommendations now report alerts per-alert and by category for clearer guidance.
    • Update-precheck alerts are surfaced explicitly and a global “no update-precheck issues” condition is shown when none are found.
    • Recommendations include additional alert types (KubeletHealthState, ODF operator, and virtualization-related alerts) for more comprehensive guidance.

This is a warning alert [1], and the machine-config operator team
says:

  An unhealthy kubelet could indicate that the node is unresponsive,
  etc. and could thus indirectly cause upgrades to stall on that node.

Similar to the other KubeNode / kubelet alerts.

[1]: https://github.com/openshift/machine-config-operator/blob/e8fd02f68658c3a7a65126740aec4775491ddb6c/install/0000_90_machine-config_01_prometheus-rules.yaml#L95-L100
It's a warning alert [1], so it won't be picked up by our generic
'critical' check.  The OpenShift Data Foundation team mentioned this
alert as possibly being worth surfacing to cluster-admins considering
a cluster-version update.  Although the alert's runbook open with [2]:

  This indicates that there are conditions preventing a safe upgrade to the next ODF version.

and ends with:

  [OLM Operator Conditions Documentation]
  (https://olm.operatorframework.io/docs/advanced-tasks/communicating-operator-conditions-to-olm/)

and those OLM docs say:

  The Upgradeable “OLM Supported Condition” prevents the existing CSV
  from being replaced by a newer version of the CSV.

I'm not sure how relevant that is to ClusterVersion updates, but I'm
fine just passing these along for now, and not trying to second-guess
the component operator teams.

The ODF team also mentioned these alerts, but they're all critical, so
they'll be surfaced by the existing, generic 'critical' check:

* CephClusterErrorState [3]
* CephMgrIsAbsent [4]
* CephClusterCriticallyFull [5]
* CephClusterReadOnly [6]
* CephMonQuorumLost [7]

[1]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/metrics/deploy/prometheus-ocs-rules-external.yaml#L174-L184
[2]: https://github.com/openshift/runbooks/blob/3c559fb0dcec0e6182a10c570d578ec99abe14b5/alerts/openshift-container-storage-operator/ODFOperatorNotUpgradeable.md#meaning
[3]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L270-L281
[4]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L38-L49
[5]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L332-L343
[6]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L344-L355
[7]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L90-L101
…utFailing

It's a warning alert [1], so it won't be picked up by our generic
'critical' check.  The Container-native Virtualization team (CNV)
mentioned this alert as possibly being worth surfacing to
cluster-admins considering a cluster-version update.  The runbook
doesn't have much advice other than manually deleting Pods to free up
resources [2], but I'm fine just passing these along for now, and not
trying to second-guess the component operator teams.

[1]: https://github.com/kubevirt/kubevirt/blob/e36f1c7568e813a21fa1e627d9590a4ee12d654f/pkg/monitoring/rules/alerts/virt-handler.go#L32-L42
[2]: https://github.com/openshift/runbooks/blob/3c559fb0dcec0e6182a10c570d578ec99abe14b5/alerts/openshift-virtualization-operator/VirtHandlerDaemonSetRolloutFailing.md
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 23, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 23, 2026

@wking: This pull request references OTA-1864 which is a valid jira issue.

Details

In response to this:

Details in the commit messages :).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Feb 23, 2026

Walkthrough

Adds UpdatePrecheck alert handling and a new AlertLabels.UpdatePrecheck field; emits per-alert and global UpdatePrecheck conditions, adds specialized condition categories for ODF and virtualization alerts, extends node-related detection to include KubeletHealthState, and embeds alert names/indices in condition Type strings.

Changes

Cohort / File(s) Summary
Alert recommendation logic
pkg/cli/admin/upgrade/recommend/alerts.go
Adds detection for alerts carrying the openShiftUpdatePrecheck label via the new UpdatePrecheck field, emits per-alert recommended/UpdatePrecheckAlerts/<alertName>/<index> conditions and a global recommended/UpdatePrecheckAlerts condition when none are found; treats KubeletHealthState as node-related, emitting recommended/NodeAlerts/<alertName>/<index>; special-cases ODFOperatorNotUpgradeable → recommended/ODFAlerts/..., and VirtHandlerDaemonSetRolloutFailing and VMCannotBeEvicted → recommended/VirtAlerts/...; uses an early continue after special handling and includes the alert name and index in condition Type strings.
Alert label schema
pkg/cli/admin/upgrade/status/alerts.go
Adds the exported field `UpdatePrecheck string` (JSON tag `openShiftUpdatePrecheck,omitempty`) to `AlertLabels` to carry the OpenShift update precheck flag in alert metadata.
Examples / outputs
pkg/cli/admin/upgrade/recommend/examples/*
Updated example outputs to include recommended/UpdatePrecheckAlerts (AsExpected) in lists of recommended/no-cause-for-concern conditions across multiple example files.
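As a rough sketch of the `AlertLabels` change described above: the new field only needs the `openShiftUpdatePrecheck` JSON tag so the label survives decoding. Everything here besides that field and its tag (the surrounding fields and the `parseLabels` helper) is illustrative, not the PR's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AlertLabels sketches the label schema touched in
// pkg/cli/admin/upgrade/status/alerts.go; only UpdatePrecheck and its
// JSON tag come from the PR summary, the other fields are illustrative.
type AlertLabels struct {
	AlertName      string `json:"alertname,omitempty"`
	Severity       string `json:"severity,omitempty"`
	UpdatePrecheck string `json:"openShiftUpdatePrecheck,omitempty"`
}

// parseLabels decodes a Prometheus alert's label set
// (the helper name is hypothetical).
func parseLabels(raw []byte) (AlertLabels, error) {
	var labels AlertLabels
	err := json.Unmarshal(raw, &labels)
	return labels, err
}

func main() {
	labels, err := parseLabels([]byte(`{"alertname":"TestAlert","severity":"warning","openShiftUpdatePrecheck":"true"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(labels.UpdatePrecheck) // prints "true"
}
```

With `omitempty`, alerts that never set the label simply decode to an empty `UpdatePrecheck`, so the `== "true"` check elsewhere stays a plain string comparison.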

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title clearly and specifically describes the main change: extending update recommend precheck alerts functionality.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Stable And Deterministic Test Names — ✅ Passed: No test files with unstable or non-deterministic test names found in the changed files.
  • Test Structure And Quality — ✅ Passed: PR code changes to alerts handling logic are covered by fixture-based tests with single responsibility, proper setup/cleanup, context handling, meaningful assertions, and consistent patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 23, 2026

@wking: This pull request references OTA-1864 which is a valid jira issue.

Details

In response to this:

Details in the commit messages :).

Summary by CodeRabbit

Release Notes

  • New Features
  • Enhanced upgrade recommendations with improved alert categorization and pre-upgrade check tracking.
  • Added support for additional alert types in the upgrade process, including KubeletHealthState, ODF Operator, and virtualization-related alerts.
  • Pre-upgrade check alerts now generate specific condition reports to better communicate upgrade readiness status.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@wking
Member Author

wking commented Feb 23, 2026

I also tested the label handling with this rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: testing
  namespace: openshift-cluster-version
  annotations:
    kubernetes.io/description: Alerting rules for testing.
spec:
  groups:
  - name: test
    rules:
    - alert: TestAlert
      annotations:
        summary: Test summary.
        description: Test description.
      expr: |
        up{job="cluster-version-operator"} == 1
      labels:
        severity: warning
        namespace: openshift-cluster-version
        openShiftUpdatePrecheck: "true"

And it worked:

$ ./oc adm upgrade recommend
The following conditions found no cause for concern in updating this cluster to later releases: recommended/NodeAlerts (AsExpected)

The following conditions found cause for concern in updating this cluster to later releases: ... recommended/UpdatePrecheckAlerts/TestAlert/0

recommended/UpdatePrecheckAlerts/TestAlert/0=False:

  Reason: Alert:firing
  Message: warning alert TestAlert firing, suggesting issues worth investigating before updating the cluster. Test summary. The alert description is: Test description. <alert does not have a runbook_url annotation>
...
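For reference, the condition Type strings in the transcript above follow a category/alert-name/index pattern. A minimal sketch of that formatting (the helper name is hypothetical; the format matches the output shown):

```go
package main

import "fmt"

// conditionType reconstructs the shape of the per-alert condition Type
// strings seen in the transcript: a category such as UpdatePrecheckAlerts,
// the alert name, and a firing-instance index.
func conditionType(category, alertName string, index int) string {
	return fmt.Sprintf("recommended/%s/%s/%d", category, alertName, index)
}

func main() {
	fmt.Println(conditionType("UpdatePrecheckAlerts", "TestAlert", 0))
	// prints "recommended/UpdatePrecheckAlerts/TestAlert/0"
}
```

The trailing index disambiguates multiple simultaneously firing instances of the same alert, which is why the single TestAlert instance above shows up as `/TestAlert/0`.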


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)

192-218: Consider tailoring messages for ODF/Virt alerts.

The current text (“may slow workload redistribution during rolling node updates”) doesn’t match the semantics of ODFOperatorNotUpgradeable or VirtHandlerDaemonSetRolloutFailing. A more specific message would be clearer for operators.

✏️ Example message tweaks
-					Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+					Message: fmt.Sprintf("%s alert %s %s, which may block ODF upgrades and stall cluster updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
...
-					Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+					Message: fmt.Sprintf("%s alert %s %s, indicating a daemonset rollout failure that may stall updates until resources are freed. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 218, The two
special-case blocks for alertName == "ODFOperatorNotUpgradeable" and
"VirtHandlerDaemonSetRolloutFailing" should use alert-specific Message text
instead of the generic "may slow workload redistribution during rolling node
updates"; update the metav1.Condition.Message for the ODFOperatorNotUpgradeable
branch to mention the ODF operator being not upgradeable and potential storage
or cluster upgrade impacts (e.g., "ODF operator is not upgradeable; this may
block storage upgrades or compromise data redundancy: %s"), and update the
VirtHandlerDaemonSetRolloutFailing branch to reference VM migration/VM runtime
availability or failed virt-handler rollout impact (e.g., "virt-handler
DaemonSet rollout failing; this may prevent live migration or affect VM
workloads: %s"); keep the rest of the condition construction (Type, Status,
Reason, acceptanceName) unchanged and preserve the details variable formatting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Around line 70-71: The loop sets haveUpdatePrecheck too late, after several
early-continue branches, so labeled alerts can be skipped and trigger the false
“No alerts with the openShiftUpdatePrecheck label…” message; move the detection
of the openShiftUpdatePrecheck label to the top of the alert-processing logic
(before any continue), set haveUpdatePrecheck = true as soon as the label is
found, and make it additive (just mark the flag and continue normal
classification) rather than mutually exclusive with other categories; apply the
same change for the other occurrences noted (around the blocks at the other
ranges) so the UpdatePrecheck flag is always captured before early-continues.

---

Nitpick comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Around line 192-218: The two special-case blocks for alertName ==
"ODFOperatorNotUpgradeable" and "VirtHandlerDaemonSetRolloutFailing" should use
alert-specific Message text instead of the generic "may slow workload
redistribution during rolling node updates"; update the metav1.Condition.Message
for the ODFOperatorNotUpgradeable branch to mention the ODF operator being not
upgradeable and potential storage or cluster upgrade impacts (e.g., "ODF
operator is not upgradeable; this may block storage upgrades or compromise data
redundancy: %s"), and update the VirtHandlerDaemonSetRolloutFailing branch to
reference VM migration/VM runtime availability or failed virt-handler rollout
impact (e.g., "virt-handler DaemonSet rollout failing; this may prevent live
migration or affect VM workloads: %s"); keep the rest of the condition
construction (Type, Status, Reason, acceptanceName) unchanged and preserve the
details variable formatting.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3638445 and 2b33655.

📒 Files selected for processing (2)
  • pkg/cli/admin/upgrade/recommend/alerts.go
  • pkg/cli/admin/upgrade/status/alerts.go

@openshift-ci-robot

openshift-ci-robot commented Feb 23, 2026

@wking: This pull request references OTA-1864 which is a valid jira issue.

Details

In response to this:

Details in the commit messages :).

Summary by CodeRabbit

  • New Features
  • Enhanced upgrade recommendations with improved alert categorization and per-alert reporting.
  • Adds explicit handling and global reporting for update precheck alerts to surface pre-upgrade readiness.
  • Adds support for additional alert types in recommendations, including KubeletHealthState, ODF operator alerts, and virtualization-related alerts.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)

177-190: haveUpdatePrecheck is set too late — after early-continue branches that may skip it.

An alert carrying openShiftUpdatePrecheck: "true" that also matches an earlier branch (e.g. severity == "critical", or one of the named node/PDB alerts) will hit continue before reaching line 177, so haveUpdatePrecheck is never set for that alert. If all UpdatePrecheck-labeled alerts happen to also be critical (or match another earlier branch), the code will emit a false-positive "No alerts with the openShiftUpdatePrecheck label true are firing" condition at line 257.

Capture the label at the top of the loop iteration (right after the alertName/pending guards), and let the rest of the classification proceed independently:

Proposed fix
 		if alert.State == "pending" {
 			continue
 		}
+
+		if alert.Labels.UpdatePrecheck == "true" {
+			haveUpdatePrecheck = true
+		}
+
 		var summary string
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 177 - 190, The loop
currently sets haveUpdatePrecheck only inside the UpdatePrecheck branch which is
skipped by earlier continue paths, causing missed markings; move the detection
of the openShiftUpdatePrecheck label to the start of each iteration (immediately
after the alertName/pending guards) by reading alert.Labels.UpdatePrecheck into
a local variable or directly setting haveUpdatePrecheck when that label equals
"true" before any classification/continue logic, then let the existing
classification branches (the critical/node/PDB checks and the UpdatePrecheck
branch that builds acceptableCondition) run unchanged so the flag correctly
reflects any alert carrying the label regardless of other matches.
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)

206-218: Consider more specific messages for the virt alerts.

VMCannotBeEvicted relates to VMs blocking node drains, so the "slow workload redistribution" wording is roughly applicable. However, VirtHandlerDaemonSetRolloutFailing is about the virt-handler DaemonSet itself failing to roll out — a different failure mode. A more precise message for each (or at least for the DaemonSet alert) would reduce user confusion:

Suggested split
-	if alertName == "VirtHandlerDaemonSetRolloutFailing" || alertName == "VMCannotBeEvicted" {
+	if alertName == "VMCannotBeEvicted" {
 		conditions = append(conditions, acceptableCondition{
 			Condition: metav1.Condition{
 				Type:    fmt.Sprintf("recommended/VirtAlerts/%s/%d", alertName, i),
 				Status:  metav1.ConditionFalse,
 				Reason:  fmt.Sprintf("Alert:%s", alert.State),
-				Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+				Message: fmt.Sprintf("%s alert %s %s, suggesting virtual machines that cannot be evicted may block node drains during updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+			},
+			acceptanceName: alertName,
+		})
+		i += 1
+		continue
+	}
+
+	if alertName == "VirtHandlerDaemonSetRolloutFailing" {
+		conditions = append(conditions, acceptableCondition{
+			Condition: metav1.Condition{
+				Type:    fmt.Sprintf("recommended/VirtAlerts/%s/%d", alertName, i),
+				Status:  metav1.ConditionFalse,
+				Reason:  fmt.Sprintf("Alert:%s", alert.State),
+				Message: fmt.Sprintf("%s alert %s %s, suggesting the virt-handler DaemonSet is failing to roll out. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
 			},
 			acceptanceName: alertName,
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 206 - 218, The
current branch lumps two distinct alerts under the same message in the
acceptableCondition created in the loop (see alertName,
VirtHandlerDaemonSetRolloutFailing, VMCannotBeEvicted and the appended
acceptableCondition/Condition). Update the logic so each alert has a more
specific Message: keep the VMCannotBeEvicted wording about VMs blocking drains
and create a separate, clearer message for VirtHandlerDaemonSetRolloutFailing
that mentions the virt-handler DaemonSet failing to roll out (impacting
virt-handler pod updates and potential cluster VM management) while still
including alert.Labels.Severity, alert.State and details; adjust the branch that
builds metav1.Condition and acceptanceName accordingly so the right message is
used for each alert name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Around line 192-204: The message for the ODFOperatorNotUpgradeable alert is
misleading (it references node redistribution/drain scenarios); update the
message constructed in the branch handling alertName ==
"ODFOperatorNotUpgradeable" so the metav1.Condition.Message (inside the
acceptableCondition appended) describes the operator not being ready for upgrade
(e.g., operator readiness, upgrade preconditions, potential impact on storage
availability) instead of "which may slow workload redistribution during rolling
node updates"; locate the branch that appends acceptableCondition for alertName
"ODFOperatorNotUpgradeable" in alerts.go and replace the Message string built
with fmt.Sprintf(...) to a concise ODF-operator-specific message while keeping
the rest of the Condition fields and acceptanceName unchanged.

---

Duplicate comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Around line 177-190: The loop currently sets haveUpdatePrecheck only inside
the UpdatePrecheck branch which is skipped by earlier continue paths, causing
missed markings; move the detection of the openShiftUpdatePrecheck label to the
start of each iteration (immediately after the alertName/pending guards) by
reading alert.Labels.UpdatePrecheck into a local variable or directly setting
haveUpdatePrecheck when that label equals "true" before any
classification/continue logic, then let the existing classification branches
(the critical/node/PDB checks and the UpdatePrecheck branch that builds
acceptableCondition) run unchanged so the flag correctly reflects any alert
carrying the label regardless of other matches.

---

Nitpick comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Around line 206-218: The current branch lumps two distinct alerts under the
same message in the acceptableCondition created in the loop (see alertName,
VirtHandlerDaemonSetRolloutFailing, VMCannotBeEvicted and the appended
acceptableCondition/Condition). Update the logic so each alert has a more
specific Message: keep the VMCannotBeEvicted wording about VMs blocking drains
and create a separate, clearer message for VirtHandlerDaemonSetRolloutFailing
that mentions the virt-handler DaemonSet failing to roll out (impacting
virt-handler pod updates and potential cluster VM management) while still
including alert.Labels.Severity, alert.State and details; adjust the branch that
builds metav1.Condition and acceptanceName accordingly so the right message is
used for each alert name.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2b33655 and 4d1654d.

📒 Files selected for processing (1)
  • pkg/cli/admin/upgrade/recommend/alerts.go

Comment on lines +192 to +204
if alertName == "ODFOperatorNotUpgradeable" {
	conditions = append(conditions, acceptableCondition{
		Condition: metav1.Condition{
			Type:    fmt.Sprintf("recommended/ODFAlerts/%s/%d", alertName, i),
			Status:  metav1.ConditionFalse,
			Reason:  fmt.Sprintf("Alert:%s", alert.State),
			Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
		},
		acceptanceName: alertName,
	})
	i += 1
	continue
}

@coderabbitai coderabbitai bot Feb 23, 2026


⚠️ Potential issue | 🟡 Minor

Misleading message — copy-paste from node alerts doesn't fit ODF.

The message "which may slow workload redistribution during rolling node updates" describes a node-drain scenario. ODFOperatorNotUpgradeable signals that the ODF operator is not ready for an upgrade — a fundamentally different concern. Consider something like:

Suggested message
-				Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+				Message: fmt.Sprintf("%s alert %s %s, suggesting the ODF operator is not ready for a cluster update. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if alertName == "ODFOperatorNotUpgradeable" {
	conditions = append(conditions, acceptableCondition{
		Condition: metav1.Condition{
			Type:    fmt.Sprintf("recommended/ODFAlerts/%s/%d", alertName, i),
			Status:  metav1.ConditionFalse,
			Reason:  fmt.Sprintf("Alert:%s", alert.State),
			Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
		},
		acceptanceName: alertName,
	})
	i += 1
	continue
}

becomes:

if alertName == "ODFOperatorNotUpgradeable" {
	conditions = append(conditions, acceptableCondition{
		Condition: metav1.Condition{
			Type:    fmt.Sprintf("recommended/ODFAlerts/%s/%d", alertName, i),
			Status:  metav1.ConditionFalse,
			Reason:  fmt.Sprintf("Alert:%s", alert.State),
			Message: fmt.Sprintf("%s alert %s %s, suggesting the ODF operator is not ready for a cluster update. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
		},
		acceptanceName: alertName,
	})
	i += 1
	continue
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 204, The
message for the ODFOperatorNotUpgradeable alert is misleading (it references
node redistribution/drain scenarios); update the message constructed in the
branch handling alertName == "ODFOperatorNotUpgradeable" so the
metav1.Condition.Message (inside the acceptableCondition appended) describes the
operator not being ready for upgrade (e.g., operator readiness, upgrade
preconditions, potential impact on storage availability) instead of "which may
slow workload redistribution during rolling node updates"; locate the branch
that appends acceptableCondition for alertName "ODFOperatorNotUpgradeable" in
alerts.go and replace the Message string built with fmt.Sprintf(...) to a
concise ODF-operator-specific message while keeping the rest of the Condition
fields and acceptanceName unchanged.

Member Author


I think the only exposure cluster updates have to ODF is Node-drain speed, but I'm not ODF 🤷


🧩 Analysis chain

🌐 Web query:

ODFOperatorNotUpgradeable alert OpenShift meaning

💡 Result:

ODFOperatorNotUpgradeable is an OpenShift monitoring alert that fires when the OpenShift Data Foundation (ODF) Operator is reporting “Upgradeable=False” (i.e., it considers itself not safe/allowed to upgrade right now). In OpenShift, this “Upgradeable” condition is used to prevent an Operator’s CSV from being replaced by a newer one until the blocking issue is resolved. [1]

What it usually means in practice (common causes):

  • ODF version is incompatible with the OpenShift minor version you’re trying to upgrade to, so OLM/CVO flags it as blocking upgrades (for example, an openshift-storage/odf-operator CSV that is incompatible with a higher OpenShift minor version). [2]
  • Subscription/channel/catalog mismatch during an attempted ODF upgrade (no matching operator found in the selected channel), which can leave the Operator unable to progress to an upgradeable state. [3]

How to confirm what’s blocking (quick checks):

  • Look at the ODF operator/CSV “Upgradeable” condition message in the console (Operators → Installed Operators → ODF) or via oc get csv -n openshift-storage.
  • Also check the cluster’s upgrade blockers (often shows incompatible operators) via oc get co operator-lifecycle-manager -o yaml and inspect the Upgradeable condition message. [2]

Sources: [1] [2] [3]


@wking, I searched for details on the alert and found that ODFOperatorNotUpgradeable fires when the ODF Operator reports "Upgradeable=False," typically due to version incompatibility with the target OpenShift version or subscription/channel mismatches. It's a pre-upgrade compatibility check that blocks the operator from upgrading.

So it seems like this is more about operator upgrade readiness than node-drain speed—but I'll defer to your judgment since you know the cluster update flows better! If you're comfortable keeping the node-drain message, that works too. 🙂

@wking
Member Author

wking commented Feb 24, 2026

/retest-required

@wking wking force-pushed the extend-update-recommend-precheck-alerts branch from 4d1654d to c72c87f on February 24, 2026, 18:16
@openshift-ci-robot

openshift-ci-robot commented Feb 24, 2026

@wking: This pull request references OTA-1864 which is a valid jira issue.

Details

In response to this:

Details in the commit messages :).

Summary by CodeRabbit

  • New Features
  • Enhanced upgrade recommendations with finer alert categorization and per-alert reporting.
  • Explicit handling and global reporting for update-precheck alerts to surface pre-upgrade readiness.
  • Expanded support for additional alert types in recommendations, including KubeletHealthState, ODF operator alerts, and virtualization-related alerts.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
pkg/cli/admin/upgrade/recommend/alerts.go (2)

70-71: ⚠️ Potential issue | 🟠 Major

Capture UpdatePrecheck before early continues to avoid false “none” condition.

On Line 177, haveUpdatePrecheck is only set inside the UpdatePrecheck branch, which is after early continues for critical/PDB/node/pull alerts. If an alert has the label and matches those earlier branches, Line 257 can still emit “No alerts with the openShiftUpdatePrecheck label…”. Consider hoisting the label detection before any early continue, and reuse that flag in the UpdatePrecheck branch.

🔧 Minimal fix to avoid false “no alerts”
@@
-		if alert.State == "pending" {
+		if alert.State == "pending" {
 			continue
 		}
+
+		isUpdatePrecheck := alert.Labels.UpdatePrecheck == "true"
+		if isUpdatePrecheck {
+			haveUpdatePrecheck = true
+		}
@@
-		if alert.Labels.UpdatePrecheck == "true" {
+		if isUpdatePrecheck {
 			haveUpdatePrecheck = true
 			conditions = append(conditions, acceptableCondition{

Also applies to: 177-190, 257-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 70 - 71, The code
currently sets haveUpdatePrecheck inside the UpdatePrecheck branch after several
early continue paths, causing false "none" reports; move the detection of the
openShiftUpdatePrecheck label (set haveUpdatePrecheck = true) to run before any
early continue checks so the flag is always captured for each alert, then in the
UpdatePrecheck handling branch (the existing UpdatePrecheck logic) reuse that
precomputed haveUpdatePrecheck flag instead of setting it there; ensure similar
hoisting is applied to the other flagged variables if they suffer the same
early-continue issue.
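
The hoisting pattern the bot describes in its diff can be seen in isolation in a standalone loop. The types and function below (`Alert`, `Labels`, `havePrecheck`) are simplified stand-ins for illustration, not the actual structures in `alerts.go`:

```go
package main

import "fmt"

// Labels is a simplified stand-in for the alert label struct.
type Labels struct {
	AlertName      string
	Severity       string
	UpdatePrecheck string
}

// Alert is a simplified stand-in for a Prometheus alert entry.
type Alert struct {
	State  string
	Labels Labels
}

// havePrecheck reports whether any firing alert carries the
// update-precheck label. The label check is hoisted ABOVE the
// category-specific branches, so an alert that also matches an
// earlier category (e.g. severity critical) is still counted.
func havePrecheck(alerts []Alert) bool {
	have := false
	for _, alert := range alerts {
		if alert.Labels.AlertName == "" {
			continue
		}
		if alert.State == "pending" {
			continue
		}
		// Hoisted: runs before any category-specific early continue.
		if alert.Labels.UpdatePrecheck == "true" {
			have = true
		}
		if alert.Labels.Severity == "critical" {
			continue // this branch can no longer mask the label
		}
		// ... remaining category branches ...
	}
	return have
}

func main() {
	alerts := []Alert{{
		State:  "firing",
		Labels: Labels{AlertName: "SomeCriticalAlert", Severity: "critical", UpdatePrecheck: "true"},
	}}
	fmt.Println(havePrecheck(alerts))
}
```

With the hoisted check, the critical-severity alert above is still counted toward the precheck flag, so the global "no precheck alerts" condition would not be emitted falsely.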

192-199: ⚠️ Potential issue | 🟡 Minor

ODFOperatorNotUpgradeable message is misleading.

Line 198 still references “rolling node updates,” which doesn’t match an ODF operator upgradeability issue. Consider a message that reflects operator readiness.

✍️ Suggested message tweak
-					Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+					Message: fmt.Sprintf("%s alert %s %s, suggesting the ODF operator is not ready for a cluster update. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 199, The
message for the ODFOperatorNotUpgradeable case is incorrect (mentions "rolling
node updates"); update the Condition.Message in the alerts.go branch that
matches alertName == "ODFOperatorNotUpgradeable" (inside the loop building
acceptableCondition) to describe operator upgradeability/readiness instead of
rolling node updates—for example, state that the alert indicates the ODF
operator is not ready for upgrade or may block operator upgrades, and include
alert.Labels.Severity, alert.Labels.AlertName, alert.State and details in the
formatted message to preserve context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Around line 70-71: The code currently sets haveUpdatePrecheck inside the
UpdatePrecheck branch after several early continue paths, causing false "none"
reports; move the detection of the openShiftUpdatePrecheck label (set
haveUpdatePrecheck = true) to run before any early continue checks so the flag
is always captured for each alert, then in the UpdatePrecheck handling branch
(the existing UpdatePrecheck logic) reuse that precomputed haveUpdatePrecheck
flag instead of setting it there; ensure similar hoisting is applied to the
other flagged variables if they suffer the same early-continue issue.
- Around line 192-199: The message for the ODFOperatorNotUpgradeable case is
incorrect (mentions "rolling node updates"); update the Condition.Message in the
alerts.go branch that matches alertName == "ODFOperatorNotUpgradeable" (inside
the loop building acceptableCondition) to describe operator
upgradeability/readiness instead of rolling node updates—for example, state that
the alert indicates the ODF operator is not ready for upgrade or may block
operator upgrades, and include alert.Labels.Severity, alert.Labels.AlertName,
alert.State and details in the formatted message to preserve context.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4d1654d and c72c87f.

📒 Files selected for processing (2)
  • pkg/cli/admin/upgrade/recommend/alerts.go
  • pkg/cli/admin/upgrade/status/alerts.go

…abel

Similar to the existing 'critical' label check, this new label
provides alert maintainers with a way to declare an alert relevant for
cluster-update prechecks, without having to get the name hard-coded
into this file.  That removes the update team from the middle, and
provides an API that alert-maintainers can drive directly, because as
an update maintainer, I was not aware that alerts like
ODFOperatorNotUpgradeable existed, or that they might be interesting
to cluster-admins preparing for a cluster update.  By adding this new
label logic, that can be a discussion directly between the consuming
cluster-admins and the alert maintainers.

I'd initially expected to use a URI-like label similar to the
kubernetes.io/description annotation, but in testing against a 4.18.31
cluster, it appears that those are not acceptable alert labels:

  $ oc --as system:admin apply -f test.yaml
  The  "prometheusrules" is invalid: : group "test", rule 1, "TestAlert": invalid label name: openshift.io/update-precheck

and even hyphens seem invalid:

  $ oc --as system:admin apply -f test.yaml
  The  "prometheusrules" is invalid: : group "test", rule 1, "TestAlert": invalid label name: openshift-update-precheck

Hopefully the name I've chosen is sufficiently namespaced to not
collide with anyone else trying to use that label string with other
semantics, even without the ability to use DNS to assign namespace
ownership.
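
The rejections above follow from Prometheus's label-name grammar, which (per the Prometheus data model) allows only names matching `[a-zA-Z_][a-zA-Z0-9_]*`, so dots, slashes, and hyphens are all invalid. A quick check against that pattern, using an illustrative regexp rather than the operator's actual validation code:

```go
package main

import (
	"fmt"
	"regexp"
)

// labelNameRE mirrors the Prometheus label-name grammar from the
// data-model documentation: a letter or underscore, then letters,
// digits, or underscores. Dots, slashes, and hyphens are excluded.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

func main() {
	for _, name := range []string{
		"openshift.io/update-precheck", // rejected: dot and slash
		"openshift-update-precheck",    // rejected: hyphens
		"openShiftUpdatePrecheck",      // accepted: letters only
	} {
		fmt.Printf("%s: %v\n", name, labelNameRE.MatchString(name))
	}
}
```

This is why a camelCase name like `openShiftUpdatePrecheck` is the closest available approximation to a DNS-namespaced label.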
It's a warning alert [1], so it won't be picked up by our generic
'critical' check.  The Container-native Virtualization team (CNV)
mentioned this alert as possibly being worth surfacing to
cluster-admins considering a cluster-version update.  The runbook
doesn't have much advice other than giving up on live migration [2],
but I'm fine just passing these along for now, and not trying to
second-guess the component operator teams.

[1]: https://github.com/kubevirt/kubevirt/blob/e36f1c7568e813a21fa1e627d9590a4ee12d654f/pkg/monitoring/rules/alerts/vms.go#L68-L77
[2]: https://github.com/openshift/runbooks/blob/3c559fb0dcec0e6182a10c570d578ec99abe14b5/alerts/openshift-virtualization-operator/VMCannotBeEvicted.md
@wking wking force-pushed the extend-update-recommend-precheck-alerts branch from c72c87f to 25f8db4 on February 24, 2026 at 20:58
@openshift-ci-robot

openshift-ci-robot commented Feb 24, 2026

@wking: This pull request references OTA-1864 which is a valid jira issue.

Details

In response to this:

Details in the commit messages :).

Summary by CodeRabbit

  • New Features
  • Upgrade recommendations now report alerts per-alert and by category for clearer guidance.
  • Update-precheck alerts are surfaced explicitly and a global “no update-precheck issues” condition is shown when none are found.
  • Recommendations include additional alert types (KubeletHealthState, ODF operator, and virtualization-related alerts) for more comprehensive guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
pkg/cli/admin/upgrade/recommend/alerts.go (2)

70-70: haveUpdatePrecheck is still set too late — potential false "no precheck alerts" condition.

The haveUpdatePrecheck flag (line 70) is only set at line 178, inside the UpdatePrecheck branch. Any alert labeled openShiftUpdatePrecheck: "true" that also matches an earlier branch (e.g., severity == "critical", alertName == "KubeContainerWaiting", a node alert, etc.) will continue before ever reaching line 177, leaving haveUpdatePrecheck = false and causing the global "No alerts with the openShiftUpdatePrecheck label true are firing." condition (line 262) to be emitted incorrectly.

🐛 Minimal fix: detect the label before any early continue
 	for _, alert := range alertData.Data.Alerts {
 		var alertName string
 		if alertName = alert.Labels.AlertName; alertName == "" {
 			continue
 		}
 		if alert.State == "pending" {
 			continue
 		}
+		if alert.Labels.UpdatePrecheck == "true" {
+			haveUpdatePrecheck = true
+		}
 
 		// ... (existing summary/description/runbook/details setup) ...
 
 		if alert.Labels.Severity == "critical" {

Also applies to: 177-190, 257-264

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` at line 70, The haveUpdatePrecheck
flag is set too late and can remain false when an alert carrying the
openShiftUpdatePrecheck label is skipped by earlier continue branches; fix by
checking alert.Labels (or the equivalent label map used in the loop) for
"openShiftUpdatePrecheck" at the top of the alert-processing loop and
immediately set haveUpdatePrecheck = true if present before any early-continue
logic (this affects the UpdatePrecheck branch and the global "no precheck
alerts" condition); update the code that iterates alerts (refer to the loop
handling alerts and the haveUpdatePrecheck variable) so the label check runs
first for every alert.

198-198: Misleading message for ODFOperatorNotUpgradeable — copy-paste from node alerts.

The message "which may slow workload redistribution during rolling node updates" describes a node-drain scenario, not ODF operator upgrade readiness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` at line 198, The alert message
currently uses a copy-pasted node-drain clause; update the code that sets
Message (the fmt.Sprintf call that uses alert.Labels.AlertName and alert.State)
to special-case alert.Labels.AlertName == "ODFOperatorNotUpgradeable" and set a
clear operator-focused message (e.g., indicate the ODF operator is not
upgradeable and may block ODF upgrades or require operator remediation) instead
of "which may slow workload redistribution during rolling node updates"; modify
the fmt.Sprintf branch in alerts.go that builds Message to choose the
operator-specific template for "ODFOperatorNotUpgradeable" (or use a small map
of alert name → message template) while leaving other alerts unchanged.
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)

192-218: ODFAlerts and VirtAlerts have no global "AsExpected" condition — inconsistent with all other categories.

Every other alert category (CriticalAlerts, PodDisruptionBudgetAlerts, PodImagePullAlerts, NodeAlerts, UpdatePrecheckAlerts) emits a Status: True / Reason: AsExpected condition when no alerts fire in that category, surfacing them in the "no cause for concern" list. ODFAlerts and VirtAlerts are silent when idle, so users have no signal that these categories were checked.

♻️ Proposed fix: add end-of-loop AsExpected conditions for ODF and Virt
+	haveODF := false
+	haveVirt := false

     // ... (in the loop) ...

 		if alertName == "ODFOperatorNotUpgradeable" {
+			haveODF = true
 			conditions = append(conditions, acceptableCondition{

     // ...

 		if alertName == "VirtHandlerDaemonSetRolloutFailing" || alertName == "VMCannotBeEvicted" {
+			haveVirt = true
 			conditions = append(conditions, acceptableCondition{

     // ... (after the loop) ...

+	if !haveODF {
+		conditions = append(conditions, acceptableCondition{Condition: metav1.Condition{
+			Type:    "recommended/ODFAlerts",
+			Status:  metav1.ConditionTrue,
+			Reason:  "AsExpected",
+			Message: "No ODF operator alerts firing.",
+		}})
+	}
+
+	if !haveVirt {
+		conditions = append(conditions, acceptableCondition{Condition: metav1.Condition{
+			Type:    "recommended/VirtAlerts",
+			Status:  metav1.ConditionTrue,
+			Reason:  "AsExpected",
+			Message: "No virtualization alerts firing.",
+		}})
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 218, The
ODFAlerts and VirtAlerts branches (checking alertName ==
"ODFOperatorNotUpgradeable" and alertName ==
"VirtHandlerDaemonSetRolloutFailing" || alertName == "VMCannotBeEvicted")
currently only append a False condition when an alert is present and never emit
a positive "AsExpected" metav1.Condition when nothing fired; update the logic
around the loop that builds conditions so that for these categories you also
append an acceptableCondition with Condition{Type:
fmt.Sprintf("recommended/ODFAlerts/%s/%d", ...)/"recommended/VirtAlerts/%s/%d",
Status: metav1.ConditionTrue, Reason: "AsExpected", Message: "<no alerts>"}
(matching the pattern used by other categories) when no matching alerts were
encountered, using the same acceptableCondition struct and reusing alertName/i
to locate the right place; ensure this is added at the end of the category
handling (after the loop or before continuing) so idle ODF and Virt categories
produce the positive AsExpected entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Line 70: The haveUpdatePrecheck flag is set too late and can remain false when
an alert carrying the openShiftUpdatePrecheck label is skipped by earlier
continue branches; fix by checking alert.Labels (or the equivalent label map
used in the loop) for "openShiftUpdatePrecheck" at the top of the
alert-processing loop and immediately set haveUpdatePrecheck = true if present
before any early-continue logic (this affects the UpdatePrecheck branch and the
global "no precheck alerts" condition); update the code that iterates alerts
(refer to the loop handling alerts and the haveUpdatePrecheck variable) so the
label check runs first for every alert.
- Line 198: The alert message currently uses a copy-pasted node-drain clause;
update the code that sets Message (the fmt.Sprintf call that uses
alert.Labels.AlertName and alert.State) to special-case alert.Labels.AlertName
== "ODFOperatorNotUpgradeable" and set a clear operator-focused message (e.g.,
indicate the ODF operator is not upgradeable and may block ODF upgrades or
require operator remediation) instead of "which may slow workload redistribution
during rolling node updates"; modify the fmt.Sprintf branch in alerts.go that
builds Message to choose the operator-specific template for
"ODFOperatorNotUpgradeable" (or use a small map of alert name → message
template) while leaving other alerts unchanged.

---

Nitpick comments:
In `@pkg/cli/admin/upgrade/recommend/alerts.go`:
- Around line 192-218: The ODFAlerts and VirtAlerts branches (checking alertName
== "ODFOperatorNotUpgradeable" and alertName ==
"VirtHandlerDaemonSetRolloutFailing" || alertName == "VMCannotBeEvicted")
currently only append a False condition when an alert is present and never emit
a positive "AsExpected" metav1.Condition when nothing fired; update the logic
around the loop that builds conditions so that for these categories you also
append an acceptableCondition with Condition{Type:
fmt.Sprintf("recommended/ODFAlerts/%s/%d", ...)/"recommended/VirtAlerts/%s/%d",
Status: metav1.ConditionTrue, Reason: "AsExpected", Message: "<no alerts>"}
(matching the pattern used by other categories) when no matching alerts were
encountered, using the same acceptableCondition struct and reusing alertName/i
to locate the right place; ensure this is added at the end of the category
handling (after the loop or before continuing) so idle ODF and Virt categories
produce the positive AsExpected entry.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between c72c87f and 25f8db4.

📒 Files selected for processing (8)
  • pkg/cli/admin/upgrade/recommend/alerts.go
  • pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.output
  • pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.show-outdated-releases-output
  • pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.version-4.16.32-output
  • pkg/cli/admin/upgrade/recommend/examples/4.19.0-okd-scos.16.output
  • pkg/cli/admin/upgrade/recommend/examples/4.19.0-okd-scos.16.show-outdated-releases-output
  • pkg/cli/admin/upgrade/recommend/examples/4.19.0-okd-scos.16.version-4.19.0-okd-scos.17-output
  • pkg/cli/admin/upgrade/status/alerts.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.version-4.16.32-output

@hongkailiu
Member

/lgtm

/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name            Commit   Details  Required  Rerun command
ci/prow/e2e-aws-ovn  25f8db4  link     true      /test e2e-aws-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@wking
Member Author

wking commented Feb 27, 2026

e2e-aws-ovn verify-all-openshiftredhatoperators-... KubePodNotReady is unrelated to this client-side oc pull. But no need to kick off a retest (that might go stale, even if it passes) while we're still waiting on a verified label.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
