OTA-1864: Extend update recommend precheck alerts #2210
wking wants to merge 5 commits into openshift:main
Conversation
This is a warning alert [1], and the machine-config operator team says:

> An unhealthy kubelet could indicate that the node is unresponsive, etc. and could thus indirectly cause upgrades to stall on that node. Similar to the other KubeNode / kubelet alerts.

[1]: https://github.com/openshift/machine-config-operator/blob/e8fd02f68658c3a7a65126740aec4775491ddb6c/install/0000_90_machine-config_01_prometheus-rules.yaml#L95-L100
It's a warning alert [1], so it won't be picked up by our generic 'critical' check. The OpenShift Data Foundation team mentioned this alert as possibly being worth surfacing to cluster-admins considering a cluster-version update. The alert's runbook opens with [2]:

> This indicates that there are conditions preventing a safe upgrade to the next ODF version.

and ends with:

> [OLM Operator Conditions Documentation](https://olm.operatorframework.io/docs/advanced-tasks/communicating-operator-conditions-to-olm/)

and those OLM docs say:

> The Upgradeable "OLM Supported Condition" prevents the existing CSV from being replaced by a newer version of the CSV.

I'm not sure how relevant that is to ClusterVersion updates, but I'm fine just passing these along for now, and not trying to second-guess the component operator teams.

The ODF team also mentioned these alerts, but they're all critical, so they'll be surfaced by the existing, generic 'critical' check:

* CephClusterErrorState [3]
* CephMgrIsAbsent [4]
* CephClusterCriticallyFull [5]
* CephClusterReadOnly [6]
* CephMonQuorumLost [7]

[1]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/metrics/deploy/prometheus-ocs-rules-external.yaml#L174-L184
[2]: https://github.com/openshift/runbooks/blob/3c559fb0dcec0e6182a10c570d578ec99abe14b5/alerts/openshift-container-storage-operator/ODFOperatorNotUpgradeable.md#meaning
[3]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L270-L281
[4]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L38-L49
[5]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L332-L343
[6]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L344-L355
[7]: https://github.com/red-hat-storage/ocs-operator/blob/06d3031b68c98a685a6b399276113a31482566b0/controllers/storagecluster/prometheus/localcephrules.yaml#L90-L101
…utFailing

It's a warning alert [1], so it won't be picked up by our generic 'critical' check. The Container-native Virtualization team (CNV) mentioned this alert as possibly being worth surfacing to cluster-admins considering a cluster-version update. The runbook doesn't have much advice other than manually deleting Pods to free up resources [2], but I'm fine just passing these along for now, and not trying to second-guess the component operator teams.

[1]: https://github.com/kubevirt/kubevirt/blob/e36f1c7568e813a21fa1e627d9590a4ee12d654f/pkg/monitoring/rules/alerts/virt-handler.go#L32-L42
[2]: https://github.com/openshift/runbooks/blob/3c559fb0dcec0e6182a10c570d578ec99abe14b5/alerts/openshift-virtualization-operator/VirtHandlerDaemonSetRolloutFailing.md
@wking: This pull request references OTA-1864 which is a valid jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Walkthrough: Adds UpdatePrecheck alert handling and a new
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 passed
I also tested the label handling with this rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: testing
  namespace: openshift-cluster-version
  annotations:
    kubernetes.io/description: Alerting rules for testing.
spec:
  groups:
  - name: test
    rules:
    - alert: TestAlert
      annotations:
        summary: Test summary.
        description: Test description.
      expr: |
        up{job="cluster-version-operator"} == 1
      labels:
        severity: warning
        namespace: openshift-cluster-version
        openShiftUpdatePrecheck: "true"

And it worked:

$ ./oc adm upgrade recommend
The following conditions found no cause for concern in updating this cluster to later releases: recommended/NodeAlerts (AsExpected)
The following conditions found cause for concern in updating this cluster to later releases: ... recommended/UpdatePrecheckAlerts/TestAlert/0
recommended/UpdatePrecheckAlerts/TestAlert/0=False:
Reason: Alert:firing
Message: warning alert TestAlert firing, suggesting issues worth investigating before updating the cluster. Test summary. The alert description is: Test description. <alert does not have a runbook_url annotation>
...
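The message shape in that output can be sketched as a small Go helper. This is only an illustration of the observed format; buildPrecheckMessage is a hypothetical name, and the wording for the runbook-present case is an assumption, not the actual implementation in alerts.go:

```go
package main

import "fmt"

// buildPrecheckMessage is a hypothetical helper sketching the message shape
// seen in the `oc adm upgrade recommend` output above; it is not the actual
// implementation in pkg/cli/admin/upgrade/recommend/alerts.go.
func buildPrecheckMessage(severity, name, state, summary, description, runbook string) string {
	details := fmt.Sprintf("%s The alert description is: %s.", summary, description)
	if runbook == "" {
		// Matches the placeholder shown in the tested output above.
		details += " <alert does not have a runbook_url annotation>"
	} else {
		// Assumed wording for the runbook case; the real text may differ.
		details += fmt.Sprintf(" The alert runbook is %s", runbook)
	}
	return fmt.Sprintf("%s alert %s %s, suggesting issues worth investigating before updating the cluster. %s",
		severity, name, state, details)
}

func main() {
	fmt.Println(buildPrecheckMessage("warning", "TestAlert", "firing", "Test summary.", "Test description", ""))
}
```

Feeding in the TestAlert fields reproduces the message line shown in the test output.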
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)
192-218: Consider tailoring messages for ODF/Virt alerts.

The current text ("may slow workload redistribution during rolling node updates") doesn't match the semantics of ODFOperatorNotUpgradeable or VirtHandlerDaemonSetRolloutFailing. A more specific message would be clearer for operators.

✏️ Example message tweaks

- Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+ Message: fmt.Sprintf("%s alert %s %s, which may block ODF upgrades and stall cluster updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
...
- Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+ Message: fmt.Sprintf("%s alert %s %s, indicating a daemonset rollout failure that may stall updates until resources are freed. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 218, The two special-case blocks for alertName == "ODFOperatorNotUpgradeable" and "VirtHandlerDaemonSetRolloutFailing" should use alert-specific Message text instead of the generic "may slow workload redistribution during rolling node updates"; update the metav1.Condition.Message for the ODFOperatorNotUpgradeable branch to mention the ODF operator being not upgradeable and potential storage or cluster upgrade impacts (e.g., "ODF operator is not upgradeable; this may block storage upgrades or compromise data redundancy: %s"), and update the VirtHandlerDaemonSetRolloutFailing branch to reference VM migration/VM runtime availability or failed virt-handler rollout impact (e.g., "virt-handler DaemonSet rollout failing; this may prevent live migration or affect VM workloads: %s"); keep the rest of the condition construction (Type, Status, Reason, acceptanceName) unchanged and preserve the details variable formatting.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
pkg/cli/admin/upgrade/recommend/alerts.go
pkg/cli/admin/upgrade/status/alerts.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)
177-190: haveUpdatePrecheck is set too late — after early-continue branches that may skip it.

An alert carrying openShiftUpdatePrecheck: "true" that also matches an earlier branch (e.g. severity == "critical", or one of the named node/PDB alerts) will hit continue before reaching line 177, so haveUpdatePrecheck is never set for that alert. If all UpdatePrecheck-labeled alerts happen to also be critical (or match another earlier branch), the code will emit a false-positive "No alerts with the openShiftUpdatePrecheck label true are firing" condition at line 257.

Capture the label at the top of the loop iteration (right after the alertName/pending guards), and let the rest of the classification proceed independently:

Proposed fix

 	if alert.State == "pending" {
 		continue
 	}
+
+	if alert.Labels.UpdatePrecheck == "true" {
+		haveUpdatePrecheck = true
+	}
+
 	var summary string

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 177 - 190, The loop currently sets haveUpdatePrecheck only inside the UpdatePrecheck branch which is skipped by earlier continue paths, causing missed markings; move the detection of the openShiftUpdatePrecheck label to the start of each iteration (immediately after the alertName/pending guards) by reading alert.Labels.UpdatePrecheck into a local variable or directly setting haveUpdatePrecheck when that label equals "true" before any classification/continue logic, then let the existing classification branches (the critical/node/PDB checks and the UpdatePrecheck branch that builds acceptableCondition) run unchanged so the flag correctly reflects any alert carrying the label regardless of other matches.
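To see why the review's ordering concern matters, here is a runnable, stripped-down sketch of the classification loop with the label detection hoisted. The Alert and Labels types and the classify function are stand-ins for illustration, not the real structs in alerts.go:

```go
package main

import "fmt"

// Stand-ins for the real structures in alerts.go; these field names are
// illustrative, not the actual openshift/oc types.
type Labels struct {
	AlertName      string
	Severity       string
	UpdatePrecheck string
}

type Alert struct {
	State  string
	Labels Labels
}

// classify sketches the review's point: detect the precheck label before any
// early `continue`, so a critical alert that also carries the label still
// sets haveUpdatePrecheck.
func classify(alerts []Alert) (critical []string, haveUpdatePrecheck bool) {
	for _, alert := range alerts {
		if alert.Labels.AlertName == "" || alert.State == "pending" {
			continue // unnamed or pending alerts are skipped entirely
		}
		// Hoisted label detection, ahead of the classification branches.
		if alert.Labels.UpdatePrecheck == "true" {
			haveUpdatePrecheck = true
		}
		if alert.Labels.Severity == "critical" {
			critical = append(critical, alert.Labels.AlertName)
			continue // this continue would skip a later-placed label check
		}
	}
	return critical, haveUpdatePrecheck
}

func main() {
	crit, have := classify([]Alert{
		{State: "firing", Labels: Labels{AlertName: "CephClusterErrorState", Severity: "critical", UpdatePrecheck: "true"}},
	})
	fmt.Println(len(crit), have) // the critical alert still registers the precheck label
}
```

With the check placed after the critical branch instead, the same input would leave haveUpdatePrecheck false, which is the false-"none" condition the review describes.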
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)
206-218: Consider more specific messages for the virt alerts.

VMCannotBeEvicted relates to VMs blocking node drains, so the "slow workload redistribution" wording is roughly applicable. However, VirtHandlerDaemonSetRolloutFailing is about the virt-handler DaemonSet itself failing to roll out — a different failure mode. A more precise message for each (or at least for the DaemonSet alert) would reduce user confusion:

Suggested split

-	if alertName == "VirtHandlerDaemonSetRolloutFailing" || alertName == "VMCannotBeEvicted" {
+	if alertName == "VMCannotBeEvicted" {
 		conditions = append(conditions, acceptableCondition{
 			Condition: metav1.Condition{
 				Type:    fmt.Sprintf("recommended/VirtAlerts/%s/%d", alertName, i),
 				Status:  metav1.ConditionFalse,
 				Reason:  fmt.Sprintf("Alert:%s", alert.State),
-				Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+				Message: fmt.Sprintf("%s alert %s %s, suggesting virtual machines that cannot be evicted may block node drains during updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+			},
+			acceptanceName: alertName,
+		})
+		i += 1
+		continue
+	}
+
+	if alertName == "VirtHandlerDaemonSetRolloutFailing" {
+		conditions = append(conditions, acceptableCondition{
+			Condition: metav1.Condition{
+				Type:    fmt.Sprintf("recommended/VirtAlerts/%s/%d", alertName, i),
+				Status:  metav1.ConditionFalse,
+				Reason:  fmt.Sprintf("Alert:%s", alert.State),
+				Message: fmt.Sprintf("%s alert %s %s, suggesting the virt-handler DaemonSet is failing to roll out. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
 			},
 			acceptanceName: alertName,
 		})

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 206 - 218, The current branch lumps two distinct alerts under the same message in the acceptableCondition created in the loop (see alertName, VirtHandlerDaemonSetRolloutFailing, VMCannotBeEvicted and the appended acceptableCondition/Condition). Update the logic so each alert has a more specific Message: keep the VMCannotBeEvicted wording about VMs blocking drains and create a separate, clearer message for VirtHandlerDaemonSetRolloutFailing that mentions the virt-handler DaemonSet failing to roll out (impacting virt-handler pod updates and potential cluster VM management) while still including alert.Labels.Severity, alert.State and details; adjust the branch that builds metav1.Condition and acceptanceName accordingly so the right message is used for each alert name.
ℹ️ Review info
📒 Files selected for processing (1)
pkg/cli/admin/upgrade/recommend/alerts.go
if alertName == "ODFOperatorNotUpgradeable" {
	conditions = append(conditions, acceptableCondition{
		Condition: metav1.Condition{
			Type:    fmt.Sprintf("recommended/ODFAlerts/%s/%d", alertName, i),
			Status:  metav1.ConditionFalse,
			Reason:  fmt.Sprintf("Alert:%s", alert.State),
			Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
		},
		acceptanceName: alertName,
	})
	i += 1
	continue
}
Misleading message — copy-paste from node alerts doesn't fit ODF.
The message "which may slow workload redistribution during rolling node updates" describes a node-drain scenario. ODFOperatorNotUpgradeable signals that the ODF operator is not ready for an upgrade — a fundamentally different concern. Consider something like:
Suggested message

- Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+ Message: fmt.Sprintf("%s alert %s %s, suggesting the ODF operator is not ready for a cluster update. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 204, The
message for the ODFOperatorNotUpgradeable alert is misleading (it references
node redistribution/drain scenarios); update the message constructed in the
branch handling alertName == "ODFOperatorNotUpgradeable" so the
metav1.Condition.Message (inside the acceptableCondition appended) describes the
operator not being ready for upgrade (e.g., operator readiness, upgrade
preconditions, potential impact on storage availability) instead of "which may
slow workload redistribution during rolling node updates"; locate the branch
that appends acceptableCondition for alertName "ODFOperatorNotUpgradeable" in
alerts.go and replace the Message string built with fmt.Sprintf(...) to a
concise ODF-operator-specific message while keeping the rest of the Condition
fields and acceptanceName unchanged.
I think the only exposure cluster updates have to ODF is Node-drain speed, but I'm not ODF 🤷
🧩 Analysis chain
🌐 Web query:
ODFOperatorNotUpgradeable alert OpenShift meaning
💡 Result:
ODFOperatorNotUpgradeable is an OpenShift monitoring alert that fires when the OpenShift Data Foundation (ODF) Operator is reporting “Upgradeable=False” (i.e., it considers itself not safe/allowed to upgrade right now). In OpenShift, this “Upgradeable” condition is used to prevent an Operator’s CSV from being replaced by a newer one until the blocking issue is resolved. [1]
What it usually means in practice (common causes):
- ODF version is incompatible with the OpenShift minor version you're trying to upgrade to, so OLM/CVO flags it as blocking upgrades (for example, an openshift-storage/odf-operator CSV that is incompatible with a higher OpenShift minor version). [2]
- Subscription/channel/catalog mismatch during an attempted ODF upgrade (no matching operator found in the selected channel), which can leave the Operator unable to progress to an upgradeable state. [3]
How to confirm what’s blocking (quick checks):
- Look at the ODF operator/CSV "Upgradeable" condition message in the console (Operators → Installed Operators → ODF) or via oc get csv -n openshift-storage.
- Also check the cluster's upgrade blockers (often shows incompatible operators) via oc get co operator-lifecycle-manager -o yaml and inspect the Upgradeable condition message. [2]
Sources: [1] [2] [3]
@wking, I searched for details on the alert and found that ODFOperatorNotUpgradeable fires when the ODF Operator reports "Upgradeable=False," typically due to version incompatibility with the target OpenShift version or subscription/channel mismatches. It's a pre-upgrade compatibility check that blocks the operator from upgrading.
So it seems like this is more about operator upgrade readiness than node-drain speed—but I'll defer to your judgment since you know the cluster update flows better! If you're comfortable keeping the node-drain message, that works too. 🙂
/retest-required
4d1654d to c72c87f
♻️ Duplicate comments (2)
pkg/cli/admin/upgrade/recommend/alerts.go (2)
70-71: ⚠️ Potential issue | 🟠 Major

Capture UpdatePrecheck before early continues to avoid false "none" condition.

On Line 177, haveUpdatePrecheck is only set inside the UpdatePrecheck branch, which is after early continues for critical/PDB/node/pull alerts. If an alert has the label and matches those earlier branches, Line 257 can still emit "No alerts with the openShiftUpdatePrecheck label…". Consider hoisting the label detection before any early continue, and reuse that flag in the UpdatePrecheck branch.

🔧 Minimal fix to avoid false "no alerts"

@@
 	if alert.State == "pending" {
 		continue
 	}
+
+	isUpdatePrecheck := alert.Labels.UpdatePrecheck == "true"
+	if isUpdatePrecheck {
+		haveUpdatePrecheck = true
+	}
@@
-	if alert.Labels.UpdatePrecheck == "true" {
+	if isUpdatePrecheck {
 		haveUpdatePrecheck = true
 		conditions = append(conditions, acceptableCondition{

Also applies to: 177-190, 257-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 70 - 71, The code currently sets haveUpdatePrecheck inside the UpdatePrecheck branch after several early continue paths, causing false "none" reports; move the detection of the openShiftUpdatePrecheck label (set haveUpdatePrecheck = true) to run before any early continue checks so the flag is always captured for each alert, then in the UpdatePrecheck handling branch (the existing UpdatePrecheck logic) reuse that precomputed haveUpdatePrecheck flag instead of setting it there; ensure similar hoisting is applied to the other flagged variables if they suffer the same early-continue issue.
192-199: ⚠️ Potential issue | 🟡 Minor

ODFOperatorNotUpgradeable message is misleading.
Line 198 still references “rolling node updates,” which doesn’t match an ODF operator upgradeability issue. Consider a message that reflects operator readiness.
✍️ Suggested message tweak
- Message: fmt.Sprintf("%s alert %s %s, which may slow workload redistribution during rolling node updates. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),
+ Message: fmt.Sprintf("%s alert %s %s, suggesting the ODF operator is not ready for a cluster update. %s", alert.Labels.Severity, alert.Labels.AlertName, alert.State, details),

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 199, The message for the ODFOperatorNotUpgradeable case is incorrect (mentions "rolling node updates"); update the Condition.Message in the alerts.go branch that matches alertName == "ODFOperatorNotUpgradeable" (inside the loop building acceptableCondition) to describe operator upgradeability/readiness instead of rolling node updates—for example, state that the alert indicates the ODF operator is not ready for upgrade or may block operator upgrades, and include alert.Labels.Severity, alert.Labels.AlertName, alert.State and details in the formatted message to preserve context.
ℹ️ Review info
📒 Files selected for processing (2)
pkg/cli/admin/upgrade/recommend/alerts.go
pkg/cli/admin/upgrade/status/alerts.go
…abel

Similar to the existing 'critical' label check, this new label provides alert maintainers with a way to declare an alert relevant for cluster-update prechecks, without having to get the name hard-coded into this file. That removes the update team from the middle, and provides an API that alert-maintainers can drive directly, because as an update maintainer, I was not aware that alerts like ODFOperatorNotUpgradeable existed, or that they might be interesting to cluster-admins preparing for a cluster update. By adding this new label logic, that can be a discussion directly between the consuming cluster-admins and the alert maintainers.

I'd initially expected to use a URI-like label similar to the kubernetes.io/description annotation, but in testing against a 4.18.31 cluster, it appears that those are not acceptable alert labels:

$ oc --as system:admin apply -f test.yaml
The "prometheusrules" is invalid: : group "test", rule 1, "TestAlert": invalid label name: openshift.io/update-precheck

and even hyphens seem invalid:

$ oc --as system:admin apply -f test.yaml
The "prometheusrules" is invalid: : group "test", rule 1, "TestAlert": invalid label name: openshift-update-precheck

Hopefully the name I've chosen is sufficiently namespaced to not collide with anyone else trying to use that label string with other semantics, even without the ability to use DNS to assign namespace ownership.
It's a warning alert [1], so it won't be picked up by our generic 'critical' check. The Container-native Virtualization team (CNV) mentioned this alert as possibly being worth surfacing to cluster-admins considering a cluster-version update. The runbook doesn't have much advice other than giving up on live migration [2], but I'm fine just passing these along for now, and not trying to second-guess the component operator teams.

[1]: https://github.com/kubevirt/kubevirt/blob/e36f1c7568e813a21fa1e627d9590a4ee12d654f/pkg/monitoring/rules/alerts/vms.go#L68-L77
[2]: https://github.com/openshift/runbooks/blob/3c559fb0dcec0e6182a10c570d578ec99abe14b5/alerts/openshift-virtualization-operator/VMCannotBeEvicted.md
c72c87f to 25f8db4
♻️ Duplicate comments (2)
pkg/cli/admin/upgrade/recommend/alerts.go (2)
70-70:haveUpdatePrecheckis still set too late — potential false "no precheck alerts" condition.The
haveUpdatePrecheckflag (line 70) is only set at line 178, inside theUpdatePrecheckbranch. Any alert labeledopenShiftUpdatePrecheck: "true"that also matches an earlier branch (e.g.,severity == "critical",alertName == "KubeContainerWaiting", a node alert, etc.) willcontinuebefore ever reaching line 177, leavinghaveUpdatePrecheck = falseand causing the global "No alerts with the openShiftUpdatePrecheck label true are firing." condition (line 262) to be emitted incorrectly.🐛 Minimal fix: detect the label before any early continue
```diff
 	for _, alert := range alertData.Data.Alerts {
 		var alertName string
 		if alertName = alert.Labels.AlertName; alertName == "" {
 			continue
 		}

 		if alert.State == "pending" {
 			continue
 		}

+		if alert.Labels.UpdatePrecheck == "true" {
+			haveUpdatePrecheck = true
+		}

 		// ... (existing summary/description/runbook/details setup) ...

 		if alert.Labels.Severity == "critical" {
```

Also applies to: 177-190, 257-264
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` at line 70, The haveUpdatePrecheck flag is set too late and can remain false when an alert carrying the openShiftUpdatePrecheck label is skipped by earlier continue branches; fix by checking alert.Labels (or the equivalent label map used in the loop) for "openShiftUpdatePrecheck" at the top of the alert-processing loop and immediately set haveUpdatePrecheck = true if present before any early-continue logic (this affects the UpdatePrecheck branch and the global "no precheck alerts" condition); update the code that iterates alerts (refer to the loop handling alerts and the haveUpdatePrecheck variable) so the label check runs first for every alert.
198-198: Misleading message for `ODFOperatorNotUpgradeable`: copied from the node alerts.

The message "which may slow workload redistribution during rolling node updates" describes a node-drain scenario, not ODF operator upgrade readiness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` at line 198, The alert message currently uses a copy-pasted node-drain clause; update the code that sets Message (the fmt.Sprintf call that uses alert.Labels.AlertName and alert.State) to special-case alert.Labels.AlertName == "ODFOperatorNotUpgradeable" and set a clear operator-focused message (e.g., indicate the ODF operator is not upgradeable and may block ODF upgrades or require operator remediation) instead of "which may slow workload redistribution during rolling node updates"; modify the fmt.Sprintf branch in alerts.go that builds Message to choose the operator-specific template for "ODFOperatorNotUpgradeable" (or use a small map of alert name → message template) while leaving other alerts unchanged.
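The "small map of alert name → message template" idea from the prompt above could be sketched like this in Go. The template wording, map contents, and function name are illustrative assumptions for this example, not the PR's actual code.

```go
package main

import "fmt"

// alertMessageTemplates maps alert names to category-specific message
// templates; the wording below is illustrative, not the PR's final text.
var alertMessageTemplates = map[string]string{
	"ODFOperatorNotUpgradeable": "%s %s, so the OpenShift Data Foundation operator is reporting that it cannot be safely updated.",
}

// defaultNodeTemplate is the generic node-drain wording other alerts use.
const defaultNodeTemplate = "%s %s, which may slow workload redistribution during rolling node updates."

// alertMessage picks the operator-specific template when one exists and
// falls back to the generic node-drain wording otherwise.
func alertMessage(alertName, state string) string {
	template, ok := alertMessageTemplates[alertName]
	if !ok {
		template = defaultNodeTemplate
	}
	return fmt.Sprintf(template, alertName, state)
}

func main() {
	fmt.Println(alertMessage("ODFOperatorNotUpgradeable", "is firing"))
	fmt.Println(alertMessage("KubeNodeNotReady", "is firing"))
}
```

Keeping the templates in one map means adding another operator-specific alert later is a one-line change, while all other alerts keep the existing wording unchanged.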
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/recommend/alerts.go (1)
192-218: `ODFAlerts` and `VirtAlerts` have no global "AsExpected" condition, inconsistent with all other categories.

Every other alert category (`CriticalAlerts`, `PodDisruptionBudgetAlerts`, `PodImagePullAlerts`, `NodeAlerts`, `UpdatePrecheckAlerts`) emits a `Status: True / Reason: AsExpected` condition when no alerts fire in that category, surfacing them in the "no cause for concern" list. `ODFAlerts` and `VirtAlerts` are silent when idle, so users have no signal that these categories were checked.

♻️ Proposed fix: add end-of-loop AsExpected conditions for ODF and Virt
```diff
+	haveODF := false
+	haveVirt := false

 	// ... (in the loop) ...

 	if alertName == "ODFOperatorNotUpgradeable" {
+		haveODF = true
 		conditions = append(conditions, acceptableCondition{
 	// ...

 	if alertName == "VirtHandlerDaemonSetRolloutFailing" || alertName == "VMCannotBeEvicted" {
+		haveVirt = true
 		conditions = append(conditions, acceptableCondition{
 	// ...

 	// (after the loop) ...
+	if !haveODF {
+		conditions = append(conditions, acceptableCondition{Condition: metav1.Condition{
+			Type:    "recommended/ODFAlerts",
+			Status:  metav1.ConditionTrue,
+			Reason:  "AsExpected",
+			Message: "No ODF operator alerts firing.",
+		}})
+	}
+
+	if !haveVirt {
+		conditions = append(conditions, acceptableCondition{Condition: metav1.Condition{
+			Type:    "recommended/VirtAlerts",
+			Status:  metav1.ConditionTrue,
+			Reason:  "AsExpected",
+			Message: "No virtualization alerts firing.",
+		}})
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cli/admin/upgrade/recommend/alerts.go` around lines 192 - 218, The ODFAlerts and VirtAlerts branches (checking alertName == "ODFOperatorNotUpgradeable" and alertName == "VirtHandlerDaemonSetRolloutFailing" || alertName == "VMCannotBeEvicted") currently only append a False condition when an alert is present and never emit a positive "AsExpected" metav1.Condition when nothing fired; update the logic around the loop that builds conditions so that for these categories you also append an acceptableCondition with Condition{Type: fmt.Sprintf("recommended/ODFAlerts/%s/%d", ...)/"recommended/VirtAlerts/%s/%d", Status: metav1.ConditionTrue, Reason: "AsExpected", Message: "<no alerts>"} (matching the pattern used by other categories) when no matching alerts were encountered, using the same acceptableCondition struct and reusing alertName/i to locate the right place; ensure this is added at the end of the category handling (after the loop or before continuing) so idle ODF and Virt categories produce the positive AsExpected entry.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (8)
- pkg/cli/admin/upgrade/recommend/alerts.go
- pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.output
- pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.show-outdated-releases-output
- pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.version-4.16.32-output
- pkg/cli/admin/upgrade/recommend/examples/4.19.0-okd-scos.16.output
- pkg/cli/admin/upgrade/recommend/examples/4.19.0-okd-scos.16.show-outdated-releases-output
- pkg/cli/admin/upgrade/recommend/examples/4.19.0-okd-scos.16.version-4.19.0-okd-scos.17-output
- pkg/cli/admin/upgrade/status/alerts.go
✅ Files skipped from review due to trivial changes (1)
- pkg/cli/admin/upgrade/recommend/examples/4.16.27-degraded-monitoring.version-4.16.32-output
/lgtm /retest-required
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Details in the commit messages :).
Summary by CodeRabbit