OTA-1546: Add a metric and an alert for conditional risk conditions by hongkailiu · Pull Request #1318 · openshift/cluster-version-operator

hongkailiu · 2026-02-15T02:14:45Z

The new metic is cluster_version_risk_conditions{name, condition, risk} and the alert is firing if some risk applies to the cluster.

The samples for cluster_version_risk_conditions will be collected
only when its operator shouldReconcileAcceptRisks. It means, e.g.,
on a TechPreview disabled cluster the metric is still defined but
has no samples.

The alert is defined in a new
PrometheusRule/cluster-version-operator-tech-preview
which is installed only on a TechPreview cluster.

coderabbitai · 2026-02-15T02:14:59Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds a Prometheus GaugeVec metric cluster_version_risk_conditions (labels: name, condition, risk), wires its collection into the operator metrics flow, adds unit tests for the collector, and introduces a PrometheusRule alert that fires on risk conditions.

Changes

Cohort / File(s)	Summary
Metrics Implementation `pkg/cvo/metrics.go`	Adds `clusterVersionRiskConditions *prometheus.GaugeVec` to `operatorMetrics`, initializes it in `newOperatorMetrics`, exposes its descriptor in `Describe`, adds `collectConditionalUpdateRisks` to convert `ConditionalUpdateRisk` entries to metrics, and calls it from the per-ClusterVersion and main `Collect` flows.
Metrics Testing `pkg/cvo/metrics_test.go`	Adds `Test_collectConditionalUpdateRisks` to assert emitted metrics for various `ConditionalUpdateRisk` cases. The test appears duplicated (two identical definitions) in the file.
Alert Rule `install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml`	Adds a PrometheusRule with a `RiskApplies` alert that fires when `cluster_version_risk_conditions` indicates an Applies risk, grouped by namespace/name/risk, with a 10-minute window and `warning` severity.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml`:
- Around line 86-95: The runbook_url for the alert RiskApplies points to a
non-existent path; update the runbook_url field in the RiskApplies alert stanza
to use the cluster-version-operator runbook path used by similar alerts (e.g.,
match the URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.

In `@pkg/cvo/metrics_test.go`:
- Line 976: The test function named Test_collectConditionalUpdates is testing
collectConditionalUpdateRisks (not collectConditionalUpdates), which is
misleading and conflicts with TestCollectUnknownConditionalUpdates; rename the
test to something like Test_collectConditionalUpdateRisks (or
TestCollectConditionalUpdateRisks) and update any references so the test name
clearly matches the function under test (collectConditionalUpdateRisks) and does
not collide with the existing TestCollectUnknownConditionalUpdates that targets
collectConditionalUpdates.

install/0000_90_cluster-version-operator_02_servicemonitor.yaml

pkg/cvo/metrics_test.go

hongkailiu · 2026-02-15T15:49:30Z

Testing with 129bc80:

launch 4.22,openshift/cluster-version-operator#1318 aws,techpreview

Then

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://fauxinnati-fauxinnati.apps.ota-stage.q2z4.p1.openshiftapps.com/api/upgrades_info/graph"}}' --type=merge

$ oc adm upgrade channel risks-always

The alert is Pending.

and

hongkailiu · 2026-02-16T13:58:08Z

/hold

This feature has to be gated by TP.

hongkailiu · 2026-02-16T23:14:28Z

Test with 2822b83

launch 4.22,openshift/cluster-version-operator#1318 aws,techpreview

$ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
NAME                                    AGE
cluster-version-operator-tech-preview   37m

Repeating the steps in #1318 (comment) get the same result.

launch 4.22,openshift/cluster-version-operator#1318 gcp,single-node

$ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
Error from server (NotFound): prometheusrules.monitoring.coreos.com "cluster-version-operator-tech-preview" not found

and no data found for cluster_version_risk_conditions.

hongkailiu · 2026-02-17T04:11:51Z

/retest-required

hongkailiu · 2026-02-17T04:13:21Z

Still holding.
Wait for #1323 to get in.

pkg/cvo/metrics.go

hongkailiu

Thanks for the review. @simonpasquier

pkg/cvo/metrics.go

hongkailiu · 2026-02-18T23:31:09Z

/retest-required

hongkailiu · 2026-02-20T03:23:45Z

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

hongkailiu · 2026-02-20T12:15:58Z

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

hongkailiu · 2026-02-23T15:33:28Z

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

hongkailiu · 2026-02-23T15:33:42Z

/retest-required

install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml

fao89 · 2026-02-23T15:44:37Z

install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml

+      annotations:
+        summary: The cluster has been exposed to the conditional update risk for 10 minutes.
+        description: The conditional update risk {{ "{{ $labels.risk }}" }} applies to the cluster, and the cluster update to a version exposed to the risk is not recommended. For more information refer to 'oc adm upgrade'.
+        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md


openshift/runbooks#408

hongkailiu · 2026-02-25T21:28:33Z

/retest-required

hongkailiu · 2026-02-25T21:38:33Z

tested with e975641

and the alert is still working

/verified by @hongkailiu

openshift-ci-robot · 2026-02-25T21:38:45Z

@hongkailiu: This PR has been marked as verified by @hongkailiu.

Details

In response to this:

tested with e975641

and the alert is still working

/verified by @hongkailiu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

hongkailiu · 2026-02-25T21:39:18Z

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

hongkailiu · 2026-02-26T02:02:28Z

/test e2e-hypershift

hongkailiu · 2026-02-26T14:05:27Z

The failed cases seems not relevant to the change from this pull

/retest-required

hongkailiu · 2026-02-27T02:27:40Z

/retest-required

/hold cancel

wking · 2026-02-27T03:04:31Z

pkg/cvo/metrics.go

 			Help: "Report the conditions for active cluster operators. 0 is False and 1 is True.",
 		}, []string{"name", "condition", "reason"}),
+		clusterVersionRiskConditions: prometheus.NewGaugeVec(prometheus.GaugeOpts{
+			Name: "cluster_version_risk_conditions",


Not worth holding on, but personally it bugs me that cluster_operator_conditions uses a plural name when each time-series will only be talking about a single condition at a time. I'd have prefered cluster_operator_condition there and cluster_version_risk_condition here. Too late to be worth changing cluster_operator_conditions at this point. Maybe that means we go plural again just for CVO-metric-weirdness consistency. Or maybe we can pivot to singular before going GA.

Upstream has the same preference on Singulars, such as kube_node_status_condition and kube_deployment_status_condition from CHANGELOG.md.

$ rg '_condition' -c CHANGELOG.md:5 pkg/customresourcestate/registry_factory_test.go:2 internal/store/persistentvolumeclaim_test.go:39 internal/store/node.go:1 internal/store/persistentvolumeclaim.go:1 internal/store/horizontalpodautoscaler.go:1 internal/store/namespace.go:1 internal/store/deployment_test.go:23 internal/store/horizontalpodautoscaler_test.go:10 internal/store/node_test.go:36 internal/store/certificatesigningrequest_test.go:30 internal/store/namespace_test.go:11 internal/store/certificatesigningrequest.go:1 internal/store/deployment.go:1 docs/metrics/auth/certificatesigningrequest-metrics.md:1 docs/metrics/workload/deployment-metrics.md:1 docs/metrics/workload/horizontalpodautoscaler-metrics.md:1 docs/metrics/storage/persistentvolumeclaim-metrics.md:1 docs/metrics/cluster/node-metrics.md:1 docs/metrics/cluster/namespace-metrics.md:1 $ rg '_conditions' -c pkg/customresourcestate/registry_factory_test.go:2

Maybe that means we go plural again just for CVO-metric-weirdness consistency. Or maybe we can pivot to singular before going GA.

Works for me.
If I get time after completing the features, I will make a pull to rename both.

wking

/lgtm

hongkailiu · 2026-02-27T13:42:19Z

Failed on random cases.

/retest-required

install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml

wking

/lgtm

openshift-ci · 2026-02-27T14:36:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fao89, hongkailiu, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [fao89,hongkailiu,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hongkailiu · 2026-02-27T19:31:55Z

/retest-required

wking · 2026-02-27T22:01:27Z

The runbook-URL fix won't have broken anything, so refreshing this.

/verified by @hongkailiu refreshed after an orthogonal change by @wking

openshift-ci-robot · 2026-02-27T22:01:38Z

@wking: This PR has been marked as verified by @hongkailiu refreshed after an orthogonal change by @wking.

Details

In response to this:

The runbook-URL fix won't have broken anything, so refreshing this.

/verified by @hongkailiu refreshed after an orthogonal change by @wking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-02-27T22:54:40Z

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-agnostic-ovn-techpreview-serial	`f458075`	link	true	`/test e2e-agnostic-ovn-techpreview-serial`
ci/prow/e2e-agnostic-ovn	`f458075`	link	true	`/test e2e-agnostic-ovn`
ci/prow/e2e-hypershift	`f458075`	link	true	`/test e2e-hypershift`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hongkailiu · 2026-02-28T03:33:31Z

/retest-required

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2026

coderabbitai bot reviewed Feb 15, 2026

View reviewed changes

install/0000_90_cluster-version-operator_02_servicemonitor.yaml Outdated Show resolved Hide resolved

pkg/cvo/metrics_test.go Outdated Show resolved Hide resolved

hongkailiu force-pushed the cluster_version_risk_condition branch 2 times, most recently from 3b6c582 to 129bc80 Compare February 15, 2026 13:52

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 16, 2026

hongkailiu force-pushed the cluster_version_risk_condition branch 5 times, most recently from 9138c46 to 2822b83 Compare February 16, 2026 22:07

hongkailiu mentioned this pull request Feb 17, 2026

OTA-1546: Add the runbook for RiskApplies openshift/runbooks#408

Merged

simonpasquier reviewed Feb 17, 2026

View reviewed changes

pkg/cvo/metrics.go Outdated Show resolved Hide resolved

pkg/cvo/metrics.go Outdated Show resolved Hide resolved

hongkailiu commented Feb 17, 2026

View reviewed changes

pkg/cvo/metrics.go Outdated Show resolved Hide resolved

pkg/cvo/metrics.go Outdated Show resolved Hide resolved

hongkailiu mentioned this pull request Feb 17, 2026

OTA-1546: Reconcile cv.spec.desiredUpdate.acceptRisks #1284

Merged

hongkailiu force-pushed the cluster_version_risk_condition branch 2 times, most recently from f20f114 to e37e81d Compare February 18, 2026 00:58

hongkailiu requested a review from simonpasquier February 23, 2026 15:34

simonpasquier reviewed Feb 23, 2026

View reviewed changes

install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml Outdated Show resolved Hide resolved

fao89 reviewed Feb 23, 2026

View reviewed changes

hongkailiu added 2 commits February 24, 2026 15:19

Set -1 when Applies=Unknown

5d598cd

Replace abs() == 1 with != 0

e975641

hongkailiu force-pushed the cluster_version_risk_condition branch from 4358e43 to e975641 Compare February 24, 2026 20:19

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 25, 2026

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2026

wking reviewed Feb 27, 2026

View reviewed changes

wking approved these changes Feb 27, 2026

View reviewed changes

openshift-ci bot assigned wking Feb 27, 2026

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026

wking reviewed Feb 27, 2026

View reviewed changes

install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml Outdated Show resolved Hide resolved

Fix runbook_url

f458075

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Feb 27, 2026

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026

wking approved these changes Feb 27, 2026

View reviewed changes

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 27, 2026

Conversation

hongkailiu commented Feb 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Feb 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hongkailiu commented Feb 15, 2026

Uh oh!

hongkailiu commented Feb 16, 2026

Uh oh!

hongkailiu commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hongkailiu commented Feb 17, 2026

Uh oh!

hongkailiu commented Feb 17, 2026

Uh oh!

Uh oh!

Uh oh!

hongkailiu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hongkailiu commented Feb 18, 2026

Uh oh!

hongkailiu commented Feb 20, 2026

Uh oh!

hongkailiu commented Feb 20, 2026

Uh oh!

hongkailiu commented Feb 23, 2026

Uh oh!

hongkailiu commented Feb 23, 2026

Uh oh!

Uh oh!

fao89 Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

hongkailiu commented Feb 25, 2026

Uh oh!

hongkailiu commented Feb 25, 2026

Uh oh!

openshift-ci-robot commented Feb 25, 2026

Uh oh!

hongkailiu commented Feb 25, 2026

Uh oh!

hongkailiu commented Feb 26, 2026

Uh oh!

hongkailiu commented Feb 26, 2026

Uh oh!

hongkailiu commented Feb 27, 2026

Uh oh!

wking Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

hongkailiu Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

wking left a comment

Choose a reason for hiding this comment

Uh oh!

hongkailiu commented Feb 27, 2026

Uh oh!

Uh oh!

wking left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Feb 27, 2026

Uh oh!

hongkailiu commented Feb 27, 2026

hongkailiu commented Feb 15, 2026 •

edited

Loading

coderabbitai bot commented Feb 15, 2026 •

edited

Loading

hongkailiu commented Feb 16, 2026 •

edited

Loading