Skip to content

OTA-1546: Add a metric and an alert for conditional risk conditions#1318

Open
hongkailiu wants to merge 9 commits intoopenshift:mainfrom
hongkailiu:cluster_version_risk_condition
Open

OTA-1546: Add a metric and an alert for conditional risk conditions#1318
hongkailiu wants to merge 9 commits intoopenshift:mainfrom
hongkailiu:cluster_version_risk_condition

Conversation

@hongkailiu
Copy link
Member

@hongkailiu hongkailiu commented Feb 15, 2026

Follow up #1284 (comment)

The new metic is cluster_version_risk_conditions{name, condition, risk} and the alert is firing if some risk applies to the cluster.

The samples for cluster_version_risk_conditions will be collected
only when its operator shouldReconcileAcceptRisks. It means, e.g.,
on a TechPreview disabled cluster the metric is still defined but
has no samples.

The alert is defined in a new
PrometheusRule/cluster-version-operator-tech-preview
which is installed only on a TechPreview cluster.

@coderabbitai
Copy link

coderabbitai bot commented Feb 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a Prometheus GaugeVec metric cluster_version_risk_conditions (labels: name, condition, risk), wires its collection into the operator metrics flow, adds unit tests for the collector, and introduces a PrometheusRule alert that fires on risk conditions.

Changes

Cohort / File(s) Summary
Metrics Implementation
pkg/cvo/metrics.go
Adds clusterVersionRiskConditions *prometheus.GaugeVec to operatorMetrics, initializes it in newOperatorMetrics, exposes its descriptor in Describe, adds collectConditionalUpdateRisks to convert ConditionalUpdateRisk entries to metrics, and calls it from the per-ClusterVersion and main Collect flows.
Metrics Testing
pkg/cvo/metrics_test.go
Adds Test_collectConditionalUpdateRisks to assert emitted metrics for various ConditionalUpdateRisk cases. The test appears duplicated (two identical definitions) in the file.
Alert Rule
install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml
Adds a PrometheusRule with a RiskApplies alert that fires when cluster_version_risk_conditions indicates an Applies risk, grouped by namespace/name/risk, with a 10-minute window and warning severity.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml`:
- Around line 86-95: The runbook_url for the alert RiskApplies points to a
non-existent path; update the runbook_url field in the RiskApplies alert stanza
to use the cluster-version-operator runbook path used by similar alerts (e.g.,
match the URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.

In `@pkg/cvo/metrics_test.go`:
- Line 976: The test function named Test_collectConditionalUpdates is testing
collectConditionalUpdateRisks (not collectConditionalUpdates), which is
misleading and conflicts with TestCollectUnknownConditionalUpdates; rename the
test to something like Test_collectConditionalUpdateRisks (or
TestCollectConditionalUpdateRisks) and update any references so the test name
clearly matches the function under test (collectConditionalUpdateRisks) and does
not collide with the existing TestCollectUnknownConditionalUpdates that targets
collectConditionalUpdates.

@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch 2 times, most recently from 3b6c582 to 129bc80 Compare February 15, 2026 13:52
@hongkailiu
Copy link
Member Author

Testing with 129bc80:

launch 4.22,openshift/cluster-version-operator#1318 aws,techpreview

Then

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://fauxinnati-fauxinnati.apps.ota-stage.q2z4.p1.openshiftapps.com/api/upgrades_info/graph"}}' --type=merge

$ oc adm upgrade channel risks-always

The alert is Pending.

Screenshot 2026-02-15 at 10 46 12

and

Screenshot 2026-02-15 at 10 46 27

@hongkailiu
Copy link
Member Author

/hold

This feature has to be gated by TP.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 16, 2026
@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch 5 times, most recently from 9138c46 to 2822b83 Compare February 16, 2026 22:07
@hongkailiu
Copy link
Member Author

hongkailiu commented Feb 16, 2026

Test with 2822b83

launch 4.22,openshift/cluster-version-operator#1318 aws,techpreview
$ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
NAME                                    AGE
cluster-version-operator-tech-preview   37m

Repeating the steps in #1318 (comment) get the same result.


launch 4.22,openshift/cluster-version-operator#1318 gcp,single-node
$ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
Error from server (NotFound): prometheusrules.monitoring.coreos.com "cluster-version-operator-tech-preview" not found

and no data found for cluster_version_risk_conditions.

Screenshot 2026-02-16 at 23 10 44

@hongkailiu
Copy link
Member Author

/retest-required

@hongkailiu
Copy link
Member Author

Still holding.
Wait for #1323 to get in.

Copy link
Member Author

@hongkailiu hongkailiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. @simonpasquier

@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch 2 times, most recently from f20f114 to e37e81d Compare February 18, 2026 00:58
@hongkailiu
Copy link
Member Author

/retest-required

@hongkailiu
Copy link
Member Author

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

2 similar comments
@hongkailiu
Copy link
Member Author

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

@hongkailiu
Copy link
Member Author

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

@hongkailiu
Copy link
Member Author

/retest-required

annotations:
summary: The cluster has been exposed to the conditional update risk for 10 minutes.
description: The conditional update risk {{ "{{ $labels.risk }}" }} applies to the cluster, and the cluster update to a version exposed to the risk is not recommended. For more information refer to 'oc adm upgrade'.
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch from 4358e43 to e975641 Compare February 24, 2026 20:19
@hongkailiu
Copy link
Member Author

/retest-required

@hongkailiu
Copy link
Member Author

tested with e975641

and the alert is still working

Screenshot 2026-02-25 at 16 31 55

/verified by @hongkailiu

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 25, 2026
@openshift-ci-robot
Copy link
Contributor

@hongkailiu: This PR has been marked as verified by @hongkailiu.

Details

In response to this:

tested with e975641

and the alert is still working

Screenshot 2026-02-25 at 16 31 55

/verified by @hongkailiu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Copy link
Member Author

/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323

@hongkailiu
Copy link
Member Author

/test e2e-hypershift

@hongkailiu
Copy link
Member Author

The failed cases seems not relevant to the change from this pull

/retest-required

@hongkailiu
Copy link
Member Author

/retest-required

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2026
Help: "Report the conditions for active cluster operators. 0 is False and 1 is True.",
}, []string{"name", "condition", "reason"}),
clusterVersionRiskConditions: prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "cluster_version_risk_conditions",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not worth holding on, but personally it bugs me that cluster_operator_conditions uses a plural name when each time-series will only be talking about a single condition at a time. I'd have prefered cluster_operator_condition there and cluster_version_risk_condition here. Too late to be worth changing cluster_operator_conditions at this point. Maybe that means we go plural again just for CVO-metric-weirdness consistency. Or maybe we can pivot to singular before going GA.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream has the same preference on Singulars, such as kube_node_status_condition and kube_deployment_status_condition from CHANGELOG.md.

$ rg '_condition' -c
CHANGELOG.md:5
pkg/customresourcestate/registry_factory_test.go:2
internal/store/persistentvolumeclaim_test.go:39
internal/store/node.go:1
internal/store/persistentvolumeclaim.go:1
internal/store/horizontalpodautoscaler.go:1
internal/store/namespace.go:1
internal/store/deployment_test.go:23
internal/store/horizontalpodautoscaler_test.go:10
internal/store/node_test.go:36
internal/store/certificatesigningrequest_test.go:30
internal/store/namespace_test.go:11
internal/store/certificatesigningrequest.go:1
internal/store/deployment.go:1
docs/metrics/auth/certificatesigningrequest-metrics.md:1
docs/metrics/workload/deployment-metrics.md:1
docs/metrics/workload/horizontalpodautoscaler-metrics.md:1
docs/metrics/storage/persistentvolumeclaim-metrics.md:1
docs/metrics/cluster/node-metrics.md:1
docs/metrics/cluster/namespace-metrics.md:1

$ rg '_conditions' -c
pkg/customresourcestate/registry_factory_test.go:2

Maybe that means we go plural again just for CVO-metric-weirdness consistency. Or maybe we can pivot to singular before going GA.

Works for me.
If I get time after completing the features, I will make a pull to rename both.

Copy link
Member

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026
@hongkailiu
Copy link
Member Author

Failed on random cases.

/retest-required

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Feb 27, 2026
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026
Copy link
Member

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fao89, hongkailiu, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [fao89,hongkailiu,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hongkailiu
Copy link
Member Author

/retest-required

@wking
Copy link
Member

wking commented Feb 27, 2026

The runbook-URL fix won't have broken anything, so refreshing this.

/verified by @hongkailiu refreshed after an orthogonal change by @wking

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 27, 2026
@openshift-ci-robot
Copy link
Contributor

@wking: This PR has been marked as verified by @hongkailiu refreshed after an orthogonal change by @wking.

Details

In response to this:

The runbook-URL fix won't have broken anything, so refreshing this.

/verified by @hongkailiu refreshed after an orthogonal change by @wking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 27, 2026

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn-techpreview-serial f458075 link true /test e2e-agnostic-ovn-techpreview-serial
ci/prow/e2e-agnostic-ovn f458075 link true /test e2e-agnostic-ovn
ci/prow/e2e-hypershift f458075 link true /test e2e-hypershift

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hongkailiu
Copy link
Member Author

/retest-required

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants