OTA-1546: Add a metric and an alert for conditional risk conditions#1318
OTA-1546: Add a metric and an alert for conditional risk conditions#1318hongkailiu wants to merge 9 commits intoopenshift:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a Prometheus GaugeVec metric Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml`:
- Around line 86-95: The runbook_url for the alert RiskApplies points to a
non-existent path; update the runbook_url field in the RiskApplies alert stanza
to use the cluster-version-operator runbook path used by similar alerts (e.g.,
match the URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.
In `@pkg/cvo/metrics_test.go`:
- Line 976: The test function named Test_collectConditionalUpdates is testing
collectConditionalUpdateRisks (not collectConditionalUpdates), which is
misleading and conflicts with TestCollectUnknownConditionalUpdates; rename the
test to something like Test_collectConditionalUpdateRisks (or
TestCollectConditionalUpdateRisks) and update any references so the test name
clearly matches the function under test (collectConditionalUpdateRisks) and does
not collide with the existing TestCollectUnknownConditionalUpdates that targets
collectConditionalUpdates.
install/0000_90_cluster-version-operator_02_servicemonitor.yaml
Outdated
Show resolved
Hide resolved
3b6c582 to
129bc80
Compare
|
Testing with 129bc80: Then The alert is
and
|
|
/hold This feature has to be gated by TP. |
9138c46 to
2822b83
Compare
|
Test with 2822b83 Repeating the steps in #1318 (comment) get the same result. $ oc get prometheusrule -n openshift-cluster-version cluster-version-operator-tech-preview
Error from server (NotFound): prometheusrules.monitoring.coreos.com "cluster-version-operator-tech-preview" not foundand no data found for
|
|
/retest-required |
|
Still holding. |
hongkailiu
left a comment
There was a problem hiding this comment.
Thanks for the review. @simonpasquier
f20f114 to
e37e81d
Compare
|
/retest-required |
|
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323 |
2 similar comments
|
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323 |
|
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323 |
|
/retest-required |
install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml
Outdated
Show resolved
Hide resolved
| annotations: | ||
| summary: The cluster has been exposed to the conditional update risk for 10 minutes. | ||
| description: The conditional update risk {{ "{{ $labels.risk }}" }} applies to the cluster, and the cluster update to a version exposed to the risk is not recommended. For more information refer to 'oc adm upgrade'. | ||
| runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md |
4358e43 to
e975641
Compare
|
/retest-required |
|
tested with e975641 and the alert is still working
/verified by @hongkailiu |
|
@hongkailiu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn-techpreview-serial #1323 |
|
/test e2e-hypershift |
|
The failed cases seems not relevant to the change from this pull /retest-required |
|
/retest-required /hold cancel |
| Help: "Report the conditions for active cluster operators. 0 is False and 1 is True.", | ||
| }, []string{"name", "condition", "reason"}), | ||
| clusterVersionRiskConditions: prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: "cluster_version_risk_conditions", |
There was a problem hiding this comment.
Not worth holding on, but personally it bugs me that cluster_operator_conditions uses a plural name when each time-series will only be talking about a single condition at a time. I'd have prefered cluster_operator_condition there and cluster_version_risk_condition here. Too late to be worth changing cluster_operator_conditions at this point. Maybe that means we go plural again just for CVO-metric-weirdness consistency. Or maybe we can pivot to singular before going GA.
There was a problem hiding this comment.
Upstream has the same preference on Singulars, such as kube_node_status_condition and kube_deployment_status_condition from CHANGELOG.md.
$ rg '_condition' -c
CHANGELOG.md:5
pkg/customresourcestate/registry_factory_test.go:2
internal/store/persistentvolumeclaim_test.go:39
internal/store/node.go:1
internal/store/persistentvolumeclaim.go:1
internal/store/horizontalpodautoscaler.go:1
internal/store/namespace.go:1
internal/store/deployment_test.go:23
internal/store/horizontalpodautoscaler_test.go:10
internal/store/node_test.go:36
internal/store/certificatesigningrequest_test.go:30
internal/store/namespace_test.go:11
internal/store/certificatesigningrequest.go:1
internal/store/deployment.go:1
docs/metrics/auth/certificatesigningrequest-metrics.md:1
docs/metrics/workload/deployment-metrics.md:1
docs/metrics/workload/horizontalpodautoscaler-metrics.md:1
docs/metrics/storage/persistentvolumeclaim-metrics.md:1
docs/metrics/cluster/node-metrics.md:1
docs/metrics/cluster/namespace-metrics.md:1
$ rg '_conditions' -c
pkg/customresourcestate/registry_factory_test.go:2
Maybe that means we go plural again just for CVO-metric-weirdness consistency. Or maybe we can pivot to singular before going GA.
Works for me.
If I get time after completing the features, I will make a pull to rename both.
|
Failed on random cases. /retest-required |
install/0000_90_cluster-version-operator_02_prometheusrule-TechPreviewNoUpgrade.yaml
Outdated
Show resolved
Hide resolved
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fao89, hongkailiu, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
|
The runbook-URL fix won't have broken anything, so refreshing this. /verified by @hongkailiu refreshed after an orthogonal change by @wking |
|
@wking: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@hongkailiu: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/retest-required |




Follow up #1284 (comment)
The new metic is
cluster_version_risk_conditions{name, condition, risk}and the alert is firing if some risk applies to the cluster.The samples for
cluster_version_risk_conditionswill be collectedonly when its operator
shouldReconcileAcceptRisks. It means, e.g.,on a TechPreview disabled cluster the metric is still defined but
has no samples.
The alert is defined in a new
PrometheusRule/cluster-version-operator-tech-previewwhich is installed only on a TechPreview cluster.