feat: add multi-cluster installation configs for observability-metrics-prometheus Helm chart#73
Conversation
📝 WalkthroughWalkthroughAdds three installation modes (singleCluster, multiClusterExporter, multiClusterReceiver) via new Helm values and conditional templates, introduces RBAC and PrometheusAgent for exporters, an HTTPRoute for receivers, validation helper, README updates, and bumps Helm chart version to 0.3.0. Changes
Sequence Diagram(s)sequenceDiagram
participant Agent as PrometheusAgent\n(MultiCluster Exporter)
participant K8s as Kubernetes\n(ServiceDiscovery)
participant Gateway as Gateway\n(HTTPRoute / observability receiver)
participant Receiver as Prometheus\n(MultiCluster Receiver)
Agent->>K8s: discover targets (pods, nodes, services)
K8s-->>Agent: metrics endpoints
Agent->>Gateway: HTTP remote_write -> observabilityPlaneUrl (/api/v1/write)
Gateway->>Receiver: forward remote_write traffic to Prometheus receiver
Receiver->>Receiver: ingest and store time series
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml (1)
4-28: Consider validating hostnames for multiClusterReceiver mode.The HTTPRoute is rendered without hostnames if
.Values.prometheusCustomizations.http.hostnamesis empty. While this may work for some gateway configurations, exporter clusters typically need a DNS hostname to reach the receiver. Consider adding validation in_helpers.tplto require hostnames formultiClusterReceivermode, similar to howobservabilityPlaneUrlis required formultiClusterExporter.Additionally, the parent gateway name
gateway-defaultis hardcoded. If users have a different gateway name, this would need modification.parentRefs: - - name: gateway-default + - name: {{ .Values.prometheusCustomizations.gateway.name | default "gateway-default" }} namespace: {{ .Release.Namespace }}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml` around lines 4 - 28, The HTTPRoute template for multiClusterReceiver renders without validating hostnames and hardcodes the parent gateway name, so add a validation helper in _helpers.tpl that enforces .Values.prometheusCustomizations.http.hostnames is non-empty when .Values.global.installationMode == "multiClusterReceiver" (like the observabilityPlaneUrl check for multiClusterExporter) and update the template to reference a configurable gateway name value (e.g., .Values.global.gatewayName) instead of the literal "gateway-default"; ensure error messages and documentation mention the required hostnames and configurable gateway value.observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml (1)
8-9: Avoid fixed RBAC names to prevent cross-release collisions.Using a fixed cluster-scoped
ClusterRole/ClusterRoleBindingname can conflict when the chart is installed more than once in a cluster.Proposed patch
metadata: - name: openchoreo-observability-agent + name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }} @@ metadata: - name: openchoreo-observability-agent + name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }} @@ metadata: - name: openchoreo-observability-agent + name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }} roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole - name: openchoreo-observability-agent + name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }} subjects: - kind: ServiceAccount - name: openchoreo-observability-agent + name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }}Also applies to: 14-15, 38-46
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml` around lines 8 - 9, The RBAC resources use a fixed name ("openchoreo-observability-agent") which can collide across releases; update the metadata.name fields for the ServiceAccount/ClusterRole/ClusterRoleBinding to be release-scoped (e.g., incorporate {{ .Release.Name }} or the chart fullname template) instead of the constant string so each release gets unique names; apply the same change to the ClusterRole/ClusterRoleBinding declarations referenced at the other locations (lines around the ClusterRole/ClusterRoleBinding blocks) so all RBAC objects are namespaced or suffixed/prefixed by the release identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@observability-metrics-prometheus/helm/templates/_helpers.tpl`:
- Around line 13-20: The template accesses nested values directly
(prometheusCustomizations.http.observabilityPlaneUrl and
$kps.prometheus.enabled) which can cause nil dereferences; update the
_helpers.tpl check to perform nil-safe reads by verifying parent keys exist (use
hasKey/.Values or default/lookup patterns) before accessing .http or
.prometheus, and short-circuit the validation if those parents are missing so
you still emit the intended fail messages safely; specifically guard access to
.Values.prometheusCustomizations, .Values.prometheusCustomizations.http, and the
$kps.prometheus object (from index .Values "kube-prometheus-stack") before
checking observabilityPlaneUrl or enabled.
In `@observability-metrics-prometheus/README.md`:
- Around line 115-116: Replace the misleading metric suggestion
`prometheus_tsdb_symbol_table_size_bytes` with more appropriate receiver/import
error metrics: recommend using `prometheus_remote_storage_samples_failed_total`
to detect remote-write/import failures and/or
`prometheus_tsdb_out_of_order_samples_total` for out-of-order sample problems;
update the README line that currently lists
`rate(prometheus_tsdb_symbol_table_size_bytes[5m])` to suggest
`rate(prometheus_remote_storage_samples_failed_total[5m])` and optionally
`rate(prometheus_tsdb_out_of_order_samples_total[5m])` so the guidance points to
real import-error signals.
---
Nitpick comments:
In `@observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml`:
- Around line 8-9: The RBAC resources use a fixed name
("openchoreo-observability-agent") which can collide across releases; update the
metadata.name fields for the ServiceAccount/ClusterRole/ClusterRoleBinding to be
release-scoped (e.g., incorporate {{ .Release.Name }} or the chart fullname
template) instead of the constant string so each release gets unique names;
apply the same change to the ClusterRole/ClusterRoleBinding declarations
referenced at the other locations (lines around the
ClusterRole/ClusterRoleBinding blocks) so all RBAC objects are namespaced or
suffixed/prefixed by the release identifier.
In
`@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml`:
- Around line 4-28: The HTTPRoute template for multiClusterReceiver renders
without validating hostnames and hardcodes the parent gateway name, so add a
validation helper in _helpers.tpl that enforces
.Values.prometheusCustomizations.http.hostnames is non-empty when
.Values.global.installationMode == "multiClusterReceiver" (like the
observabilityPlaneUrl check for multiClusterExporter) and update the template to
reference a configurable gateway name value (e.g., .Values.global.gatewayName)
instead of the literal "gateway-default"; ensure error messages and
documentation mention the required hostnames and configurable gateway value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 730f074b-206f-47d4-9b5d-e025f37c267d
📒 Files selected for processing (10)
observability-metrics-prometheus/README.mdobservability-metrics-prometheus/helm/Chart.yamlobservability-metrics-prometheus/helm/templates/_helpers.tplobservability-metrics-prometheus/helm/templates/adapter/deployment.yamlobservability-metrics-prometheus/helm/templates/adapter/service.yamlobservability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yamlobservability-metrics-prometheus/helm/templates/prometheus-agent.yamlobservability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yamlobservability-metrics-prometheus/helm/templates/validate.yamlobservability-metrics-prometheus/helm/values.yaml
| 3. Check that queue capacity and batch settings are appropriate for your metrics volume | ||
| 4. Monitor central Prometheus for import errors: `rate(prometheus_tsdb_symbol_table_size_bytes[5m])` |
There was a problem hiding this comment.
Consider using a more appropriate metric for detecting import errors.
prometheus_tsdb_symbol_table_size_bytes indicates symbol table size, not import errors. For troubleshooting remote write issues at the receiver, consider suggesting prometheus_remote_storage_samples_failed_total or checking for prometheus_tsdb_out_of_order_samples_total instead.
📝 Suggested fix
-4. Monitor central Prometheus for import errors: `rate(prometheus_tsdb_symbol_table_size_bytes[5m])`
+4. Monitor central Prometheus for import errors: `rate(prometheus_remote_storage_samples_failed_total[5m])` or `rate(prometheus_tsdb_out_of_order_samples_total[5m])`🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@observability-metrics-prometheus/README.md` around lines 115 - 116, Replace
the misleading metric suggestion `prometheus_tsdb_symbol_table_size_bytes` with
more appropriate receiver/import error metrics: recommend using
`prometheus_remote_storage_samples_failed_total` to detect remote-write/import
failures and/or `prometheus_tsdb_out_of_order_samples_total` for out-of-order
sample problems; update the README line that currently lists
`rate(prometheus_tsdb_symbol_table_size_bytes[5m])` to suggest
`rate(prometheus_remote_storage_samples_failed_total[5m])` and optionally
`rate(prometheus_tsdb_out_of_order_samples_total[5m])` so the guidance points to
real import-error signals.
…s-prometheus Helm chart Signed-off-by: Akila-I <akila.99g@gmail.com>
6c41b36 to
fa43d27
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml (1)
12-13: Make parent gateway name configurable.
gateway-defaultis hardcoded at Line 12, which can break receiver mode in environments using a different Gateway name.♻️ Proposed change
parentRefs: - - name: gateway-default + - name: {{ default "gateway-default" .Values.prometheusCustomizations.http.gatewayName }} namespace: {{ .Release.Namespace }}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml` around lines 12 - 13, The parent gateway name is hardcoded as "gateway-default" in prometheus-remote-write-route.yaml; make it configurable by replacing the literal with a Helm value (e.g., .Values.gateway.name or .Values.gatewayName) and provide a sensible default (still "gateway-default") in values.yaml; update the template to use that value where the gateway name is set so receiver mode works with custom Gateway names (refer to the gateway name entry in the prometheus-remote-write-route.yaml template).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml`:
- Around line 14-43: ClusterRole and ClusterRoleBinding use the static name
"openchoreo-prometheus-agent" which will conflict across releases; change the
metadata.name for the ClusterRole and ClusterRoleBinding and the roleRef.name
that points to it to a release-scoped name (e.g., derive from the Helm release
name) in prometheus-agent-rbac.yaml, and update prometheus-agent.yaml to
reference that same release-scoped name so both templates stay consistent;
ensure all occurrences of "openchoreo-prometheus-agent" (metadata.name,
roleRef.name, any subject names) are replaced with the templated release-scoped
identifier used across both files.
---
Nitpick comments:
In
`@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml`:
- Around line 12-13: The parent gateway name is hardcoded as "gateway-default"
in prometheus-remote-write-route.yaml; make it configurable by replacing the
literal with a Helm value (e.g., .Values.gateway.name or .Values.gatewayName)
and provide a sensible default (still "gateway-default") in values.yaml; update
the template to use that value where the gateway name is set so receiver mode
works with custom Gateway names (refer to the gateway name entry in the
prometheus-remote-write-route.yaml template).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cfcf1e9b-6888-419d-9905-e5ce627a753c
📒 Files selected for processing (10)
observability-metrics-prometheus/README.mdobservability-metrics-prometheus/helm/Chart.yamlobservability-metrics-prometheus/helm/templates/_helpers.tplobservability-metrics-prometheus/helm/templates/adapter/deployment.yamlobservability-metrics-prometheus/helm/templates/adapter/service.yamlobservability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yamlobservability-metrics-prometheus/helm/templates/prometheus-agent.yamlobservability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yamlobservability-metrics-prometheus/helm/templates/validate.yamlobservability-metrics-prometheus/helm/values.yaml
✅ Files skipped from review due to trivial changes (2)
- observability-metrics-prometheus/helm/Chart.yaml
- observability-metrics-prometheus/helm/values.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- observability-metrics-prometheus/helm/templates/_helpers.tpl
- observability-metrics-prometheus/README.md
Purpose
This pull request introduces significant enhancements to the Observability Metrics Prometheus module, focusing on supporting multi-cluster topologies and improving deployment flexibility. The changes add explicit support for three installation modes—
singleCluster,multiClusterReceiver, andmultiClusterExporter—along with corresponding Helm chart logic, documentation, and validation to guide users through setup and troubleshooting.Key changes:
Multi-cluster topology support:
singleCluster,multiClusterReceiver,multiClusterExporter) with corresponding configuration options invalues.yamland detailed documentation inREADME.md. [1] [2] [3]PrometheusAgentdeployment, RBAC, and configuration for themultiClusterExportermode, enabling remote write to a central receiver. [1] [2]HTTPRoutemanifest for remote write traffic inmultiClusterReceivermode, allowing exporter clusters to send metrics to the central receiver.Helm chart validation and logic:
Documentation and versioning:
README.mdwith detailed installation, verification, and troubleshooting instructions for all supported topologies.0.2.6to reflect the new features.Approach
Related Issues
Checklist
Remarks
Summary by CodeRabbit
New Features
Documentation
Chores