Skip to content

feat: add multi-cluster installation configs for observability-metrics-prometheus Helm chart#73

Merged
akila-i merged 1 commit intoopenchoreo:mainfrom
akila-i:multi-cluster-prom
Apr 16, 2026
Merged

feat: add multi-cluster installation configs for observability-metrics-prometheus Helm chart#73
akila-i merged 1 commit intoopenchoreo:mainfrom
akila-i:multi-cluster-prom

Conversation

@akila-i
Copy link
Copy Markdown
Contributor

@akila-i akila-i commented Apr 8, 2026

Purpose

This pull request introduces significant enhancements to the Observability Metrics Prometheus module, focusing on supporting multi-cluster topologies and improving deployment flexibility. The changes add explicit support for three installation modes—singleCluster, multiClusterReceiver, and multiClusterExporter—along with corresponding Helm chart logic, documentation, and validation to guide users through setup and troubleshooting.

Key changes:

Multi-cluster topology support:

  • Added support for three installation modes (singleCluster, multiClusterReceiver, multiClusterExporter) with corresponding configuration options in values.yaml and detailed documentation in README.md. [1] [2] [3]
  • Introduced a new PrometheusAgent deployment, RBAC, and configuration for the multiClusterExporter mode, enabling remote write to a central receiver. [1] [2]
  • Added a HTTPRoute manifest for remote write traffic in multiClusterReceiver mode, allowing exporter clusters to send metrics to the central receiver.

Helm chart validation and logic:

  • Implemented a validation template to enforce correct installation mode selection and required values, preventing misconfiguration at install/upgrade time. [1] [2]
  • Updated adapter deployment and service templates to only deploy in appropriate modes, ensuring resources are not unnecessarily created in exporter mode. [1] [2] [3] [4]

Documentation and versioning:

  • Expanded the README.md with detailed installation, verification, and troubleshooting instructions for all supported topologies.
  • Bumped chart and app versions to 0.2.6 to reflect the new features.

Approach

Summarize the solution and implementation details.

Related Issues

Include any related issues that are resolved by this PR.

Checklist

  • Tests added or updated (unit, integration, etc.)
  • Samples updated (if applicable)

Remarks

Add any additional context, known issues, or TODOs related to this PR.

Summary by CodeRabbit

  • New Features

    • Support for three installation modes: single-cluster, multi-cluster receiver, and multi-cluster exporter.
    • New config options for remote-write endpoint and gateway hostnames for multi-cluster setups.
  • Documentation

    • Expanded installation guide with separate single-cluster vs multi-cluster steps, exporter/receiver procedures, verification queries, and troubleshooting tips.
  • Chores

    • Chart version bumped to 0.3.0.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Walkthrough

Adds three installation modes (singleCluster, multiClusterExporter, multiClusterReceiver) via new Helm values and conditional templates, introduces RBAC and PrometheusAgent for exporters, an HTTPRoute for receivers, validation helper, README updates, and bumps Helm chart version to 0.3.0.

Changes

Cohort / File(s) Summary
Documentation & chart version
observability-metrics-prometheus/README.md, observability-metrics-prometheus/helm/Chart.yaml
README expanded with installation modes, multi-cluster steps, verification, and troubleshooting. Helm chart version and appVersion bumped to 0.3.0.
Configuration & validation
observability-metrics-prometheus/helm/values.yaml, observability-metrics-prometheus/helm/templates/_helpers.tpl, observability-metrics-prometheus/helm/templates/validate.yaml
Adds global.installationMode and prometheusCustomizations values. New Helm helper observability-metrics-prometheus.validate enforces allowed modes and required fields for multiClusterExporter; validate.yaml invokes it.
Adapter rendering changes
observability-metrics-prometheus/helm/templates/adapter/deployment.yaml, observability-metrics-prometheus/helm/templates/adapter/service.yaml
Adapter Deployment and Service are now conditionally omitted when global.installationMode == "multiClusterExporter".
Multi-cluster exporter resources
observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml, observability-metrics-prometheus/helm/templates/prometheus-agent.yaml
New RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding) and a PrometheusAgent resource rendered when global.installationMode == "multiClusterExporter", configured to remote-write to prometheusCustomizations.http.observabilityPlaneUrl.
Multi-cluster receiver resources
observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml
New HTTPRoute resource rendered when global.installationMode == "multiClusterReceiver" to expose /api/v1/write and forward to openchoreo-observability-prometheus:9091.

Sequence Diagram(s)

sequenceDiagram
    participant Agent as PrometheusAgent\n(MultiCluster Exporter)
    participant K8s as Kubernetes\n(ServiceDiscovery)
    participant Gateway as Gateway\n(HTTPRoute / observability receiver)
    participant Receiver as Prometheus\n(MultiCluster Receiver)

    Agent->>K8s: discover targets (pods, nodes, services)
    K8s-->>Agent: metrics endpoints
    Agent->>Gateway: HTTP remote_write -> observabilityPlaneUrl (/api/v1/write)
    Gateway->>Receiver: forward remote_write traffic to Prometheus receiver
    Receiver->>Receiver: ingest and store time series
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nibble logs and metrics bright,

agents scurry through the night,
receivers listen, routes align,
three modes hop—single, export, combine,
small paws clap for observability's light.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately describes the main change: adding multi-cluster installation configurations to the Helm chart, which is the primary objective reflected throughout the changeset.
Description check ✅ Passed The PR description provides a comprehensive Purpose section detailing key changes and multi-cluster topology support, but the Approach, Related Issues, and Remarks sections are incomplete with only placeholder text.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml (1)

4-28: Consider validating hostnames for multiClusterReceiver mode.

The HTTPRoute is rendered without hostnames if .Values.prometheusCustomizations.http.hostnames is empty. While this may work for some gateway configurations, exporter clusters typically need a DNS hostname to reach the receiver. Consider adding validation in _helpers.tpl to require hostnames for multiClusterReceiver mode, similar to how observabilityPlaneUrl is required for multiClusterExporter.

Additionally, the parent gateway name gateway-default is hardcoded. If users have a different gateway name, this would need modification.

  parentRefs:
-    - name: gateway-default
+    - name: {{ .Values.prometheusCustomizations.gateway.name | default "gateway-default" }}
      namespace: {{ .Release.Namespace }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml`
around lines 4 - 28, The HTTPRoute template for multiClusterReceiver renders
without validating hostnames and hardcodes the parent gateway name, so add a
validation helper in _helpers.tpl that enforces
.Values.prometheusCustomizations.http.hostnames is non-empty when
.Values.global.installationMode == "multiClusterReceiver" (like the
observabilityPlaneUrl check for multiClusterExporter) and update the template to
reference a configurable gateway name value (e.g., .Values.global.gatewayName)
instead of the literal "gateway-default"; ensure error messages and
documentation mention the required hostnames and configurable gateway value.
observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml (1)

8-9: Avoid fixed RBAC names to prevent cross-release collisions.

Using a fixed cluster-scoped ClusterRole/ClusterRoleBinding name can conflict when the chart is installed more than once in a cluster.

Proposed patch
 metadata:
-  name: openchoreo-observability-agent
+  name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }}
@@
 metadata:
-  name: openchoreo-observability-agent
+  name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }}
@@
 metadata:
-  name: openchoreo-observability-agent
+  name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }}
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: openchoreo-observability-agent
+  name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }}
 subjects:
   - kind: ServiceAccount
-    name: openchoreo-observability-agent
+    name: {{ printf "%s-observability-agent" .Release.Name | trunc 63 | trimSuffix "-" }}

Also applies to: 14-15, 38-46

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml`
around lines 8 - 9, The RBAC resources use a fixed name
("openchoreo-observability-agent") which can collide across releases; update the
metadata.name fields for the ServiceAccount/ClusterRole/ClusterRoleBinding to be
release-scoped (e.g., incorporate {{ .Release.Name }} or the chart fullname
template) instead of the constant string so each release gets unique names;
apply the same change to the ClusterRole/ClusterRoleBinding declarations
referenced at the other locations (lines around the
ClusterRole/ClusterRoleBinding blocks) so all RBAC objects are namespaced or
suffixed/prefixed by the release identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@observability-metrics-prometheus/helm/templates/_helpers.tpl`:
- Around line 13-20: The template accesses nested values directly
(prometheusCustomizations.http.observabilityPlaneUrl and
$kps.prometheus.enabled) which can cause nil dereferences; update the
_helpers.tpl check to perform nil-safe reads by verifying parent keys exist (use
hasKey/.Values or default/lookup patterns) before accessing .http or
.prometheus, and short-circuit the validation if those parents are missing so
you still emit the intended fail messages safely; specifically guard access to
.Values.prometheusCustomizations, .Values.prometheusCustomizations.http, and the
$kps.prometheus object (from index .Values "kube-prometheus-stack") before
checking observabilityPlaneUrl or enabled.

In `@observability-metrics-prometheus/README.md`:
- Around line 115-116: Replace the misleading metric suggestion
`prometheus_tsdb_symbol_table_size_bytes` with more appropriate receiver/import
error metrics: recommend using `prometheus_remote_storage_samples_failed_total`
to detect remote-write/import failures and/or
`prometheus_tsdb_out_of_order_samples_total` for out-of-order sample problems;
update the README line that currently lists
`rate(prometheus_tsdb_symbol_table_size_bytes[5m])` to suggest
`rate(prometheus_remote_storage_samples_failed_total[5m])` and optionally
`rate(prometheus_tsdb_out_of_order_samples_total[5m])` so the guidance points to
real import-error signals.

---

Nitpick comments:
In `@observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml`:
- Around line 8-9: The RBAC resources use a fixed name
("openchoreo-observability-agent") which can collide across releases; update the
metadata.name fields for the ServiceAccount/ClusterRole/ClusterRoleBinding to be
release-scoped (e.g., incorporate {{ .Release.Name }} or the chart fullname
template) instead of the constant string so each release gets unique names;
apply the same change to the ClusterRole/ClusterRoleBinding declarations
referenced at the other locations (lines around the
ClusterRole/ClusterRoleBinding blocks) so all RBAC objects are namespaced or
suffixed/prefixed by the release identifier.

In
`@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml`:
- Around line 4-28: The HTTPRoute template for multiClusterReceiver renders
without validating hostnames and hardcodes the parent gateway name, so add a
validation helper in _helpers.tpl that enforces
.Values.prometheusCustomizations.http.hostnames is non-empty when
.Values.global.installationMode == "multiClusterReceiver" (like the
observabilityPlaneUrl check for multiClusterExporter) and update the template to
reference a configurable gateway name value (e.g., .Values.global.gatewayName)
instead of the literal "gateway-default"; ensure error messages and
documentation mention the required hostnames and configurable gateway value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 730f074b-206f-47d4-9b5d-e025f37c267d

📥 Commits

Reviewing files that changed from the base of the PR and between bd7385c and 6c41b36.

📒 Files selected for processing (10)
  • observability-metrics-prometheus/README.md
  • observability-metrics-prometheus/helm/Chart.yaml
  • observability-metrics-prometheus/helm/templates/_helpers.tpl
  • observability-metrics-prometheus/helm/templates/adapter/deployment.yaml
  • observability-metrics-prometheus/helm/templates/adapter/service.yaml
  • observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml
  • observability-metrics-prometheus/helm/templates/prometheus-agent.yaml
  • observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml
  • observability-metrics-prometheus/helm/templates/validate.yaml
  • observability-metrics-prometheus/helm/values.yaml

Comment thread observability-metrics-prometheus/helm/templates/_helpers.tpl
Comment on lines +115 to +116
3. Check that queue capacity and batch settings are appropriate for your metrics volume
4. Monitor central Prometheus for import errors: `rate(prometheus_tsdb_symbol_table_size_bytes[5m])`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Consider using a more appropriate metric for detecting import errors.

prometheus_tsdb_symbol_table_size_bytes indicates symbol table size, not import errors. For troubleshooting remote write issues at the receiver, consider suggesting prometheus_remote_storage_samples_failed_total or checking for prometheus_tsdb_out_of_order_samples_total instead.

📝 Suggested fix
-4. Monitor central Prometheus for import errors: `rate(prometheus_tsdb_symbol_table_size_bytes[5m])`
+4. Monitor central Prometheus for import errors: `rate(prometheus_remote_storage_samples_failed_total[5m])` or `rate(prometheus_tsdb_out_of_order_samples_total[5m])`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@observability-metrics-prometheus/README.md` around lines 115 - 116, Replace
the misleading metric suggestion `prometheus_tsdb_symbol_table_size_bytes` with
more appropriate receiver/import error metrics: recommend using
`prometheus_remote_storage_samples_failed_total` to detect remote-write/import
failures and/or `prometheus_tsdb_out_of_order_samples_total` for out-of-order
sample problems; update the README line that currently lists
`rate(prometheus_tsdb_symbol_table_size_bytes[5m])` to suggest
`rate(prometheus_remote_storage_samples_failed_total[5m])` and optionally
`rate(prometheus_tsdb_out_of_order_samples_total[5m])` so the guidance points to
real import-error signals.

Comment thread observability-metrics-prometheus/helm/templates/prometheus-agent.yaml Outdated
…s-prometheus Helm chart

Signed-off-by: Akila-I <akila.99g@gmail.com>
@akila-i akila-i force-pushed the multi-cluster-prom branch from 6c41b36 to fa43d27 Compare April 16, 2026 02:58
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml (1)

12-13: Make parent gateway name configurable.

gateway-default is hardcoded at Line 12, which can break receiver mode in environments using a different Gateway name.

♻️ Proposed change
   parentRefs:
-    - name: gateway-default
+    - name: {{ default "gateway-default" .Values.prometheusCustomizations.http.gatewayName }}
       namespace: {{ .Release.Namespace }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml`
around lines 12 - 13, The parent gateway name is hardcoded as "gateway-default"
in prometheus-remote-write-route.yaml; make it configurable by replacing the
literal with a Helm value (e.g., .Values.gateway.name or .Values.gatewayName)
and provide a sensible default (still "gateway-default") in values.yaml; update
the template to use that value where the gateway name is set so receiver mode
works with custom Gateway names (refer to the gateway name entry in the
prometheus-remote-write-route.yaml template).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml`:
- Around line 14-43: ClusterRole and ClusterRoleBinding use the static name
"openchoreo-prometheus-agent" which will conflict across releases; change the
metadata.name for the ClusterRole and ClusterRoleBinding and the roleRef.name
that points to it to a release-scoped name (e.g., derive from the Helm release
name) in prometheus-agent-rbac.yaml, and update prometheus-agent.yaml to
reference that same release-scoped name so both templates stay consistent;
ensure all occurrences of "openchoreo-prometheus-agent" (metadata.name,
roleRef.name, any subject names) are replaced with the templated release-scoped
identifier used across both files.

---

Nitpick comments:
In
`@observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml`:
- Around line 12-13: The parent gateway name is hardcoded as "gateway-default"
in prometheus-remote-write-route.yaml; make it configurable by replacing the
literal with a Helm value (e.g., .Values.gateway.name or .Values.gatewayName)
and provide a sensible default (still "gateway-default") in values.yaml; update
the template to use that value where the gateway name is set so receiver mode
works with custom Gateway names (refer to the gateway name entry in the
prometheus-remote-write-route.yaml template).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cfcf1e9b-6888-419d-9905-e5ce627a753c

📥 Commits

Reviewing files that changed from the base of the PR and between 6c41b36 and fa43d27.

📒 Files selected for processing (10)
  • observability-metrics-prometheus/README.md
  • observability-metrics-prometheus/helm/Chart.yaml
  • observability-metrics-prometheus/helm/templates/_helpers.tpl
  • observability-metrics-prometheus/helm/templates/adapter/deployment.yaml
  • observability-metrics-prometheus/helm/templates/adapter/service.yaml
  • observability-metrics-prometheus/helm/templates/prometheus-agent-rbac.yaml
  • observability-metrics-prometheus/helm/templates/prometheus-agent.yaml
  • observability-metrics-prometheus/helm/templates/prometheus-remote-write-route.yaml
  • observability-metrics-prometheus/helm/templates/validate.yaml
  • observability-metrics-prometheus/helm/values.yaml
✅ Files skipped from review due to trivial changes (2)
  • observability-metrics-prometheus/helm/Chart.yaml
  • observability-metrics-prometheus/helm/values.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • observability-metrics-prometheus/helm/templates/_helpers.tpl
  • observability-metrics-prometheus/README.md

@akila-i akila-i merged commit b5cf76e into openchoreo:main Apr 16, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants