feat: add prom metrics for failover reservations#669
Conversation
📝 WalkthroughWalkthroughAdds a FailoverMonitor (Prometheus metrics) and wires it into failover reservation controller construction, registration in main, periodic reconciliation recording, and accompanying unit tests for metrics behavior (plus minor scheduler_client logging cleanup). Changes
Sequence Diagram(s)sequenceDiagram
participant Main as cmd/main
participant Registry as metrics.Registry
participant Monitor as FailoverMonitor
participant Controller as FailoverReservationController
participant Prom as PrometheusScrape
Main->>Monitor: NewFailoverMonitor()
Main->>Registry: Register(Monitor)
alt register success
Registry-->>Main: registered
else register failure
Registry-->>Main: error (logged)
Main-->>Main: continue with monitor=nil
end
Main->>Controller: NewFailoverReservationController(..., monitor)
Controller-->>Main: controller
Note over Controller,Monitor: Periodic reconciliation run
Controller->>Controller: build reconcileSummary
Controller->>Monitor: RecordReconciliation(summary, availabilityZone)
Monitor->>Prom: metrics exposed via Collect/Describe
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/reservations/failover/controller.go`:
- Around line 332-334: summary.totalReservations is being set from the
pre-creation slice (failoverReservations) so it underreports when step 6 creates
additional reservations; update the controller so summary.totalReservations
reflects the actual number after creation by recalculating it once reservation
creation completes (e.g., set summary.totalReservations =
len(failoverReservations) after any newly created reservations have been
appended, or increment summary.totalReservations during the creation loop),
referencing the summary.totalReservations and failoverReservations variables in
the reservation creation flow in controller.go.
In `@internal/scheduling/reservations/scheduler_client.go`:
- Around line 144-147: The current log call logger.V(1).Info("sending external
scheduler request", "url", c.URL, "request", externalSchedulerRequest) leaks
potentially sensitive/high-cardinality data; update the log to remove
externalSchedulerRequest and instead log only explicit, sanitized fields (e.g.,
pipeline ID or name if safe, counts, and boolean flags). Locate the call using
logger.V(1).Info and replace the "request" field with a small structured payload
such as ("pipeline", safePipelineID), ("task_count", len(...)), ("has_priority",
externalSchedulerRequest.Priority != nil) or similar sanitized values derived
from externalSchedulerRequest; ensure no raw identifiers or entire payloads are
logged. Ensure c.URL remains if allowed but avoid including any fields that
could leak identifiers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 60802461-e0fb-4f7e-8dc9-b620d836dfc9
📒 Files selected for processing (6)
cmd/main.gointernal/scheduling/reservations/failover/controller.gointernal/scheduling/reservations/failover/integration_test.gointernal/scheduling/reservations/failover/monitor.gointernal/scheduling/reservations/failover/monitor_test.gointernal/scheduling/reservations/scheduler_client.go
SoWieMarkus
left a comment
There was a problem hiding this comment.
The underreport coderabbit comment should be addressed
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/scheduling/reservations/scheduler_client.go (1)
144-153:⚠️ Potential issue | 🟠 MajorRemove remaining high-cardinality identifiers from request logs
Good improvement removing the full payload, but Line 147 (
instanceUUID) and Line 149 (flavorExtraSpecs) still risk sensitive/high-cardinality logging. Prefer counts/flags only in this log line.Suggested adjustment
logger.V(1).Info("sending external scheduler request", "url", c.URL, "pipeline", req.Pipeline, - "instanceUUID", req.InstanceUUID, - "flavorName", req.FlavorName, - "flavorExtraSpecs", req.FlavorExtraSpecs, "eligibleHostsCount", len(req.EligibleHosts), "ignoreHostsCount", len(req.IgnoreHosts), "hasSchedulerHints", len(req.SchedulerHints) > 0, + "hasFlavorExtraSpecs", len(req.FlavorExtraSpecs) > 0, "availabilityZone", req.AvailabilityZone)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/scheduler_client.go` around lines 144 - 153, The log in scheduler_client.go is still emitting high-cardinality fields; remove req.InstanceUUID and req.FlavorExtraSpecs from the logger call and replace them with safe summary values (e.g., "hasInstanceUUID": req.InstanceUUID != "" or "instanceUUIDPresent": bool, and "flavorExtraSpecsCount": len(req.FlavorExtraSpecs) or "hasFlavorExtraSpecs": len(req.FlavorExtraSpecs) > 0) in the logger.V(1).Info invocation so the entry no longer contains sensitive/unique identifiers while preserving useful telemetry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/scheduling/reservations/scheduler_client.go`:
- Around line 144-153: The log in scheduler_client.go is still emitting
high-cardinality fields; remove req.InstanceUUID and req.FlavorExtraSpecs from
the logger call and replace them with safe summary values (e.g.,
"hasInstanceUUID": req.InstanceUUID != "" or "instanceUUIDPresent": bool, and
"flavorExtraSpecsCount": len(req.FlavorExtraSpecs) or "hasFlavorExtraSpecs":
len(req.FlavorExtraSpecs) > 0) in the logger.V(1).Info invocation so the entry
no longer contains sensitive/unique identifiers while preserving useful
telemetry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 21a266cd-7e79-4e91-a66e-0397bde194f1
📒 Files selected for processing (2)
internal/scheduling/reservations/failover/controller.gointernal/scheduling/reservations/scheduler_client.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/scheduling/reservations/failover/controller.go
Test Coverage ReportTest Coverage 📊: 69.0% |
No description provided.