feat: Controllers and http API use CR CRD for managing CR reservation CRDs #773
Conversation
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Reworks commitment handling to be driven by CommittedResource CRDs: adds typed Ready reason constants, changes Change-Commitments to upsert/poll CRs with deterministic rollback, updates controllers to generation-match child Reservations, refactors the syncer and monitoring to manage CRDs, adds parent-generation tracking to Reservations, and updates tests, docs, Helm values, and tooling.
Changes: CommittedResource-driven commitments (single DAG)
Sequence Diagram(s)
sequenceDiagram
participant Client as Change-Commitments API
participant ETCD as CommittedResource CRD (etcd)
participant Controller as CommittedResource Controller
participant ResController as Reservation Controller
participant Scheduler as Scheduler Service
Client->>ETCD: Snapshot existing CR specs / record absences
Client->>ETCD: CreateOrUpdate CommittedResource(s)
alt Requires confirmation
Client->>ETCD: Poll CommittedResource.Status.Conditions[Ready]
else No confirmation
Client-->>Client: Return immediately (async)
end
Controller->>ETCD: Reconcile -> ensure child Reservation CRs exist
Controller->>ResController: Create/patch child Reservations (set ParentGeneration)
ResController->>Scheduler: Request placement
Scheduler-->>ResController: Host(s) / No hosts
ResController->>ResController: Patch Reservation.Status (ObservedParentGeneration, Ready)
ResController-->>Controller: Reservation.Status ready/failed
Controller->>ETCD: Set CommittedResource Ready condition with Reason (Accepted/Planned/Reserving/Rejected)
alt Ready terminal (Accepted/Planned)
ETCD-->>Client: Terminal Ready -> success
else Rejected or Timeout
ETCD-->>Client: Ready=False (Rejected) -> client records reason
Client->>ETCD: Rollback changes in reverse order (restore/delete prior CRs)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~90 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/reservations/commitments/api/change_commitments_test.go`:
- Around line 601-607: The current VerifyCRAbsent method treats any error from
env.K8sClient.Get as evidence the CommittedResource was deleted; change it to
explicitly handle the Kubernetes "not found" case using
apierrors.IsNotFound(err): if Get returns nil then call env.T.Errorf that the CR
still exists, if apierrors.IsNotFound(err) return success, otherwise call
env.T.Fatalf or env.T.Errorf with the unexpected Get error (include err) so real
lookup failures aren't mistaken for deletion. Ensure you import
k8s.io/apimachinery/pkg/api/errors as apierrors and update VerifyCRAbsent
accordingly.
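A minimal sketch of that shape, assuming a test-env struct with `T` and `K8sClient` fields and taking the `v1alpha1.CommittedResource` type as given from the surrounding test package (the helper name and struct layout are assumptions, not the project's actual code):
```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
)

// VerifyCRAbsent fails the test if the CommittedResource still exists, treats
// "not found" as success, and surfaces any other lookup error instead of
// mistaking it for a deletion.
func (env *TestEnv) VerifyCRAbsent(ctx context.Context, crName string) {
	cr := &v1alpha1.CommittedResource{}
	err := env.K8sClient.Get(ctx, types.NamespacedName{Name: crName}, cr)
	switch {
	case err == nil:
		env.T.Errorf("CommittedResource %q still exists, expected it to be deleted", crName)
	case apierrors.IsNotFound(err):
		// Deleted as expected.
	default:
		env.T.Fatalf("unexpected error looking up CommittedResource %q: %v", crName, err)
	}
}
```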
In `@internal/scheduling/reservations/commitments/api/change_commitments.go`:
- Around line 196-202: The snapshot Get() call in change_commitments.go currently
treats any error as "not found" and leaves snap.prevSpec nil, which can cause
rollbackCR() to delete an existing CR; change the logic in the block around
api.client.Get(ctx, types.NamespacedName{Name: crName}, existing) so that: if
err == nil you populate snap.prevSpec as now; if errors.IsNotFound(err) you
treat it as no previous CR; otherwise return/propagate the error to fail the
batch (do not continue) so a transient read error prevents the upsert/rollback
inconsistency; reference crSnapshot, crName, existing, api.client.Get and
rollbackCR when making the change.
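A sketch of the suggested error handling as a small helper; the spec type's generated DeepCopy and the helper name are assumptions for illustration:
```go
// snapshotPrevSpec records the spec of any pre-existing CommittedResource so a
// later rollbackCR can restore it. Only a "not found" result means there was no
// previous CR; any other Get error fails the batch rather than risking a
// rollback that deletes a CR which actually existed before the upsert.
func snapshotPrevSpec(ctx context.Context, c client.Client, crName string) (*v1alpha1.CommittedResourceSpec, error) {
	existing := &v1alpha1.CommittedResource{}
	err := c.Get(ctx, types.NamespacedName{Name: crName}, existing)
	switch {
	case err == nil:
		return existing.Spec.DeepCopy(), nil
	case apierrors.IsNotFound(err):
		return nil, nil
	default:
		return nil, fmt.Errorf("snapshotting CommittedResource %q: %w", crName, err)
	}
}
```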
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`:
- Around line 170-181: The code currently lists all committed-resource
Reservations with r.List and then filters by cr.Spec.CommitmentUUID in memory
(iterating list.Items), which is O(total reservations) per reconcile; fix this
by creating and using a field index (via mgr.GetFieldIndexer().IndexField) on
Reservation.Spec.CommittedResourceReservation.CommitmentUUID (choose an index
key like "spec.committed.commitmentUUID") and then replace the r.List call to
pass client.MatchingFields{ "spec.committed.commitmentUUID":
cr.Spec.CommitmentUUID } so only matching children are returned (alternatively,
ensure a dedicated label is set on child Reservations and query with
client.MatchingLabels keyed by that UUID); remove the in-memory filter over
list.Items after switching to the index/label query.
- Around line 168-186: The checkChildReservationStatus function currently treats
an empty children slice as "allReady" which lets setAccepted mark a
CommittedResource Ready if the cached client is stale; change
checkChildReservationStatus so that when no matching child Reservations are
found it returns allReady=false (pending) instead of true, and additionally
switch the List call used immediately after applyReservationState to an uncached
reader (e.g., use the controller's uncached client/reader) so post-write
verification reads the live API server; update references in
CommittedResourceController.checkChildReservationStatus and the caller sites
(e.g., after applyReservationState and before setAccepted) to use this new
behavior/uncached read.
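For the field-index suggestion in the first comment above, a sketch could look like this; the index key comes from the comment, while the exact Reservation spec layout (including whether CommittedResourceReservation is a pointer) is an assumption:
```go
const commitmentUUIDIndexKey = "spec.committed.commitmentUUID"

// Registered once during manager setup so the cached client can serve indexed lookups.
func registerCommitmentUUIDIndex(ctx context.Context, mgr ctrl.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &v1alpha1.Reservation{}, commitmentUUIDIndexKey,
		func(obj client.Object) []string {
			res := obj.(*v1alpha1.Reservation)
			if res.Spec.CommittedResourceReservation == nil {
				return nil
			}
			return []string{res.Spec.CommittedResourceReservation.CommitmentUUID}
		})
}

// The reconcile-time scan over all reservations then becomes a targeted list.
func (r *CommittedResourceController) childReservations(ctx context.Context, cr *v1alpha1.CommittedResource) (*v1alpha1.ReservationList, error) {
	list := &v1alpha1.ReservationList{}
	err := r.List(ctx, list, client.MatchingFields{commitmentUUIDIndexKey: cr.Spec.CommitmentUUID})
	return list, err
}
```
A label keyed by the commitment UUID would work as well, at the cost of keeping the label in sync with the spec.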
In `@internal/scheduling/reservations/commitments/syncer.go`:
- Around line 256-287: The code currently treats CommittedResource upserts
(variables totalCreated/totalUpdated) as reservation creations and feeds
totalCreated into RecordReservationsCreated(), which will misreport
reservations; instead, stop using totalCreated from the CommittedResource upsert
code paths (functions upsertCommittedResource and
updateCommittedResourceIfExists) when calling RecordReservationsCreated(), and
remove any RecordReservationsCreated(totalCreated) calls in this file (including
the similar call around the later block at lines noted). Only invoke
RecordReservationsCreated() at the point where actual child Reservation objects
are created (the reservation creation logic), or introduce a distinct counter
(e.g., reservationsCreatedForChildren) that is incremented where Reservation
resources are created and pass that to RecordReservationsCreated(); do not use
totalCreated from the CommittedResource upsert logic.
- Around line 303-323: The loop in SyncReservations iterates all
CommittedResource CRs (existingCRs) and treats any not in activeCommitments as
stale and deletes expired ones, but SyncReservations only manages Nova-domain
CRs (applyCommittedResourceSpec sets SchedulingDomainNova); update the logic in
the loop inside SyncReservations so you first skip any CR whose
Spec.SchedulingDomain is not v1alpha1.SchedulingDomainNova (or otherwise not the
Nova domain constant) before incrementing staleCRCount or attempting to GC via
s.Delete; ensure references to activeCommitments, existingCRs.Items, s.List and
s.Delete remain and that only Nova-domain CommittedResource objects are
considered for stale counting and deletion.
- Around line 116-127: The default branch in the commitment status validation
currently just logs and continues, which causes unknown statuses to be treated
as absent later when building activeCommitments and may trigger destructive
orphan cleanup; modify the default case in the switch within syncer.go (the
block validating v1alpha1.CommitmentStatus(commitment.Status)) to append the
commitment's id to the existing skippedUUIDs collection (preserving skipped
UUIDs) in addition to logging, so unknown/new statuses are excluded from
activeCommitments but not treated as deletable; keep the log message and ensure
skippedUUIDs is used consistently when constructing activeCommitments later.
- Around line 335-343: The code currently derives the commitment UUID from the
reservation name using extractCommitmentUUID(res.Name), which fails when child
reservations use custom name prefixes; instead read the UUID from the
reservation spec at res.Spec.CommittedResourceReservation.CommitmentUUID (use
that value for the activeCommitments lookup and deletion logic), falling back to
logging and skipping if the spec field is empty, and keep the deletion call
(s.Delete(ctx, res)) and client.IgnoreNotFound(err) behavior unchanged.
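For the last point, a small helper makes the spec-based lookup explicit; the field path follows the comment, and the pointer-ness of the embedded struct is an assumption:
```go
// commitmentUUIDOf returns the owning commitment UUID recorded in the
// reservation spec instead of parsing it out of the reservation name, so
// custom name prefixes on child reservations no longer break the lookup.
func commitmentUUIDOf(res *v1alpha1.Reservation) (string, bool) {
	crr := res.Spec.CommittedResourceReservation
	if crr == nil || crr.CommitmentUUID == "" {
		return "", false
	}
	return crr.CommitmentUUID, true
}
```
Callers that get `ok == false` would log and skip the reservation rather than delete it.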
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 102a6374-70d5-43da-b44c-3aaf7500c51a
📒 Files selected for processing (12)
- api/v1alpha1/committed_resource_types.go
- docs/reservations/committed-resource-reservations.md
- internal/scheduling/reservations/commitments/api/change_commitments.go
- internal/scheduling/reservations/commitments/api/change_commitments_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/committed_resource_integration_test.go
- internal/scheduling/reservations/commitments/state.go
- internal/scheduling/reservations/commitments/syncer.go
- internal/scheduling/reservations/commitments/syncer_monitor.go
- internal/scheduling/reservations/commitments/syncer_monitor_test.go
- internal/scheduling/reservations/commitments/syncer_test.go
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/scheduling/reservations/commitments/reservation_controller.go (1)
476-483: ⚠️ Potential issue | 🟡 Minor
Missing `echoParentGeneration` call in `reconcileAllocations` status patch.
The `reconcileAllocations` function patches the reservation status at line 478 but doesn't call `echoParentGeneration` beforehand. This could cause the CR controller to miss that this reservation was processed, since `ObservedParentGeneration` won't be updated during allocation reconciliation. Consider adding the echo call before the status patch:
🔧 Proposed fix
```diff
 // Patch Status
+echoParentGeneration(res)
 patch := client.MergeFrom(old)
 if err := r.Status().Patch(ctx, res, patch); err != nil {
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/reservation_controller.go` around lines 476 - 483, In reconcileAllocations, before creating the status patch (patch := client.MergeFrom(old)) and calling r.Status().Patch(ctx, res, patch), invoke echoParentGeneration(ctx, r, res) to update res.Status.ObservedParentGeneration so the controller records that the reservation was processed during allocation reconciliation; this should be done immediately prior to the status patch call so the patched status includes the echoed parent generation.
🧹 Nitpick comments (3)
internal/scheduling/reservations/commitments/syncer_monitor.go (1)
38-42: Consider if histogram buckets cover expected sync durations.
The buckets `[0.5, 1, 5, 10, 30, 60, 120]` seconds should capture most sync runs. If syncs typically complete in under 500ms, you may want to add a smaller bucket (e.g., 0.1s) for better resolution. This is a minor observability enhancement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/syncer_monitor.go` around lines 38 - 42, The histogram buckets for syncDuration (created via prometheus.NewHistogram with HistogramOpts for "cortex_committed_resource_syncer_duration_seconds") currently start at 0.5s; add a smaller bucket (e.g., 0.1) to the Buckets slice so sub-500ms syncs are better resolved — update the Buckets value in the syncDuration initialization to include 0.1 (or another appropriate low bound) while preserving the existing buckets.
internal/scheduling/reservations/commitments/api/change_commitments.go (1)
280-347: Consider adding a comment clarifying the `Reason=Reserving` case.
The `watchCRsUntilReady` function handles terminal states well. Line 332 has a trailing comment about `Reason=Reserving`, but the implicit fall-through behavior (no case match → keep polling) could be clearer for maintainability.
Suggested clarification
```diff
 case cond.Status == metav1.ConditionFalse && cond.Reason == v1alpha1.CommittedResourceReasonRejected:
 	delete(pending, name)
 	rejected[name] = cond.Message
-// Reason=Reserving: controller is placing slots; keep waiting.
+// Reason=Reserving (or any other non-terminal reason): controller is still placing slots; keep waiting.
+default:
+	// continue polling
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/api/change_commitments.go` around lines 280 - 347, In watchCRsUntilReady, make the implicit "Reason=Reserving" behavior explicit so future readers understand we intentionally continue polling; either add a clear comment next to the switch block (or add an explicit case like `case cond.Status==metav1.ConditionFalse && cond.Reason==v1alpha1.CommittedResourceReasonReserving: // keep waiting — controller is placing slots` with no delete) to document that Reserving is not terminal and should be retried, referencing the switch in watchCRsUntilReady that currently only handles Planned/Rejected/True.
internal/scheduling/reservations/commitments/syncer_test.go (1)
788-799: Variable shadows imported package name.
Line 793 declares `var dto dto.Metric`, which shadows the imported `dto` package from github.com/prometheus/client_model/go. While this works, it can cause confusion.
Suggested fix
```diff
 ch := make(chan prometheus.Metric, 10)
 monitor.staleCRs.Collect(ch)
 close(ch)
 m := <-ch
-var dto dto.Metric
-if err := m.Write(&dto); err != nil {
+var metric dto.Metric
+if err := m.Write(&metric); err != nil {
 	t.Fatalf("failed to read metric: %v", err)
 }
-if got := dto.GetGauge().GetValue(); got != 1 {
+if got := metric.GetGauge().GetValue(); got != 1 {
 	t.Errorf("Expected staleCRs gauge=1, got %v", got)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/syncer_test.go` around lines 788 - 799, The test declares "var dto dto.Metric" which shadows the imported dto package; rename the local variable (e.g., "var metric dto.Metric" or "metricDTO := dto.Metric{}") and update subsequent references (change m.Write(&dto) to m.Write(&metric) and dto.GetGauge() to metric.GetGauge()) so the imported package name is not shadowed while keeping the type as dto.Metric and the logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/scheduling/reservations/commitments/reservation_controller.go`:
- Around line 476-483: In reconcileAllocations, before creating the status patch
(patch := client.MergeFrom(old)) and calling r.Status().Patch(ctx, res, patch),
invoke echoParentGeneration(ctx, r, res) to update
res.Status.ObservedParentGeneration so the controller records that the
reservation was processed during allocation reconciliation; this should be done
immediately prior to the status patch call so the patched status includes the
echoed parent generation.
---
Nitpick comments:
In `@internal/scheduling/reservations/commitments/api/change_commitments.go`:
- Around line 280-347: In watchCRsUntilReady, make the implicit
"Reason=Reserving" behavior explicit so future readers understand we
intentionally continue polling; either add a clear comment next to the switch
block (or add an explicit case like `case cond.Status==metav1.ConditionFalse &&
cond.Reason==v1alpha1.CommittedResourceReasonReserving: // keep waiting —
controller is placing slots` with no delete) to document that Reserving is not
terminal and should be retried, referencing the switch in watchCRsUntilReady
that currently only handles Planned/Rejected/True.
In `@internal/scheduling/reservations/commitments/syncer_monitor.go`:
- Around line 38-42: The histogram buckets for syncDuration (created via
prometheus.NewHistogram with HistogramOpts for
"cortex_committed_resource_syncer_duration_seconds") currently start at 0.5s;
add a smaller bucket (e.g., 0.1) to the Buckets slice so sub-500ms syncs are
better resolved — update the Buckets value in the syncDuration initialization to
include 0.1 (or another appropriate low bound) while preserving the existing
buckets.
In `@internal/scheduling/reservations/commitments/syncer_test.go`:
- Around line 788-799: The test declares "var dto dto.Metric" which shadows the
imported dto package; rename the local variable (e.g., "var metric dto.Metric"
or "metricDTO := dto.Metric{}") and update subsequent references (change
m.Write(&dto) to m.Write(&metric) and dto.GetGauge() to metric.GetGauge()) so
the imported package name is not shadowed while keeping the type as dto.Metric
and the logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8afbe690-ff6f-4a48-9a13-7b1e39d6e685
📒 Files selected for processing (17)
- api/v1alpha1/reservation_types.go
- cmd/manager/main.go
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
- helm/library/cortex/files/crds/cortex.cloud_reservations.yaml
- internal/scheduling/reservations/commitments/api/change_commitments.go
- internal/scheduling/reservations/commitments/api/change_commitments_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/committed_resource_integration_test.go
- internal/scheduling/reservations/commitments/field_index.go
- internal/scheduling/reservations/commitments/reservation_controller.go
- internal/scheduling/reservations/commitments/reservation_manager.go
- internal/scheduling/reservations/commitments/state.go
- internal/scheduling/reservations/commitments/syncer.go
- internal/scheduling/reservations/commitments/syncer_monitor.go
- internal/scheduling/reservations/commitments/syncer_monitor_test.go
- internal/scheduling/reservations/commitments/syncer_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/scheduling/reservations/commitments/syncer_monitor_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/committed_resource_integration_test.go
Actionable comments posted: 4
♻️ Duplicate comments (1)
internal/scheduling/reservations/commitments/syncer.go (1)
320-323: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Treat empty `schedulingDomain` as legacy Nova, not foreign.
`CommittedResourceSpec.SchedulingDomain` is optional, so older Nova CRs can legitimately have this field unset. The current check skips those objects from stale counting and expiry GC entirely, which leaves migrated CRs and their child reservations behind.
Suggested fix
```diff
-if cr.Spec.SchedulingDomain != v1alpha1.SchedulingDomainNova {
+if cr.Spec.SchedulingDomain != "" && cr.Spec.SchedulingDomain != v1alpha1.SchedulingDomainNova {
 	continue
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/syncer.go` around lines 320 - 323, The loop over existingCRs.Items in syncer.go currently skips CRs whose Spec.SchedulingDomain is not v1alpha1.SchedulingDomainNova, which incorrectly treats unset (empty) schedulingDomain as foreign; change the check in the loop that uses cr.Spec.SchedulingDomain so that an empty string is considered equivalent to v1alpha1.SchedulingDomainNova (i.e., only continue/skip when SchedulingDomain is non-empty and not equal to v1alpha1.SchedulingDomainNova) so migrated/legacy Nova CRs are included in stale counting and expiry GC.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@helm/bundles/cortex-nova/values.yaml`:
- Around line 144-171: Docs still reference old flat keys (e.g.
committedResourceRequeueIntervalActive) but the values.yaml now uses nested keys
like reservationController.flavorGroupPipelines,
reservationController.pipelineDefault,
reservationController.requeueIntervalActive,
reservationController.requeueIntervalRetry,
committedResourceController.requeueIntervalRetry, and api.watchTimeout /
api.watchPollInterval / api.enableChangeCommitments / api.enableReportUsage /
api.enableReportCapacity; update the documentation to replace every mention of
the old flat keys with the new nested keys, add a short migration table mapping
each old key to its new counterpart (including defaults), and include a brief
example snippet showing the new nested structure so operators can migrate
configuration without misconfiguration.
In `@internal/scheduling/reservations/commitments/api/change_commitments.go`:
- Around line 220-231: The code performs api.client.Delete inside the batch loop
(see isDelete handling that sets snap.wasDeleted and calls api.client.Delete),
which breaks rollback semantics; instead, stop performing destructive deletes
during the watch/processing loop—only mark the snapshot (snap.wasDeleted /
snap.prevSpec) and append to snapshots, and collect the CR names to delete
separately; after the batch fully succeeds (commit phase), iterate the collected
delete list and call api.client.Delete (or perform the irreversible controller
action). Alternatively, implement a reversible state transition during
processing (e.g., set a "toDelete" flag on the commitment resource) and only
execute api.client.Delete when commitSuccessfully; also remove or update any
code paths relying on immediate deletion (see rollbackCR() and its
Create/AlreadyExists behavior) so rollback will fully restore previous state
including finalizers/status by avoiding mid-batch deletes; apply the same change
to the similar block referenced around lines 391-403.
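One way to defer the destructive step, sketched under the assumption that the loop already records `wasDeleted` snapshots and collects the CR names into a slice (names and error wrapping are illustrative):
```go
// commitDeletes runs only after every change in the batch has succeeded, so a
// mid-batch failure never has to resurrect an already-deleted CR during rollback.
func commitDeletes(ctx context.Context, c client.Client, crNames []string) error {
	for _, name := range crNames {
		cr := &v1alpha1.CommittedResource{ObjectMeta: metav1.ObjectMeta{Name: name}}
		if err := c.Delete(ctx, cr); err != nil && !apierrors.IsNotFound(err) {
			return fmt.Errorf("deleting CommittedResource %q: %w", name, err)
		}
	}
	return nil
}
```
During the loop, the delete case would only mark the snapshot and append the name to that slice; `commitDeletes` is called once the whole batch is known to have succeeded.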
In `@internal/scheduling/reservations/commitments/syncer.go`:
- Around line 340-368: The orphan-sweep loop deletes Reservations when their
CommitmentUUID is not in activeCommitments but doesn't account for UUIDs that
were intentionally skipped this cycle; update the loop (iterating
existingReservations.Items) to consult the skippedUUIDs set/map before deleting
(i.e. if skippedUUIDs[commitmentUUID] is true, log and continue), so you only
delete reservations whose commitments are truly inactive and not merely skipped
this run; keep the existing logging and s.Delete(client.IgnoreNotFound(...))
behavior for actual deletions.
- Around line 298-305: activeCommitments is currently populated with all
commitment UUIDs including terminal ones (expired/superseded) which prevents
orphan sweep; when building activeCommitments only add non-terminal commitments:
in the loop over commitmentResult.states check the state field (e.g.
state.IsTerminal / state.Status) and skip states that are terminal before doing
activeCommitments[state.CommitmentUUID] = true, and likewise when adding from
commitmentResult.skippedUUIDs ensure you exclude UUIDs that correspond to
terminal commitments (look up their state/reason in commitmentResult or use the
same terminal check used above) so terminal/expired/superseded commitments are
not considered active for orphan-reservation liveness (also apply the same
filtering in the later block around lines 340-368 where activeCommitments is
augmented).
---
Duplicate comments:
In `@internal/scheduling/reservations/commitments/syncer.go`:
- Around line 320-323: The loop over existingCRs.Items in syncer.go currently
skips CRs whose Spec.SchedulingDomain is not v1alpha1.SchedulingDomainNova,
which incorrectly treats unset (empty) schedulingDomain as foreign; change the
check in the loop that uses cr.Spec.SchedulingDomain so that an empty string is
considered equivalent to v1alpha1.SchedulingDomainNova (i.e., only continue/skip
when SchedulingDomain is non-empty and not equal to
v1alpha1.SchedulingDomainNova) so migrated/legacy Nova CRs are included in stale
counting and expiry GC.
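A sketch of the combined liveness check the two syncer comments above describe; the `commitmentState` type, the `isTerminalCommitment` predicate, and the map value types are assumptions standing in for whatever the syncer actually carries:
```go
// buildLiveCommitments keeps only non-terminal commitments for the orphan sweep;
// expired or superseded commitments no longer shield their child reservations.
func buildLiveCommitments(states []commitmentState) map[string]bool {
	live := make(map[string]bool, len(states))
	for _, st := range states {
		if isTerminalCommitment(st) {
			continue
		}
		live[st.CommitmentUUID] = true
	}
	return live
}

// reservationIsOrphan reports whether a child Reservation may be garbage
// collected: its commitment is neither live nor merely skipped this cycle.
func reservationIsOrphan(commitmentUUID string, live, skipped map[string]bool) bool {
	if skipped[commitmentUUID] {
		return false // skipped this run: not enough information to delete safely
	}
	return !live[commitmentUUID]
}
```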
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 617cd1a5-2e65-4834-b070-30d773843819
📒 Files selected for processing (20)
- cmd/manager/main.go
- helm/bundles/cortex-nova/values.yaml
- internal/scheduling/reservations/commitments/api/change_commitments.go
- internal/scheduling/reservations/commitments/api/change_commitments_e2e_test.go
- internal/scheduling/reservations/commitments/api/change_commitments_test.go
- internal/scheduling/reservations/commitments/api/handler.go
- internal/scheduling/reservations/commitments/api/info.go
- internal/scheduling/reservations/commitments/api/report_capacity.go
- internal/scheduling/reservations/commitments/api/report_usage.go
- internal/scheduling/reservations/commitments/api/report_usage_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/committed_resource_integration_test.go
- internal/scheduling/reservations/commitments/config.go
- internal/scheduling/reservations/commitments/integration_test.go
- internal/scheduling/reservations/commitments/reservation_controller.go
- internal/scheduling/reservations/commitments/reservation_controller_test.go
- internal/scheduling/reservations/commitments/reservation_manager.go
- internal/scheduling/reservations/commitments/state.go
- internal/scheduling/reservations/commitments/syncer.go
🚧 Files skipped from review as they are similar to previous changes (4)
- cmd/manager/main.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/committed_resource_integration_test.go
- internal/scheduling/reservations/commitments/reservation_controller.go
```go
// Build set of commitment UUIDs we should have (processed + skipped)
activeCommitments := make(map[string]bool)
for _, state := range commitmentResult.states {
	activeCommitments[state.CommitmentUUID] = true
}
// Also include skipped commitments - don't delete their CRDs
for uuid := range commitmentResult.skippedUUIDs {
	activeCommitments[uuid] = true
}
```
Exclude terminal commitments from orphan-reservation liveness.
activeCommitments currently includes expired/superseded UUIDs, but updateCommittedResourceIfExists() intentionally does nothing when no CommittedResource exists locally. In the migration case this leaves legacy Reservation CRs permanently exempt from the orphan sweep, so expired commitments can keep stranding capacity until Limes stops returning them.
Suggested direction
```diff
-activeCommitments := make(map[string]bool)
+activeCommitments := make(map[string]bool)
+liveReservationCommitments := make(map[string]bool)
 for _, state := range commitmentResult.states {
 	activeCommitments[state.CommitmentUUID] = true
+	if !isTerminalCommitment(state) {
+		liveReservationCommitments[state.CommitmentUUID] = true
+	}
 }
 for uuid := range commitmentResult.skippedUUIDs {
 	activeCommitments[uuid] = true
+	liveReservationCommitments[uuid] = true
 }
@@
-if !activeCommitments[commitmentUUID] {
+if !liveReservationCommitments[commitmentUUID] {
```
Also applies to: 340-368
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/commitments/syncer.go` around lines 298 -
305, activeCommitments is currently populated with all commitment UUIDs
including terminal ones (expired/superseded) which prevents orphan sweep; when
building activeCommitments only add non-terminal commitments: in the loop over
commitmentResult.states check the state field (e.g. state.IsTerminal /
state.Status) and skip states that are terminal before doing
activeCommitments[state.CommitmentUUID] = true, and likewise when adding from
commitmentResult.skippedUUIDs ensure you exclude UUIDs that correspond to
terminal commitments (look up their state/reason in commitmentResult or use the
same terminal check used above) so terminal/expired/superseded commitments are
not considered active for orphan-reservation liveness (also apply the same
filtering in the later block around lines 340-368 where activeCommitments is
augmented).
@coderabbitai full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 4
🧹 Nitpick comments (3)
tools/visualize-committed-resources/main.go (2)
393-398: 💤 Low value | Minor: Redundant string builder pattern.
The variable `resources` is initialized as an empty string, then a `strings.Builder` is used, and finally concatenated back. This can be simplified.
Simplified pattern
```diff
-resources := ""
-var resourcesSb391 strings.Builder
+var resourcesSb strings.Builder
 for rname, qty := range res.Spec.Resources {
-	fmt.Fprintf(&resourcesSb391, "%s=%s ", rname, qty.String())
+	fmt.Fprintf(&resourcesSb, "%s=%s ", rname, qty.String())
 }
-resources += resourcesSb391.String()
+resources := resourcesSb.String()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/visualize-committed-resources/main.go` around lines 393 - 398, The code redundantly initializes resources and then builds the same string via strings.Builder named resourcesSb391; simplify by removing the interim resources variable and write directly into a single strings.Builder (or directly build the string) when iterating res.Spec.Resources in the loop that currently uses resourcesSb391 and then assigns resources += resourcesSb391.String(); update references to use the single builder's String() result (or the direct string) instead of the redundant concatenation to eliminate resourcesSb391 and the empty resources initialization.
426-431: 💤 Low value | Same pattern duplicated - consider extracting a helper.
The same `strings.Builder` pattern for formatting resources is repeated here (and at lines 393-398). Consider extracting a small helper function if this pattern appears elsewhere.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/visualize-committed-resources/main.go` around lines 426 - 431, The resource-string building logic is duplicated (see the block using resourcesSb422 to format alloc.Resources into the resources string and the similar block at lines 393-398); extract it into a small helper function (e.g., formatResources(resources map[string]Quantity) string) and replace both occurrences to call that helper, keeping the same formatting (name=qty.String() separated by spaces) and removing the local strings.Builder duplication in main.go.
internal/scheduling/reservations/commitments/api/change_commitments.go (1)
353-362: 💤 Low value | Consider logging transient Get errors during polling.
When `Get()` fails at lines 355-356, the error is silently ignored with `continue`. While this is reasonable for transient network issues, persistent errors would cause the loop to time out without any indication of why specific CRs couldn't be checked. Consider logging at a verbose level (e.g., `V(2)`) for debugging purposes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/api/change_commitments.go` around lines 353 - 362, When k8sClient.Get(ctx, types.NamespacedName{Name: name}, cr) fails inside the pending loop, the error is currently ignored; update the loop in change_commitments.go to log that transient Get error at a verbose level (e.g., klog.V(2).Infof or controller-runtime logger V(2)) including the CommittedResource name and the error so operators can diagnose persistent failures while still continuing to retry; keep the existing continue behavior after logging and use the same symbols (pending, k8sClient.Get, name, cr, v1alpha1.CommittedResource, meta.FindStatusCondition) to locate the insertion point.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/reservations/commitments/config.go`:
- Around line 73-79: The API boolean flags in commitmentsConfig are left at Go
zero values (false) when unmarshalled, causing endpoints to be silently disabled
because ApplyDefaults() is only called later for controllers; fix by ensuring
defaults are applied before API initialization: call
commitmentsConfig.ApplyDefaults() immediately after loading via
conf.GetConfigOrDie[commitments.Config]() (or construct the API with
commitmentsapi.DefaultAPIConfig().Merge(commitmentsConfig.API) when calling
commitmentsapi.NewAPIWithConfig), so that the API uses DefaultAPIConfig values
unless explicitly overridden.
In `@internal/scheduling/reservations/commitments/e2e_checks.go`:
- Around line 59-60: The E2E checks use http.DefaultClient.Do(...) (seen as
http.DefaultClient.Do(httpReq) wrapped by must.Return) which has no timeout;
replace these calls (including the other occurrences) with a bounded request
(either create an http.Client with a Timeout set and call client.Do(httpReq) or
wrap the request context using context.WithTimeout(httpReq.Context(), timeout)
and call http.DefaultClient.Do(reqWithCtx)), update the code paths that use
must.Return to accept the error returned by the timed request, and keep the
existing resp.Body.Close() defer; reference the existing symbols
http.DefaultClient.Do, httpReq, must.Return and resp.Body.Close when making the
change.
- Around line 154-157: In the round-trip check in
internal/scheduling/reservations/commitments/e2e_checks.go, don't treat every
non-empty rejectionReason as acceptable; instead implement an allowlist check on
rejectionReason (e.g., match known capacity-related tokens like "no capacity" or
"insufficient capacity") and only log+return as continuing for those allowed
reasons; for any other rejectionReason, log an error (slog.Error) with the
resourceName and reason and fail the check (return an error or mark the check as
failed) so validation/contract regressions do not pass silently.
- Around line 161-189: The usage check currently runs before registering cleanup
so if e2eCheckUsageAPI(...) panics the test commitment is leaked; after the
commitment is accepted (right after the creation/acceptance success) register a
defer that builds the same deleteReq (using InfoVersion, AZ, projectID,
resourceName, testUUID, amount, expiresAt) and calls
e2eSendChangeCommitments(ctx, baseURL, deleteReq) to ensure deletion always
runs; place this defer immediately after the acceptance point so it executes
even if e2eCheckUsageAPI or later logic fails.
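For the round-trip check above, the allowlist could be as small as this; the tolerated tokens are an assumption to illustrate the shape:
```go
// acceptableRejections lists rejection reasons the e2e check tolerates because
// they indicate missing capacity rather than a contract regression.
var acceptableRejections = []string{"no capacity", "insufficient capacity"}

func rejectionIsCapacityRelated(reason string) bool {
	lower := strings.ToLower(reason)
	for _, token := range acceptableRejections {
		if strings.Contains(lower, token) {
			return true
		}
	}
	return false
}
```
Any other non-empty reason would be logged via slog.Error with the resource name and fail the check, and the cleanup registered right after acceptance still deletes the test commitment.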
---
Nitpick comments:
In `@internal/scheduling/reservations/commitments/api/change_commitments.go`:
- Around line 353-362: When k8sClient.Get(ctx, types.NamespacedName{Name: name},
cr) fails inside the pending loop, the error is currently ignored; update the
loop in change_commitments.go to log that transient Get error at a verbose level
(e.g., klog.V(2).Infof or controller-runtime logger V(2)) including the
CommittedResource name and the error so operators can diagnose persistent
failures while still continuing to retry; keep the existing continue behavior
after logging and use the same symbols (pending, k8sClient.Get, name, cr,
v1alpha1.CommittedResource, meta.FindStatusCondition) to locate the insertion
point.
In `@tools/visualize-committed-resources/main.go`:
- Around line 393-398: The code redundantly initializes resources and then
builds the same string via strings.Builder named resourcesSb391; simplify by
removing the interim resources variable and write directly into a single
strings.Builder (or directly build the string) when iterating res.Spec.Resources
in the loop that currently uses resourcesSb391 and then assigns resources +=
resourcesSb391.String(); update references to use the single builder's String()
result (or the direct string) instead of the redundant concatenation to
eliminate resourcesSb391 and the empty resources initialization.
- Around line 426-431: The resource-string building logic is duplicated (see the
block using resourcesSb422 to format alloc.Resources into the resources string
and the similar block at lines 393-398); extract it into a small helper function
(e.g., formatResources(resources map[string]Quantity) string) and replace both
occurrences to call that helper, keeping the same formatting (name=qty.String()
separated by spaces) and removing the local strings.Builder duplication in
main.go.
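The suggested helper might look like the following; `resource.Quantity` from k8s.io/apimachinery is assumed to be the type behind the `qty.String()` calls:
```go
// formatResources renders a resource map as "name=qty " pairs, matching the
// formatting both existing loops in the tool produce.
func formatResources(resources map[string]resource.Quantity) string {
	var sb strings.Builder
	for name, qty := range resources {
		fmt.Fprintf(&sb, "%s=%s ", name, qty.String())
	}
	return sb.String()
}
```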
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 52be840f-aca1-4979-92e5-7270abdbc373
📒 Files selected for processing (11)
- api/v1alpha1/reservation_types.go
- docs/reservations/committed-resource-reservations.md
- helm/bundles/cortex-nova/values.yaml
- helm/library/cortex/files/crds/cortex.cloud_reservations.yaml
- internal/scheduling/reservations/commitments/api/change_commitments.go
- internal/scheduling/reservations/commitments/api/change_commitments_e2e_test.go
- internal/scheduling/reservations/commitments/config.go
- internal/scheduling/reservations/commitments/e2e_checks.go
- internal/scheduling/reservations/commitments/reservation_controller.go
- internal/scheduling/reservations/commitments/state.go
- tools/visualize-committed-resources/main.go
✅ Files skipped from review due to trivial changes (2)
- api/v1alpha1/reservation_types.go
- internal/scheduling/reservations/commitments/reservation_controller.go
🚧 Files skipped from review as they are similar to previous changes (3)
- helm/bundles/cortex-nova/values.yaml
- internal/scheduling/reservations/commitments/api/change_commitments_e2e_test.go
- helm/library/cortex/files/crds/cortex.cloud_reservations.yaml
```go
resp := must.Return(http.DefaultClient.Do(httpReq))
defer resp.Body.Close()
```
Outbound HTTP calls have no explicit timeout
Using http.DefaultClient without a timeout can hang these checks indefinitely when the endpoint is slow/stuck. Add a bounded timeout client (or per-request context timeout) for all E2E HTTP calls.
Proposed timeout client
```diff
+var e2eHTTPClient = &http.Client{
+	Timeout: 15 * time.Second,
+}
 ...
-resp := must.Return(http.DefaultClient.Do(httpReq))
+resp := must.Return(e2eHTTPClient.Do(httpReq))
```
Also applies to: 203-204, 223-224, 243-244
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/commitments/e2e_checks.go` around lines 59 -
60, The E2E checks use http.DefaultClient.Do(...) (seen as
http.DefaultClient.Do(httpReq) wrapped by must.Return) which has no timeout;
replace these calls (including the other occurrences) with a bounded request
(either create an http.Client with a Timeout set and call client.Do(httpReq) or
wrap the request context using context.WithTimeout(httpReq.Context(), timeout)
and call http.DefaultClient.Do(reqWithCtx)), update the code paths that use
must.Return to accept the error returned by the timed request, and keep the
existing resp.Body.Close() defer; reference the existing symbols
http.DefaultClient.Do, httpReq, must.Return and resp.Body.Close when making the
change.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cmd/manager/main.go (1)
541-542: ⚡ Quick win | Remove the redundant config re-load; reuse the outer `commitmentsConfig`.
`commitmentsConfig` is already loaded and defaulted at lines 365-366 (outer `main()` scope). The inner `:=` at line 541 shadows it with an identical value, adding an unnecessary config read and a misleading shadow variable.
♻️ Proposed refactor
```diff
 if slices.Contains(mainConfig.EnabledControllers, "committed-resource-reservations-controller") {
 	setupLog.Info("enabling controller", "controller", "committed-resource-reservations-controller")
 	monitor := reservations.NewMonitor(multiclusterClient)
 	metrics.Registry.MustRegister(&monitor)
-	commitmentsConfig := conf.GetConfigOrDie[commitments.Config]()
-	commitmentsConfig.ApplyDefaults()
 	if err := (&commitments.CommitmentReservationController{
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/manager/main.go` around lines 541 - 542, The inner redundant reload is shadowing the outer commitmentsConfig—remove the inner short-declaration (the second "commitmentsConfig := conf.GetConfigOrDie[commitments.Config]() / commitmentsConfig.ApplyDefaults()") and reuse the already-loaded outer variable; simply call commitmentsConfig.ApplyDefaults() (or nothing if already applied) in the inner scope where the duplicate appears so there is no shadowed commitmentsConfig and no extra config read.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/reservations/commitments/config.go`:
- Around line 74-88: The three API feature flags (EnableChangeCommitments,
EnableReportUsage, EnableReportCapacity) must be changed from bool to *bool so
you can distinguish "unset" vs intentionally false; update the APIConfig struct,
DefaultAPIConfig() to return pointer values (e.g., ptr(true) where default is
true), and modify ApplyDefaults() (the code in config.go that currently uses `if
!c.API.EnableX`) to only apply defaults when the pointer is nil (e.g., `if
c.API.EnableX == nil { c.API.EnableX = d.EnableX }`). Also update any callers
(for example in api/handler.go) to safely check dereferenced values or add a
convenience accessor on APIConfig to read the effective boolean. Ensure
WatchTimeout/WatchPollInterval logic remains the same.
---
Nitpick comments:
In `@cmd/manager/main.go`:
- Around line 541-542: The inner redundant reload is shadowing the outer
commitmentsConfig—remove the inner short-declaration (the second
"commitmentsConfig := conf.GetConfigOrDie[commitments.Config]() /
commitmentsConfig.ApplyDefaults()") and reuse the already-loaded outer variable;
simply call commitmentsConfig.ApplyDefaults() (or nothing if already applied) in
the inner scope where the duplicate appears so there is no shadowed
commitmentsConfig and no extra config read.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5aa32de4-0cdb-4357-a10c-3d7e9a0cb7de
📒 Files selected for processing (5)
- cmd/manager/main.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/config.go
- internal/scheduling/reservations/commitments/e2e_checks.go
- internal/scheduling/reservations/commitments/reservation_controller.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/scheduling/reservations/commitments/reservation_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/e2e_checks.go
```diff
-	if c.RequeueIntervalRetry == 0 {
-		c.RequeueIntervalRetry = defaults.RequeueIntervalRetry
-	}
-	if c.RequeueIntervalGracePeriod == 0 {
-		c.RequeueIntervalGracePeriod = defaults.RequeueIntervalGracePeriod
-	}
-	if c.AllocationGracePeriod == 0 {
-		c.AllocationGracePeriod = defaults.AllocationGracePeriod
-	}
-	if c.PipelineDefault == "" {
-		c.PipelineDefault = defaults.PipelineDefault
-	}
+	if !c.API.EnableChangeCommitments {
+		c.API.EnableChangeCommitments = d.EnableChangeCommitments
+	}
+	if !c.API.EnableReportUsage {
+		c.API.EnableReportUsage = d.EnableReportUsage
+	}
+	if !c.API.EnableReportCapacity {
+		c.API.EnableReportCapacity = d.EnableReportCapacity
+	}
+	if c.API.WatchTimeout.Duration == 0 {
+		c.API.WatchTimeout = d.WatchTimeout
+	}
+	if c.API.WatchPollInterval.Duration == 0 {
+		c.API.WatchPollInterval = d.WatchPollInterval
+	}
```
Operators cannot explicitly disable the API feature flags — use *bool to distinguish "unset" from "intentionally false".
The if !c.API.EnableX guard is semantically equivalent to if c.API.EnableX == false, which applies to both the zero-value from unmarshalling an omitted field and an operator-supplied enableChangeCommitments: false. Because DefaultAPIConfig() always returns true, any explicit false in the configuration file is silently overridden to true. The endpoints can never be disabled via configuration.
The standard Go idiom for optional booleans is *bool:
🛠️ Proposed fix using pointer booleans
Change the three flag fields in APIConfig to *bool:
```diff
-EnableChangeCommitments bool `json:"enableChangeCommitments"`
-EnableReportUsage bool `json:"enableReportUsage"`
-EnableReportCapacity bool `json:"enableReportCapacity"`
+EnableChangeCommitments *bool `json:"enableChangeCommitments"`
+EnableReportUsage *bool `json:"enableReportUsage"`
+EnableReportCapacity *bool `json:"enableReportCapacity"`
```
Update DefaultAPIConfig():
```diff
+	t := true
 	return APIConfig{
-		EnableChangeCommitments: true,
-		EnableReportUsage: true,
-		EnableReportCapacity: true,
+		EnableChangeCommitments: &t,
+		EnableReportUsage: &t,
+		EnableReportCapacity: &t,
 		WatchTimeout: metav1.Duration{Duration: 10 * time.Second},
 		WatchPollInterval: metav1.Duration{Duration: 500 * time.Millisecond},
 	}
```
Update ApplyDefaults():
```diff
-	if !c.API.EnableChangeCommitments {
-		c.API.EnableChangeCommitments = d.EnableChangeCommitments
-	}
-	if !c.API.EnableReportUsage {
-		c.API.EnableReportUsage = d.EnableReportUsage
-	}
-	if !c.API.EnableReportCapacity {
-		c.API.EnableReportCapacity = d.EnableReportCapacity
-	}
+	if c.API.EnableChangeCommitments == nil {
+		c.API.EnableChangeCommitments = d.EnableChangeCommitments
+	}
+	if c.API.EnableReportUsage == nil {
+		c.API.EnableReportUsage = d.EnableReportUsage
+	}
+	if c.API.EnableReportCapacity == nil {
+		c.API.EnableReportCapacity = d.EnableReportCapacity
+	}
```
Callers checking the flags (e.g., in api/handler.go) would dereference safely:
```go
if api.config.EnableChangeCommitments != nil && *api.config.EnableChangeCommitments { ... }
```
Or add a convenience accessor on APIConfig.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/commitments/config.go` around lines 74 - 88,
The three API feature flags (EnableChangeCommitments, EnableReportUsage,
EnableReportCapacity) must be changed from bool to *bool so you can distinguish
"unset" vs intentionally false; update the APIConfig struct, DefaultAPIConfig()
to return pointer values (e.g., ptr(true) where default is true), and modify
ApplyDefaults() (the code in config.go that currently uses `if !c.API.EnableX`)
to only apply defaults when the pointer is nil (e.g., `if c.API.EnableX == nil {
c.API.EnableX = d.EnableX }`). Also update any callers (for example in
api/handler.go) to safely check dereferenced values or add a convenience
accessor on APIConfig to read the effective boolean. Ensure
WatchTimeout/WatchPollInterval logic remains the same.
Test Coverage Report
Test Coverage 📊: 69.1%
Introduces the CommittedResource CRD as the durable contract between the LIQUID API and the reservation system, replacing direct Reservation writes with a controller-owned lifecycle. Accepting or rejecting a commitment is now a write to a CommittedResource CRD followed by watching its status — the API layer no longer touches Reservation CRDs directly.