feat: track VM placements in reservations and classify no-host-found#847
feat: track VM placements in reservations and classify no-host-found#847mblos wants to merge 6 commits into
Conversation
📝 WalkthroughWalkthroughAdds committed-resource allocation recording and no-host-found observability to Nova scheduling: slot selection, remaining-capacity calculation, classification (A/B/C/D), Prometheus counter, controller wiring to record allocations and log no-host-found, and unit tests. ChangesCR Allocation Recording and No-Host-Found Observability
Sequence Diagram(s)sequenceDiagram
participant api as ProcessNewDecisionFromAPI
participant process as process
participant history as upsertHistory
participant cr as recordCRAllocation
participant nohf as logNoHostFound
api->>process: call process (returns request, error)
process-->>api: request, error
alt Success
api->>history: upsertHistory with request
api->>api: Check isUserVMPlacement intent
alt User VM
api->>cr: recordCRAllocation (best-effort)
end
alt No target host
api->>nohf: logNoHostFound (classify and increment)
end
else Error
api->>api: Handle error
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/scheduling/nova/cr_allocation.go`:
- Around line 34-38: The code currently treats any error from
resolveFlavorGroup(...) as "flavor not in any group" and returns silently;
change this by distinguishing a genuine "not found" result from transient
errors: use errors.Is(err, <appropriate not-found sentinel>) or check for a
typed NotFound error after calling resolveFlavorGroup(ctx, flavorName) and only
log the PAYG placement / return for the not-found case, while for other errors
log them as errors (including err) and propagate or handle them (e.g., return
the error or retry) instead of classifying as PAYG; apply the same fix to the
other occurrence around the resolveFlavorGroup call at the later block (lines
176-185).
- Around line 118-156: pickReservationSlot currently only enforces memory and
treats CPU as a tiebreaker, allowing placements into CPU-exhausted slots; change
pickReservationSlot to accept the VM CPU requirement (e.g., vmCPU int64 or
milli-CPUs) and compute remCPU as you already do, then skip any candidate where
remCPU < vmCPU (i.e., enforce CPU as a hard constraint before considering
memory/tiebreakers). Update all call sites that pass only vmMemoryBytes to
provide the VM CPU requirement, and keep the existing tiebreaker logic (remMem,
remCPU, res.Name) unchanged; use symbols from the diff like pickReservationSlot,
reservationRemainingMemory, res.Spec.Resources, cr.Allocations and remCPU to
locate and update the checks.
In `@internal/scheduling/nova/cr_metrics.go`:
- Around line 77-82: The current short-circuit after calling
c.resolveFlavorGroup(ctx, flavorName) treats every error as a PAYG "no host"
case and returns, losing real lookup failures; change the handling so you only
short-circuit for the specific PAYG sentinel/indicator returned by
resolveFlavorGroup (e.g., compare against the sentinel error or check an
exported method/flag on the error), and for any other error path log the actual
error (include err) and either propagate or record a failing metric. Update the
block around the call to resolveFlavorGroup (where flavorGroupName, _, err :=
c.resolveFlavorGroup(...) and the log.V(1).Info("no-host-found: PAYG...") are)
to distinguish PAYG vs lookup errors and ensure non-PAYG errors are surfaced in
logs/metrics instead of being silently dropped.
In `@internal/scheduling/nova/filter_weigher_pipeline_controller.go`:
- Around line 81-85: The early read of c.PipelineConfigs using
decision.Spec.PipelineRef.Name races with concurrent updates because it happens
before acquiring processMu; fix by protecting the access with the same mutex
(either grab processMu.Lock()/Unlock() around the lookup or use
processMu.RLock()/RUnlock() if a read lock is available) so that the map lookup
for c.PipelineConfigs is done while holding the lock and the map cannot be
mutated concurrently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6caa6846-9c68-49eb-aba9-4c0590ccfa33
📒 Files selected for processing (6)
cmd/manager/main.gointernal/scheduling/nova/cr_allocation.gointernal/scheduling/nova/cr_metrics.gointernal/scheduling/nova/cr_metrics_test.gointernal/scheduling/nova/filter_weigher_pipeline_controller.gointernal/scheduling/nova/filter_weigher_pipeline_controller_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/scheduling/nova/cr_allocation.go (1)
188-199:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
resolveFlavorGroupstill masks allFindFlavorInGroupserrors as PAYG.At Line 195-196, every error returned by
FindFlavorInGroupsis converted toerrFlavorNotInGroup. The controller now correctly distinguishes viaerrors.Is(Line 37), but only because this helper always reports "not in group" — including for any transient/typed errorsFindFlavorInGroupsmay emit (e.g., knowledge unready, lookup failure). This is the same root cause flagged previously; the controller-side error check is necessary but not sufficient if the helper itself flattens errors.Please verify which error types
FindFlavorInGroupscan return and whether a sentinel/typed "not found" exists:#!/bin/bash # Locate FindFlavorInGroups implementation and inspect its return paths. ast-grep --pattern $'func FindFlavorInGroups($_, $_) ($_, $_, $_) { $$$ }' echo "---" # Fallback if signature differs (e.g., generics or different arity). rg -nP -C5 '\bfunc\s+FindFlavorInGroups\s*\(' echo "---" # Find error sentinels exported from the reservations package. rg -nP -C2 '^\s*var\s+Err\w+\s*=' --type=go -g 'internal/scheduling/reservations/**'🛡️ Proposed fix to preserve non-"not found" errors
func (c *FilterWeigherPipelineController) resolveFlavorGroup(ctx context.Context, flavorName string) (string, *compute.FlavorInGroup, error) { fgClient := reservations.FlavorGroupKnowledgeClient{Client: c.Client} flavorGroups, err := fgClient.GetAllFlavorGroups(ctx, nil) if err != nil { return "", nil, err } groupName, flavor, err := reservations.FindFlavorInGroups(flavorName, flavorGroups) if err != nil { - return "", nil, errFlavorNotInGroup + // Map only the "not found" sentinel to errFlavorNotInGroup; propagate other errors verbatim. + if errors.Is(err, reservations.ErrFlavorNotFound) { // adjust to actual sentinel name + return "", nil, errFlavorNotInGroup + } + return "", nil, err } return groupName, flavor, nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/nova/cr_allocation.go` around lines 188 - 199, resolveFlavorGroup currently replaces every error from reservations.FindFlavorInGroups with errFlavorNotInGroup, masking transient/typed errors; change the error handling so you only map the specific "flavor not in group" sentinel to errFlavorNotInGroup and otherwise return the original error from FindFlavorInGroups. Concretely, call reservations.FindFlavorInGroups(...), and if err != nil use errors.Is(err, reservations.<NotFoundSentinelName>) (or compare to the package's exported sentinel) to detect the not-found case and return errFlavorNotInGroup, otherwise return the original err; keep the rest of the function (fgClient/GetAllFlavorGroups and the success return) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@internal/scheduling/nova/cr_allocation.go`:
- Around line 188-199: resolveFlavorGroup currently replaces every error from
reservations.FindFlavorInGroups with errFlavorNotInGroup, masking
transient/typed errors; change the error handling so you only map the specific
"flavor not in group" sentinel to errFlavorNotInGroup and otherwise return the
original error from FindFlavorInGroups. Concretely, call
reservations.FindFlavorInGroups(...), and if err != nil use errors.Is(err,
reservations.<NotFoundSentinelName>) (or compare to the package's exported
sentinel) to detect the not-found case and return errFlavorNotInGroup, otherwise
return the original err; keep the rest of the function
(fgClient/GetAllFlavorGroups and the success return) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 29da7511-aae0-424f-b457-997a84a63636
📒 Files selected for processing (4)
internal/scheduling/nova/cr_allocation.gointernal/scheduling/nova/cr_metrics.gointernal/scheduling/nova/filter_weigher_pipeline_controller.gointernal/scheduling/nova/filter_weigher_pipeline_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/scheduling/nova/cr_metrics.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/scheduling/nova/cr_metrics_test.go`:
- Around line 379-386: The test's empty expectedCase branch only checks the
counter totals for the current flavorGroup ("m1") and can miss increments under
other flavor-group labels; update the assertion to sum/test counter values
across all possible flavor-group label values (including the empty/unknown
flavor group used elsewhere in the test) rather than only the local flavorGroup
variable by iterating over the set of flavor groups and calling
testutil.ToFloat64(counter.WithLabelValues("<label>", ...,
string(api.CreateIntent))) for each label (using the same label names
"A","B","C","D" and the same metric label order) and assert the grand total is
zero.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 62da9fad-ed34-4b83-a1eb-de9b82f5ae2d
📒 Files selected for processing (5)
cmd/manager/main.gohelm/bundles/cortex-nova/values.yamlinternal/scheduling/nova/cr_metrics_test.gointernal/scheduling/nova/filter_weigher_pipeline_controller.gopkg/conf/features.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/scheduling/nova/filter_weigher_pipeline_controller.go
| if tt.expectedCase == "" { | ||
| total := testutil.ToFloat64(counter.WithLabelValues("A", flavorGroup, string(api.CreateIntent))) + | ||
| testutil.ToFloat64(counter.WithLabelValues("B", flavorGroup, string(api.CreateIntent))) + | ||
| testutil.ToFloat64(counter.WithLabelValues("C", flavorGroup, string(api.CreateIntent))) + | ||
| testutil.ToFloat64(counter.WithLabelValues("D", flavorGroup, string(api.CreateIntent))) | ||
| if total != 0 { | ||
| t.Errorf("expected no counter increment, got total %.0f", total) | ||
| } |
There was a problem hiding this comment.
Assert zero increments across all label combinations, not only one flavor group.
The "expectedCase == \"\"" branch only checks flavorGroup == "m1". A counter increment under another flavor-group label (e.g., empty/unknown) would be missed.
✅ Suggested fix
if tt.expectedCase == "" {
- total := testutil.ToFloat64(counter.WithLabelValues("A", flavorGroup, string(api.CreateIntent))) +
- testutil.ToFloat64(counter.WithLabelValues("B", flavorGroup, string(api.CreateIntent))) +
- testutil.ToFloat64(counter.WithLabelValues("C", flavorGroup, string(api.CreateIntent))) +
- testutil.ToFloat64(counter.WithLabelValues("D", flavorGroup, string(api.CreateIntent)))
+ mfs, err := reg.Gather()
+ if err != nil {
+ t.Fatalf("gather metrics: %v", err)
+ }
+ var total float64
+ for _, mf := range mfs {
+ for _, m := range mf.GetMetric() {
+ total += m.GetCounter().GetValue()
+ }
+ }
if total != 0 {
t.Errorf("expected no counter increment, got total %.0f", total)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if tt.expectedCase == "" { | |
| total := testutil.ToFloat64(counter.WithLabelValues("A", flavorGroup, string(api.CreateIntent))) + | |
| testutil.ToFloat64(counter.WithLabelValues("B", flavorGroup, string(api.CreateIntent))) + | |
| testutil.ToFloat64(counter.WithLabelValues("C", flavorGroup, string(api.CreateIntent))) + | |
| testutil.ToFloat64(counter.WithLabelValues("D", flavorGroup, string(api.CreateIntent))) | |
| if total != 0 { | |
| t.Errorf("expected no counter increment, got total %.0f", total) | |
| } | |
| if tt.expectedCase == "" { | |
| mfs, err := reg.Gather() | |
| if err != nil { | |
| t.Fatalf("gather metrics: %v", err) | |
| } | |
| var total float64 | |
| for _, mf := range mfs { | |
| for _, m := range mf.GetMetric() { | |
| total += m.GetCounter().GetValue() | |
| } | |
| } | |
| if total != 0 { | |
| t.Errorf("expected no counter increment, got total %.0f", total) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/scheduling/nova/cr_metrics_test.go` around lines 379 - 386, The
test's empty expectedCase branch only checks the counter totals for the current
flavorGroup ("m1") and can miss increments under other flavor-group labels;
update the assertion to sum/test counter values across all possible flavor-group
label values (including the empty/unknown flavor group used elsewhere in the
test) rather than only the local flavorGroup variable by iterating over the set
of flavor groups and calling
testutil.ToFloat64(counter.WithLabelValues("<label>", ...,
string(api.CreateIntent))) for each label (using the same label names
"A","B","C","D" and the same metric label order) and assert the grand total is
zero.
Test Coverage ReportTest Coverage 📊: 69.4% |
Adds two pieces of observability for committed-resource scheduling outcomes.