Refactor health checks #1601

Open
ArangoGutierrez wants to merge 10 commits into NVIDIA:main from ArangoGutierrez:healthcheck
Conversation

@ArangoGutierrez
Collaborator

No description provided.


Copilot AI left a comment

Pull request overview

Refactors the device health-check lifecycle to use context.Context cancellation instead of a stop channel, and adds NVML mock infrastructure/tests to validate health-check behavior.

Changes:

  • Switch ResourceManager.CheckHealth to CheckHealth(ctx context.Context, unhealthy chan<- *Device) and propagate through plugin + managers.
  • Refactor NVML health monitoring into a provider (nvmlHealthProvider) and add tests covering XID handling and ERROR_NOT_SUPPORTED behavior.
  • Vendor in NVML mock packages (including a dgxa100 mock server) to support deterministic health-check testing.

Reviewed changes

Copilot reviewed 7 out of 20 changed files in this pull request and generated 1 comment.

File Description
internal/rm/rm.go Updates ResourceManager interface to use context.Context for health checks.
internal/rm/nvml_manager.go Adapts NVML RM CheckHealth to the new context-based signature.
internal/rm/tegra_manager.go Updates Tegra RM CheckHealth signature (still effectively disabled).
internal/rm/health.go Refactors health-check implementation into nvmlHealthProvider and context-driven loop.
internal/rm/health_test.go Adds tests using NVML mocks to validate event-driven health behavior.
internal/rm/rm_mock.go Regenerates moq mock to match the new CheckHealth signature.
internal/plugin/server.go Replaces stop channel with cancelable context + waitgroup for health goroutine lifecycle.
vendor/modules.txt Adds vendored paths for newly included NVML mock packages.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/*.go Adds generated moq mocks for NVML interfaces used in tests.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/mig-profile.go Adds MIG profile/placement data for the dgxa100 mock server.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/dgxa100.go Adds dgxa100 NVML mock server used by health-check tests.


Comment thread internal/plugin/server.go Outdated
Member

@elezar elezar left a comment


Thanks @ArangoGutierrez. I have a minor question regarding the cleaning up of the context.

Comment thread internal/plugin/server.go Outdated
Comment on lines +279 to +281
// Capture references at start to avoid race with cleanup() which may nil these fields.
healthCtx := plugin.healthCtx
health := plugin.health
Member


Do we need to add synchronisation to ensure that these are actually initialised by the time that ListAndWatch is called, or is it guaranteed that this will be called only after the context has been initialized. Looking at the cleanup code, it is not clear why we would set plugin.healthCtx to nil there.

Note that if we were to pass the context in to the server construction, then we would not have to do this since the lifetime would be controlled at the caller and we would not set the context to nil in plugin cleanup.

Collaborator Author


Addressed: healthCtx and healthCancel are now created in devicePluginForResource().

Comment thread internal/plugin/server.go Outdated
Comment on lines +127 to +128
plugin.healthCtx = nil
plugin.healthCancel = nil
Member


Why are we setting these to nil now?

Collaborator Author

@ArangoGutierrez ArangoGutierrez Jan 28, 2026


Removed. The fields are now preserved (not niled) to support restart.

Comment thread internal/plugin/server.go Outdated
Comment on lines 165 to 167
if err != nil && !errors.Is(err, context.Canceled) {
klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
Member


What about logging the happy path case? (Also, in which cases will we return context.Canceled?)

Suggested change
if err != nil && !errors.Is(err, context.Canceled) {
klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
switch {
case err == nil:
fallthrough
case errors.Is(err, context.Canceled):
klog.Infof("Stopping health check")
case err != nil:
klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
}

Collaborator Author


Addressed

ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Initialize healthCtx and healthCancel in devicePluginForResource()
instead of in initialize() to eliminate race condition where
ListAndWatch() might access healthCtx before initialization completes.

This addresses Elezar's concern about synchronization guarantees
and follows Go best practices for immutable initialization.

Refs: NVIDIA#1601
Task: 1/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Remove healthCtx and healthCancel initialization from initialize()
since they are now initialized at construction time in
devicePluginForResource().

This eliminates redundant initialization and ensures context
lifetime is tied to the plugin instance rather than the
start/stop cycle.

Refs: NVIDIA#1601
Task: 2/6
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Recreate healthCtx and healthCancel in cleanup() after cancellation
to support plugin restart. Remove nil assignments for these fields
as they need to persist across restart cycles.

This addresses Elezar's concern NVIDIA#2 about why we nil these fields -
we no longer do. The context is recreated fresh for each restart,
ensuring health checks work correctly when the plugin is restarted.

Fixes plugin restart blocker identified in architecture review.

Refs: NVIDIA#1601
Task: 3/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Close the health channel in cleanup() before niling to prevent
panics in ListAndWatch() if cleanup happens during channel read.

The WaitGroup ensures the health goroutine has completed before
we close the channel, making this operation safe.

Fixes critical blocker: health channel was never closed, leading
to potential panics and resource leaks.

Refs: NVIDIA#1601
Task: 4/6
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Add ok check when receiving from health channel to gracefully
handle channel closure during cleanup. This prevents potential
panics and ensures clean shutdown when the plugin stops.

Reading from a closed channel returns zero value and ok=false,
which we now check and return gracefully.

Refs: NVIDIA#1601
Task: 5/6
Comment thread internal/plugin/server.go Outdated
Comment on lines +67 to +71
// healthCtx and healthCancel control the health check goroutine lifecycle.
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthCtx context.Context
healthCancel context.CancelFunc
healthWg sync.WaitGroup
Member


Suggested change
// healthCtx and healthCancel control the health check goroutine lifecycle.
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthCtx context.Context
healthCancel context.CancelFunc
healthWg sync.WaitGroup
// healthCtx and healthCancel control the health check goroutine lifecycle.
healthCtx context.Context
healthCancel context.CancelFunc
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthWg sync.WaitGroup

Collaborator Author


Thanks, will implement.

Comment thread internal/plugin/server.go
// nvidiaDevicePlugin implements the Kubernetes device plugin API
type nvidiaDevicePlugin struct {
pluginapi.UnimplementedDevicePluginServer
ctx context.Context
Member


Question: Why do we need a second context?

Collaborator Author


So we can cancel the healthchecks independently.

Comment thread internal/plugin/server.go Outdated
Comment on lines +131 to +133
// Recreate context for potential plugin restart. The same plugin instance
// may be restarted via Start() after Stop(), so we need a fresh context.
plugin.healthCtx, plugin.healthCancel = context.WithCancel(plugin.ctx)
Member


When are we ever setting these to nil now? This should ONLY be done in the constructor.

Collaborator Author


Addressed by da507ef


Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 20 changed files in this pull request and generated 4 comments.



Comment thread internal/plugin/server.go
return fmt.Errorf("error waiting for MPS daemon: %w", err)
}

plugin.health = make(chan *rm.Device)

Copilot AI Feb 12, 2026


plugin.health is created as an unbuffered channel and the health-check goroutine can send to it even when no ListAndWatch stream is active. Since Stop() now cancels the context and then waits for the health goroutine to exit, a blocked send on plugin.health can deadlock Stop() (context cancellation won’t unblock a channel send). Consider making the channel buffered (e.g., sized to number of devices) and/or ensuring all sends in health-check code are non-blocking/select on ctx.Done().

Suggested change
plugin.health = make(chan *rm.Device)
// Use a buffered channel for health notifications to avoid blocking when
// no ListAndWatch stream is currently consuming from plugin.health.
healthBufSize := len(plugin.rm.Devices())
if healthBufSize < 1 {
healthBufSize = 1
}
plugin.health = make(chan *rm.Device, healthBufSize)

Comment thread internal/plugin/server.go
Comment on lines 125 to 129
err := plugin.Serve()
if err != nil {
klog.Errorf("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)
plugin.cleanup()
return err
}

Copilot AI Feb 12, 2026


On Serve() failure, Start() returns immediately after creating plugin.health and plugin.healthCtx but does not cancel/close them. If the same plugin instance is retried or kept around, this leaks resources and leaves fields in a partially-initialized state. Consider deferring a cleanup (cancel + close channel) on the Serve() error path.

Comment thread internal/rm/health.go
Comment on lines 99 to 101
deviceIDToGiMap[d.ID] = gi
deviceIDToCiMap[d.ID] = ci
parentToDeviceMap[uuid] = d

Copilot AI Feb 12, 2026


parentToDeviceMap is keyed by uuid returned from getDevicePlacement(), which is the parent UUID for MIG devices. If multiple MIG devices share the same parent GPU, this assignment overwrites earlier entries, so events for that parent can only ever map to one MIG device. To correctly handle multiple MIG instances per parent, use a mapping that can represent multiple devices per parent (e.g., parent UUID -> list, or parent UUID -> (GI,CI)->device).

Collaborator Author


Good catch — this is a pre-existing issue. When multiple MIG devices share the same parent GPU, only the last one gets stored. I'll track this as a separate fix (needs its own tests for the multi-MIG case) in a follow-up PR.

Contributor


Yes, sounds good.

Comment thread internal/rm/health.go
Comment on lines 157 to 162
if ret != nvml.SUCCESS {
	klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
	for _, d := range h.devices {
		h.unhealthy <- d
	}
	continue

Copilot AI Feb 12, 2026


When eventSet.Wait() returns a non-timeout error, the loop marks all devices unhealthy and continues. If the error is persistent, this can spin forever and repeatedly attempt blocking sends to h.unhealthy, preventing shutdown if the receiver isn’t draining fast enough. Consider returning the error (or backing off) and make unhealthy notifications context-aware (select on ctx.Done()) to avoid blocking indefinitely.

Collaborator Author


This is pre-existing behavior — the busy loop on persistent NVML errors existed before this refactor. The new code does improve the situation: markUnhealthy is now context-aware, so the loop can be interrupted by cancellation even during sends. Adding backoff or error-count limits would be a good follow-up improvement.

@guptaNswati
Contributor

Thank you. I didn't get a chance to test it. Can you please paste the test logs?

Unfortunately not; I didn't save the logs, and the cluster I used is down.

I know these are straightforward changes and you must have tested them very well. I feel more comfortable when I see the test logs, so I will try to set up a cluster next week and see if we can test them.

@ArangoGutierrez
Collaborator Author

@guptaNswati

I0216 18:53:08.797944  1 main.go:250] "Starting NVIDIA Device Plugin" version=v0.19.0-dev commit: d6c787ca9
I0216 18:53:08.797993  1 main.go:253] Starting FS watcher for /var/lib/kubelet/device-plugins
I0216 18:53:08.798043  1 main.go:260] Starting OS watcher.
I0216 18:53:08.798413  1 main.go:275] Starting Plugins.
I0216 18:53:08.798440  1 main.go:332] Loading configuration.
I0216 18:53:08.837087  1 server.go:203] Starting GRPC server for 'nvidia.com/gpu'
I0216 18:53:08.837916  1 server.go:130] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0216 18:53:08.840309  1 server.go:137] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0216 18:53:08.843525  1 health.go:76] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]

@guptaNswati
Contributor

@guptaNswati

I0216 18:53:08.797944  1 main.go:250] "Starting NVIDIA Device Plugin" version=v0.19.0-dev commit: d6c787ca9
I0216 18:53:08.797993  1 main.go:253] Starting FS watcher for /var/lib/kubelet/device-plugins
I0216 18:53:08.798043  1 main.go:260] Starting OS watcher.
I0216 18:53:08.798413  1 main.go:275] Starting Plugins.
I0216 18:53:08.798440  1 main.go:332] Loading configuration.
I0216 18:53:08.837087  1 server.go:203] Starting GRPC server for 'nvidia.com/gpu'
I0216 18:53:08.837916  1 server.go:130] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0216 18:53:08.840309  1 server.go:137] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0216 18:53:08.843525  1 health.go:76] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]

Nice, thank you. So for the device plugin we always log which XIDs are ignored; we didn't do that for DRA. Also, can you please share your mock test log triggering an XID error and the action taken?

Is this a valid concern? #1601 (comment)

I have a few other minor comments that I would like answered.

We are very close to merging this. Thank you @ArangoGutierrez

@ArangoGutierrez
Collaborator Author

@guptaNswati I have addressed all your comments
cc @rajatchopra

@guptaNswati
Contributor

@ArangoGutierrez this does seem like a legit bug: #1601 (comment). parentToDeviceMap overwrites entries when multiple MIG devices share a parent GPU. I see it's from the old code, but since we are refactoring, let's fix it. This is what I did for DRA: type devicePlacementMap map[string]map[uint32]map[uint32]*Device
https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/cmd/gpu-kubelet-plugin/device_health.go#L36

Is it possible to add a test for MIG devices?
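The nested-map shape referenced above (mirroring the linked DRA driver code) can be sketched as follows; the `Device` type and `add` helper are illustrative, not code from either repository:

```go
package main

import "fmt"

// Device is a stand-in for the plugin's rm.Device type.
type Device struct{ ID string }

// devicePlacementMap maps parent GPU UUID -> GPU instance ID -> compute
// instance ID -> device, so multiple MIG devices sharing a parent GPU no
// longer overwrite each other.
type devicePlacementMap map[string]map[uint32]map[uint32]*Device

// add inserts a device under its parent UUID and (GI, CI) placement,
// allocating the inner maps on first use.
func (m devicePlacementMap) add(parentUUID string, gi, ci uint32, d *Device) {
	if m[parentUUID] == nil {
		m[parentUUID] = make(map[uint32]map[uint32]*Device)
	}
	if m[parentUUID][gi] == nil {
		m[parentUUID][gi] = make(map[uint32]*Device)
	}
	m[parentUUID][gi][ci] = d
}

func main() {
	m := make(devicePlacementMap)
	// Two MIG devices on the same parent GPU: both survive.
	m.add("GPU-parent", 0, 0, &Device{ID: "MIG-a"})
	m.add("GPU-parent", 1, 0, &Device{ID: "MIG-b"})
	fmt.Println(len(m["GPU-parent"]))        // prints: 2
	fmt.Println(m["GPU-parent"][1][0].ID)    // prints: MIG-b
}
```

With the flat `map[string]*Device`, the second `add` would have replaced the first; the (GI, CI) levels are what make events on a shared parent resolvable to the right MIG device.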

Comment thread internal/plugin/server.go
klog.Errorf("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)
plugin.healthCancel()
return err
}
Contributor


@ArangoGutierrez (nit) though healthCancel() properly cancels things, should these still be set to nil?

plugin.health = nil
plugin.server = nil

Collaborator Author


Good point. Now the error path also calls plugin.server.Stop() and nils both plugin.server and plugin.health to leave the struct in a clean state for potential retry. See latest commit.

ArangoGutierrez and others added 10 commits March 19, 2026 21:31
Add vendored NVML mock packages from go-nvml including the dgxa100 mock
server. These mocks enable deterministic unit testing of NVML health
check behavior without requiring actual GPU hardware.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Refactor health check implementation:

- Extract health monitoring logic into nvmlHealthProvider struct with
  registerDeviceEvents() and runEventMonitor() methods, making health
  checks testable in isolation
- Migrate CheckHealth interface from stop channel to context.Context for
  idiomatic cancellation
- Extract device placement logic into withDevicePlacements struct to
  decouple from resource manager
- Clarify log message for event registration failure to explain why
  device is marked unhealthy (aligns with DRA driver convention)
- Add unit tests for XID event handling and ERROR_NOT_SUPPORTED
  regression using NVML mock infrastructure

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace the stop channel with context.Context cancellation and
sync.WaitGroup for health check goroutine lifecycle management:

- Initialize health channel and healthCtx before Serve() to eliminate
  race where kubelet could call ListAndWatch before fields are set
- Use sync.WaitGroup to ensure health goroutine completes before
  channel close in Stop()
- Add nil guards in Stop() to handle partial Start() failure safely
- Add structured error handling for health check completion (success,
  canceled, error cases)
- Add debug logging in ListAndWatch for context cancellation and
  channel closure
- Fix updateResponseForMPS receiver from value to pointer

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
…adlock

All sends to the unhealthy channel now use a select with ctx.Done() to
prevent blocking when no ListAndWatch stream is consuming. This fixes:
- Deadlock in Stop() when health goroutine is blocked on channel send
- Potential send-on-closed-channel panic during shutdown race

Remove close(plugin.health) from Stop() since context cancellation is
now the sole shutdown mechanism. Simplify ListAndWatch accordingly.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
If Serve() fails, healthCtx was leaked because Start() returned without
canceling it. Now explicitly cancel to avoid leaking the context goroutine.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Address reviewer feedback to include 'Device' in the name and use
'Checker' to better reflect the type's responsibility of running
health checks rather than providing health data.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Use 'p' (for placements) instead of 'r' (for resource manager) since
the receiver is no longer on nvmlResourceManager.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Explain why device placement logic is decoupled from nvmlResourceManager:
to allow reuse across different health-check providers for NVsentinel and
Device-API integration.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Address reviewer question about why the NVML interface is duplicated
on the health checker rather than accessed via nvmlResourceManager.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
On Serve() error, healthCancel was called but plugin.health and
plugin.server were left initialized. This could leak the server and
leave the plugin in a partially-initialized state if Start() is retried.
Now explicitly stop the server and nil both fields on the error path.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez
Collaborator Author

@guptaNswati comments addressed, PTAL

@guptaNswati
Contributor

guptaNswati commented Mar 19, 2026

@guptaNswati comments addressed, PTAL

Thank you @ArangoGutierrez

LGTM. Approved.

Please open an issue for the follow-up, especially this one: #1601 (comment)

@guptaNswati
Contributor

I don't seem to have approval rights.

cc @tariq1890 @shivamerla @cdesiniotis, can I get access rights?
