Refactor health checks #1601

Open
ArangoGutierrez wants to merge 10 commits into NVIDIA:main from ArangoGutierrez:healthcheck
Conversation

@ArangoGutierrez
Collaborator

No description provided.


Copilot AI left a comment

Pull request overview

Refactors the device health-check lifecycle to use context.Context cancellation instead of a stop channel, and adds NVML mock infrastructure/tests to validate health-check behavior.

Changes:

  • Switch ResourceManager.CheckHealth to CheckHealth(ctx context.Context, unhealthy chan<- *Device) and propagate through plugin + managers.
  • Refactor NVML health monitoring into a provider (nvmlHealthProvider) and add tests covering XID handling and ERROR_NOT_SUPPORTED behavior.
  • Vendor in NVML mock packages (including a dgxa100 mock server) to support deterministic health-check testing.

Reviewed changes

Copilot reviewed 7 out of 20 changed files in this pull request and generated 1 comment.

File Description
internal/rm/rm.go Updates ResourceManager interface to use context.Context for health checks.
internal/rm/nvml_manager.go Adapts NVML RM CheckHealth to the new context-based signature.
internal/rm/tegra_manager.go Updates Tegra RM CheckHealth signature (still effectively disabled).
internal/rm/health.go Refactors health-check implementation into nvmlHealthProvider and context-driven loop.
internal/rm/health_test.go Adds tests using NVML mocks to validate event-driven health behavior.
internal/rm/rm_mock.go Regenerates moq mock to match the new CheckHealth signature.
internal/plugin/server.go Replaces stop channel with cancelable context + waitgroup for health goroutine lifecycle.
vendor/modules.txt Adds vendored paths for newly included NVML mock packages.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/*.go Adds generated moq mocks for NVML interfaces used in tests.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/mig-profile.go Adds MIG profile/placement data for the dgxa100 mock server.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/dgxa100.go Adds dgxa100 NVML mock server used by health-check tests.


Comment thread internal/plugin/server.go Outdated
Member

@elezar elezar left a comment


Thanks @ArangoGutierrez. I have a minor question regarding the cleaning up of the context.

Comment thread internal/plugin/server.go Outdated
Comment on lines +279 to +281
// Capture references at start to avoid race with cleanup() which may nil these fields.
healthCtx := plugin.healthCtx
health := plugin.health
Member


Do we need to add synchronisation to ensure that these are actually initialised by the time that ListAndWatch is called, or is it guaranteed that this will be called only after the context has been initialized. Looking at the cleanup code, it is not clear why we would set plugin.healthCtx to nil there.

Note that if we were to pass the context in to the server construction, then we would not have to do this since the lifetime would be controlled at the caller and we would not set the context to nil in plugin cleanup.

Collaborator Author


Addressed: healthCtx and healthCancel are now created in devicePluginForResource().

Comment thread internal/plugin/server.go Outdated
Comment on lines +127 to +128
plugin.healthCtx = nil
plugin.healthCancel = nil
Member


Why are we setting these to nil now?

Collaborator Author

@ArangoGutierrez ArangoGutierrez Jan 28, 2026


Removed. The fields are now preserved (not niled) to support restart.

Comment thread internal/plugin/server.go Outdated
Comment on lines 165 to 167
if err != nil && !errors.Is(err, context.Canceled) {
klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
Member


What about logging the happy path case? (Also, in which cases will we return context.Canceled?)

Suggested change
if err != nil && !errors.Is(err, context.Canceled) {
klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
switch {
case err == nil:
fallthrough
case errors.Is(err, context.Canceled):
klog.Infof("Stopping health check")
case err != nil:
klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
}

Collaborator Author


Addressed

ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Initialize healthCtx and healthCancel in devicePluginForResource()
instead of in initialize() to eliminate race condition where
ListAndWatch() might access healthCtx before initialization completes.

This addresses Elezar's concern about synchronization guarantees
and follows Go best practices for immutable initialization.

Refs: NVIDIA#1601
Task: 1/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Remove healthCtx and healthCancel initialization from initialize()
since they are now initialized at construction time in
devicePluginForResource().

This eliminates redundant initialization and ensures context
lifetime is tied to the plugin instance rather than the
start/stop cycle.

Refs: NVIDIA#1601
Task: 2/6
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Recreate healthCtx and healthCancel in cleanup() after cancellation
to support plugin restart. Remove nil assignments for these fields
as they need to persist across restart cycles.

This addresses Elezar's concern NVIDIA#2 about why we nil these fields -
we no longer do. The context is recreated fresh for each restart,
ensuring health checks work correctly when the plugin is restarted.

Fixes plugin restart blocker identified in architecture review.

Refs: NVIDIA#1601
Task: 3/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Close the health channel in cleanup() before niling to prevent
panics in ListAndWatch() if cleanup happens during channel read.

The WaitGroup ensures the health goroutine has completed before
we close the channel, making this operation safe.

Fixes critical blocker: health channel was never closed, leading
to potential panics and resource leaks.

Refs: NVIDIA#1601
Task: 4/6
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Add ok check when receiving from health channel to gracefully
handle channel closure during cleanup. This prevents potential
panics and ensures clean shutdown when the plugin stops.

Reading from a closed channel returns zero value and ok=false,
which we now check and return gracefully.

Refs: NVIDIA#1601
Task: 5/6
Comment thread internal/plugin/server.go Outdated
Comment on lines +67 to +71
// healthCtx and healthCancel control the health check goroutine lifecycle.
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthCtx context.Context
healthCancel context.CancelFunc
healthWg sync.WaitGroup
Member


Suggested change
// healthCtx and healthCancel control the health check goroutine lifecycle.
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthCtx context.Context
healthCancel context.CancelFunc
healthWg sync.WaitGroup
// healthCtx and healthCancel control the health check goroutine lifecycle.
healthCtx context.Context
healthCancel context.CancelFunc
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthWg sync.WaitGroup

Collaborator Author


Thanks, will implement.

Comment thread internal/plugin/server.go
// nvidiaDevicePlugin implements the Kubernetes device plugin API
type nvidiaDevicePlugin struct {
pluginapi.UnimplementedDevicePluginServer
ctx context.Context
Member


Question: Why do we need a second context?

Collaborator Author


So we can cancel the healthchecks independently.

Comment thread internal/plugin/server.go Outdated
Comment on lines +131 to +133
// Recreate context for potential plugin restart. The same plugin instance
// may be restarted via Start() after Stop(), so we need a fresh context.
plugin.healthCtx, plugin.healthCancel = context.WithCancel(plugin.ctx)
Member


When are we ever setting these to nil now? This should ONLY be done in the constructor.

Collaborator Author


Addressed by da507ef


Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 20 changed files in this pull request and generated 4 comments.



Comment thread internal/plugin/server.go
return fmt.Errorf("error waiting for MPS daemon: %w", err)
}

plugin.health = make(chan *rm.Device)

Copilot AI Feb 12, 2026


plugin.health is created as an unbuffered channel and the health-check goroutine can send to it even when no ListAndWatch stream is active. Since Stop() now cancels the context and then waits for the health goroutine to exit, a blocked send on plugin.health can deadlock Stop() (context cancellation won’t unblock a channel send). Consider making the channel buffered (e.g., sized to number of devices) and/or ensuring all sends in health-check code are non-blocking/select on ctx.Done().

Suggested change
plugin.health = make(chan *rm.Device)
// Use a buffered channel for health notifications to avoid blocking when
// no ListAndWatch stream is currently consuming from plugin.health.
healthBufSize := len(plugin.rm.Devices())
if healthBufSize < 1 {
healthBufSize = 1
}
plugin.health = make(chan *rm.Device, healthBufSize)

Comment thread internal/plugin/server.go
Comment on lines 125 to 129
err := plugin.Serve()
if err != nil {
klog.Errorf("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)
plugin.cleanup()
return err
}

Copilot AI Feb 12, 2026


On Serve() failure, Start() returns immediately after creating plugin.health and plugin.healthCtx but does not cancel/close them. If the same plugin instance is retried or kept around, this leaks resources and leaves fields in a partially-initialized state. Consider deferring a cleanup (cancel + close channel) on the Serve() error path.

Comment thread internal/rm/health.go
Comment on lines 99 to 101
deviceIDToGiMap[d.ID] = gi
deviceIDToCiMap[d.ID] = ci
parentToDeviceMap[uuid] = d

Copilot AI Feb 12, 2026


parentToDeviceMap is keyed by uuid returned from getDevicePlacement(), which is the parent UUID for MIG devices. If multiple MIG devices share the same parent GPU, this assignment overwrites earlier entries, so events for that parent can only ever map to one MIG device. To correctly handle multiple MIG instances per parent, use a mapping that can represent multiple devices per parent (e.g., parent UUID -> list, or parent UUID -> (GI,CI)->device).

Collaborator Author


Good catch — this is a pre-existing issue. When multiple MIG devices share the same parent GPU, only the last one gets stored. I'll track this as a separate fix (needs its own tests for the multi-MIG case) in a follow-up PR.

Contributor


Yes, sounds good.

Comment thread internal/rm/health.go
Comment on lines 157 to 162
if ret != nvml.SUCCESS {
	klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
	for _, d := range h.devices {
		h.unhealthy <- d
	}
	continue

Copilot AI Feb 12, 2026


When eventSet.Wait() returns a non-timeout error, the loop marks all devices unhealthy and continues. If the error is persistent, this can spin forever and repeatedly attempt blocking sends to h.unhealthy, preventing shutdown if the receiver isn’t draining fast enough. Consider returning the error (or backing off) and make unhealthy notifications context-aware (select on ctx.Done()) to avoid blocking indefinitely.

Collaborator Author


This is pre-existing behavior — the busy loop on persistent NVML errors existed before this refactor. The new code does improve the situation: markUnhealthy is now context-aware, so the loop can be interrupted by cancellation even during sends. Adding backoff or error-count limits would be a good follow-up improvement.

@guptaNswati
Contributor

Thank you. I didn't get a chance to test it. Can you please paste the test logs?

Unfortunately not; I didn't save the logs, and the cluster I used is down.

I know these are straightforward changes and you must have tested them very well. I feel more comfortable when I see the test logs, so I will try to set up a cluster next week and see if we can test them.

@ArangoGutierrez
Collaborator Author

@guptaNswati

I0216 18:53:08.797944  1 main.go:250] "Starting NVIDIA Device Plugin" version=v0.19.0-dev commit: d6c787ca9
I0216 18:53:08.797993  1 main.go:253] Starting FS watcher for /var/lib/kubelet/device-plugins
I0216 18:53:08.798043  1 main.go:260] Starting OS watcher.
I0216 18:53:08.798413  1 main.go:275] Starting Plugins.
I0216 18:53:08.798440  1 main.go:332] Loading configuration.
I0216 18:53:08.837087  1 server.go:203] Starting GRPC server for 'nvidia.com/gpu'
I0216 18:53:08.837916  1 server.go:130] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0216 18:53:08.840309  1 server.go:137] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0216 18:53:08.843525  1 health.go:76] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]

@guptaNswati
Contributor

@guptaNswati

I0216 18:53:08.797944  1 main.go:250] "Starting NVIDIA Device Plugin" version=v0.19.0-dev commit: d6c787ca9
I0216 18:53:08.797993  1 main.go:253] Starting FS watcher for /var/lib/kubelet/device-plugins
I0216 18:53:08.798043  1 main.go:260] Starting OS watcher.
I0216 18:53:08.798413  1 main.go:275] Starting Plugins.
I0216 18:53:08.798440  1 main.go:332] Loading configuration.
I0216 18:53:08.837087  1 server.go:203] Starting GRPC server for 'nvidia.com/gpu'
I0216 18:53:08.837916  1 server.go:130] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0216 18:53:08.840309  1 server.go:137] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0216 18:53:08.843525  1 health.go:76] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]

Nice, thank you. So for the device plugin we always log which XIDs are ignored; we didn't do that for DRA. Also, can you please share your mock test log triggering an XID error and the action taken?

Is this a valid concern? #1601 (comment)

I have a few other minor comments that I would like answered.

We are very close to merging this. Thank you @ArangoGutierrez

@ArangoGutierrez
Collaborator Author

@guptaNswati I have addressed all your comments
cc @rajatchopra

@guptaNswati
Contributor

@ArangoGutierrez this does seem like a legit bug: #1601 (comment). parentToDeviceMap overwrites entries when multiple MIG devices share a parent GPU. I see it's from the old code, but since we are refactoring, let's fix it. This is what I did for DRA: type devicePlacementMap map[string]map[uint32]map[uint32]*Device
https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/cmd/gpu-kubelet-plugin/device_health.go#L36

Is it possible to add a test for MIG devices?
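The nested-map shape referenced above (mirroring the linked DRA driver code) can be sketched as follows; the `Device` type and `add` helper are illustrative, not code from either repository:

```go
package main

import "fmt"

// Device is a stand-in for the plugin's rm.Device type.
type Device struct{ ID string }

// devicePlacementMap maps parent GPU UUID -> GPU instance ID -> compute
// instance ID -> device, so multiple MIG devices sharing a parent GPU no
// longer overwrite each other.
type devicePlacementMap map[string]map[uint32]map[uint32]*Device

// add inserts a device under its parent UUID and (GI, CI) placement,
// allocating the inner maps on first use.
func (m devicePlacementMap) add(parentUUID string, gi, ci uint32, d *Device) {
	if m[parentUUID] == nil {
		m[parentUUID] = make(map[uint32]map[uint32]*Device)
	}
	if m[parentUUID][gi] == nil {
		m[parentUUID][gi] = make(map[uint32]*Device)
	}
	m[parentUUID][gi][ci] = d
}

func main() {
	m := make(devicePlacementMap)
	// Two MIG devices on the same parent GPU: both survive.
	m.add("GPU-parent", 0, 0, &Device{ID: "MIG-a"})
	m.add("GPU-parent", 1, 0, &Device{ID: "MIG-b"})
	fmt.Println(len(m["GPU-parent"]))        // prints: 2
	fmt.Println(m["GPU-parent"][1][0].ID)    // prints: MIG-b
}
```

With the flat `map[string]*Device`, the second `add` would have replaced the first; the (GI, CI) levels are what make events on a shared parent resolvable to the right MIG device.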

Comment thread internal/plugin/server.go
klog.Errorf("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)
plugin.healthCancel()
return err
}
Contributor


@ArangoGutierrez (nit) though healthCancel() properly cancels things, should these still be set to nil?

plugin.health = nil
plugin.server = nil

Collaborator Author


Good point. Now the error path also calls plugin.server.Stop() and nils both plugin.server and plugin.health to leave the struct in a clean state for potential retry. See latest commit.

ArangoGutierrez and others added 10 commits March 19, 2026 21:31
Add vendored NVML mock packages from go-nvml including the dgxa100 mock
server. These mocks enable deterministic unit testing of NVML health
check behavior without requiring actual GPU hardware.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Refactor health check implementation:

- Extract health monitoring logic into nvmlHealthProvider struct with
  registerDeviceEvents() and runEventMonitor() methods, making health
  checks testable in isolation
- Migrate CheckHealth interface from stop channel to context.Context for
  idiomatic cancellation
- Extract device placement logic into withDevicePlacements struct to
  decouple from resource manager
- Clarify log message for event registration failure to explain why
  device is marked unhealthy (aligns with DRA driver convention)
- Add unit tests for XID event handling and ERROR_NOT_SUPPORTED
  regression using NVML mock infrastructure

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace the stop channel with context.Context cancellation and
sync.WaitGroup for health check goroutine lifecycle management:

- Initialize health channel and healthCtx before Serve() to eliminate
  race where kubelet could call ListAndWatch before fields are set
- Use sync.WaitGroup to ensure health goroutine completes before
  channel close in Stop()
- Add nil guards in Stop() to handle partial Start() failure safely
- Add structured error handling for health check completion (success,
  canceled, error cases)
- Add debug logging in ListAndWatch for context cancellation and
  channel closure
- Fix updateResponseForMPS receiver from value to pointer

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
…adlock

All sends to the unhealthy channel now use a select with ctx.Done() to
prevent blocking when no ListAndWatch stream is consuming. This fixes:
- Deadlock in Stop() when health goroutine is blocked on channel send
- Potential send-on-closed-channel panic during shutdown race

Remove close(plugin.health) from Stop() since context cancellation is
now the sole shutdown mechanism. Simplify ListAndWatch accordingly.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
If Serve() fails, healthCtx was leaked because Start() returned without
canceling it. Now explicitly cancel to avoid leaking the context goroutine.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Address reviewer feedback to include 'Device' in the name and use
'Checker' to better reflect the type's responsibility of running
health checks rather than providing health data.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Use 'p' (for placements) instead of 'r' (for resource manager) since
the receiver is no longer on nvmlResourceManager.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Explain why device placement logic is decoupled from nvmlResourceManager:
to allow reuse across different health-check providers for NVsentinel and
Device-API integration.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Address reviewer question about why the NVML interface is duplicated
on the health checker rather than accessed via nvmlResourceManager.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
On Serve() error, healthCancel was called but plugin.health and
plugin.server were left initialized. This could leak the server and
leave the plugin in a partially-initialized state if Start() is retried.
Now explicitly stop the server and nil both fields on the error path.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez
Collaborator Author

@guptaNswati comments addressed, PTAL

@guptaNswati
Contributor

guptaNswati commented Mar 19, 2026

@guptaNswati comments addressed, PTAL

Thank you @ArangoGutierrez

LGTM. Approved.

Please open an issue for the follow-up, especially this one: #1601 (comment)

@guptaNswati
Contributor

I don't seem to have approval rights.

cc @tariq1890 @shivamerla @cdesiniotis, can I get access rights?
