fix(worker-ops): recover failed workers before scheduler gating #69
let worker = self.workerCap.borrow()!

// Always clear failed/stale worker entries before capacity and backlog checks.
// Otherwise a reverted in-flight worker can remain stuck until new pending work arrives.
This comment explains the change being made here. If someone reads it on its own in the future, it won't make sense.
// Check and recover panicked worker entries.
PR Review: fix(ops): recover failed workers before scheduler gating

Summary: Moves

Correctness
The fix is correct. The root cause is accurately diagnosed: The new placement is correct relative to the capacity calculation. Reference type

Paused scheduler interaction
Recovery is correctly placed after the pause guard. A paused scheduler intentionally does nothing; that invariant is preserved.

Docstring update
The

Comment quality
The inline comment at the new call site clearly explains both what and why:
// Always clear failed/stale worker entries before capacity and backlog checks.
// Otherwise a reverted in-flight worker can remain stuck until new pending work arrives.

Suggestions
Missing regression test for the fixed scenario (non-blocking)
The test suite (
Without this, the bug could silently regress. The PR description notes the existing suite was run, but those tests do not cover this no-pending-work failure path.

Verdict
The change is small, focused, and correctly fixes the described bug with no apparent side effects. Logic looks good; adding a regression test for the fixed scenario is the one gap worth addressing.
Summary
SchedulerHandler.executeTransaction(...)
_runScheduler(...)
Why
A failed worker request can remain stuck in PROCESSING until another pending request arrives, because _checkForFailedWorkerRequests(...) only ran inside _runScheduler(...), and _runScheduler(...) only executes when the scheduler sees pending work to process.

Moving failed-worker recovery ahead of capacity and pending-request gating cleans up stale in-flight entries on every scheduler tick.
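The ordering issue can be illustrated with a small language-neutral model. This is only a sketch in Python, not the Cadence code from this PR: `SchedulerModel`, `tick_old`, `tick_new`, `recover_failed_workers`, and the `FAILED` status are hypothetical names, and the assumption that recovery marks a reverted request as failed is made purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    FAILED = "failed"


@dataclass
class SchedulerModel:
    """Toy stand-in for the scheduler; every name here is hypothetical."""
    max_workers: int = 1
    paused: bool = False
    requests: dict = field(default_factory=dict)  # request id -> Status
    panicked: set = field(default_factory=set)    # ids whose worker reverted

    def recover_failed_workers(self) -> None:
        # Assumption: recovery marks reverted in-flight requests as FAILED.
        for request_id in list(self.panicked):
            if self.requests.get(request_id) is Status.PROCESSING:
                self.requests[request_id] = Status.FAILED
            self.panicked.discard(request_id)

    def _has_pending(self) -> bool:
        return any(s is Status.PENDING for s in self.requests.values())

    def _dispatch(self) -> None:
        # Start pending requests while worker capacity remains.
        in_flight = sum(s is Status.PROCESSING for s in self.requests.values())
        free = self.max_workers - in_flight
        for request_id, status in self.requests.items():
            if free <= 0:
                break
            if status is Status.PENDING:
                self.requests[request_id] = Status.PROCESSING
                free -= 1

    def tick_old(self) -> None:
        # Pre-fix ordering: recovery is reached only if pending work exists.
        if self.paused or not self._has_pending():
            return
        self.recover_failed_workers()
        self._dispatch()

    def tick_new(self) -> None:
        # Post-fix ordering: recover after the pause guard, before any gating.
        if self.paused:
            return
        self.recover_failed_workers()
        if not self._has_pending():
            return
        self._dispatch()


def stuck_scenario() -> SchedulerModel:
    # One in-flight request whose worker reverted, and no pending work.
    return SchedulerModel(requests={1: Status.PROCESSING}, panicked={1})


if __name__ == "__main__":
    old = stuck_scenario()
    old.tick_old()
    new = stuck_scenario()
    new.tick_new()
    print(old.requests[1].name, new.requests[1].name)
```

In this model the old ordering leaves request 1 in PROCESSING because the tick returns before recovery runs, while the new ordering recovers it on the same tick. A paused scheduler still does nothing in either version.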
Testing
flow test cadence/tests/evm_bridge_lifecycle_test.cdc cadence/tests/access_control_test.cdc cadence/tests/error_handling_test.cdc cadence/tests/validation_test.cdc