
fix(worker-ops): recover failed workers before scheduler gating#69

Open
liobrasil wants to merge 1 commit into main from lionel/fix-worker-recovery-before-gating

Conversation

@liobrasil
Collaborator

Summary

  • run failed-worker recovery at the top of SchedulerHandler.executeTransaction(...)
  • remove the duplicate recovery call from _runScheduler(...)
  • stop failed/stale in-flight worker entries from having to wait for fresh pending work before being marked failed

Why

A failed worker request can remain stuck in PROCESSING until another pending request arrives, because _checkForFailedWorkerRequests(...) only ran inside _runScheduler(...), and _runScheduler(...) only executes when the scheduler sees pending work to process.

By moving failed-worker recovery ahead of capacity and pending-request gating, stale in-flight entries are cleaned up on every scheduler tick.
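A minimal Cadence sketch of the reordered entry point. The function and variable names here (`paused`, `pendingRequestCount`, `currentInFlight`, `capacityLimit`, the manager reference) are assumptions inferred from the PR description, not the repository's actual code:

```cadence
access(all) fun executeTransaction(managerRef: &{FlowTransactionSchedulerUtils.Manager}) {
    // Pause guard stays first: a paused scheduler does nothing.
    if self.paused { return }

    // Recovery now runs on every scheduler tick, before any gating.
    self._checkForFailedWorkerRequests(manager: managerRef)

    // Capacity is read after recovery, so slots freed by cleaning up
    // stale workers are reflected in the same tick.
    let currentInFlight = self.scheduledRequests.length
    if currentInFlight >= self.capacityLimit { return }

    // Only now gate on pending work.
    if self.pendingRequestCount > 0 {
        self._runScheduler(managerRef)
    }
}
```

Previously the recovery call lived inside `_runScheduler(...)`, behind the `pendingRequestCount > 0` branch, which is what left failed workers stranded when the queue ran dry.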

Testing

  • flow test cadence/tests/evm_bridge_lifecycle_test.cdc cadence/tests/access_control_test.cdc cadence/tests/error_handling_test.cdc cadence/tests/validation_test.cdc

@liobrasil liobrasil requested a review from nvdtf March 11, 2026 23:41
let worker = self.workerCap.borrow()!

// Always clear failed/stale worker entries before capacity and backlog checks.
// Otherwise a reverted in-flight worker can remain stuck until new pending work arrives.
Member

This comment explains the change made here; if someone reads it on its own in the future, it won't make sense.

Member

// Check and recover panicked worker entries.

@claude

claude bot commented Mar 11, 2026

PR Review: fix(worker-ops): recover failed workers before scheduler gating

Summary: Moves _checkForFailedWorkerRequests from inside _runScheduler (only called when pendingRequestCount > 0) to the top of executeTransaction (called on every scheduler tick). This fixes a real bug where a panicked worker request could remain stuck in PROCESSING indefinitely if the request queue ran dry.


Correctness

The fix is correct. The root cause is accurately diagnosed: _runScheduler is only invoked inside the pendingRequestCount > 0 branch, so a failed worker with no queued successors had no way to be cleaned up.

The new placement is correct relative to the capacity calculation. currentInFlight is read after the recovery call completes, so any slots freed by recovering stale workers are immediately reflected in capacityLimit — a single scheduler tick can both clean up a failed worker and schedule new work if capacity allows.

Reference type

_checkForFailedWorkerRequests accepts &{FlowTransactionSchedulerUtils.Manager} (non-authorized). The manager in executeTransaction carries auth restrictions. Passing an authorized reference where a less-privileged one is expected is valid Cadence; no issue here.

Paused scheduler interaction

Recovery is correctly placed after the pause guard. A paused scheduler intentionally does nothing; that invariant is preserved.

Docstring update

The _runScheduler doc comment is correctly trimmed from 5 steps to 4 to reflect the removed call.

Comment quality

The inline comment at the new call site clearly explains both what and why:

// Always clear failed/stale worker entries before capacity and backlog checks.
// Otherwise a reverted in-flight worker can remain stuck until new pending work arrives.

Suggestions

Missing regression test for the fixed scenario (non-blocking)

The test suite (error_handling_test.cdc, evm_bridge_lifecycle_test.cdc) has no coverage for the specific scenario this PR fixes: a worker panic with no new requests in the queue. A test along these lines would lock in the fix:

  1. Submit a request, let it get scheduled, simulate a worker handler panic
  2. Ensure pending request count is 0 (no new requests queued)
  3. Tick the scheduler
  4. Assert: request status is FAILED (not still PROCESSING)
  5. Assert: scheduledRequests is empty

Without this, the bug could silently regress. The PR description notes the existing suite was run, but those tests do not cover this no-pending-work failure path.
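The steps above could be sketched with the Cadence `Test` framework. Helper names such as `submitRequest`, `tick`, `simulateWorkerPanic`, `pendingRequestCount`, `requestStatus`, and `scheduledRequestsCount` are placeholders for whatever harness the repo's existing `.cdc` tests use, not real APIs:

```cadence
import Test

access(all) fun testFailedWorkerRecoveredWithoutPendingWork() {
    // 1. Submit a request and let it get scheduled, then make the
    //    worker handler panic so the entry stays in-flight.
    let requestID = submitRequest()
    tick() // worker picks it up; status becomes PROCESSING
    simulateWorkerPanic(requestID: requestID)

    // 2. Ensure no new requests are queued.
    Test.assertEqual(0, pendingRequestCount())

    // 3. Tick the scheduler with an empty queue.
    tick()

    // 4. Recovery must have run even with no pending work.
    Test.assertEqual("FAILED", requestStatus(requestID: requestID))

    // 5. The stale in-flight entry must be cleared.
    Test.assertEqual(0, scheduledRequestsCount())
}
```

Before this PR, step 3 would have been a no-op (no pending work means `_runScheduler` never runs), so the assertions in steps 4 and 5 would fail.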


Verdict

The change is small, focused, and correctly fixes the described bug with no apparent side effects. Logic looks good; adding a regression test for the fixed scenario is the one gap worth addressing.

