
fix(worker-ops): recover failed workers before scheduler gating#69

Open
liobrasil wants to merge 1 commit into main from lionel/fix-worker-recovery-before-gating

Conversation

@liobrasil
Collaborator

Summary

  • run failed-worker recovery at the top of SchedulerHandler.executeTransaction(...)
  • remove the duplicate recovery call from _runScheduler(...)
  • stop failed/stale in-flight worker entries from having to wait for fresh pending work before being marked failed

Why

A failed worker request can remain stuck in PROCESSING until another pending request arrives, because _checkForFailedWorkerRequests(...) only ran inside _runScheduler(...), and _runScheduler(...) only executes when the scheduler sees pending work to process.

By moving failed-worker recovery ahead of capacity and pending-request gating, stale in-flight entries are cleaned up on every scheduler tick.
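A minimal Cadence sketch of the reordered entry point. The function and variable names here (`paused`, `pendingRequestCount`, `currentInFlight`, `capacityLimit`, the manager reference) are assumptions inferred from the PR description, not the repository's actual code:

```cadence
access(all) fun executeTransaction(managerRef: &{FlowTransactionSchedulerUtils.Manager}) {
    // Pause guard stays first: a paused scheduler does nothing.
    if self.paused { return }

    // Recovery now runs on every scheduler tick, before any gating.
    self._checkForFailedWorkerRequests(manager: managerRef)

    // Capacity is read after recovery, so slots freed by cleaning up
    // stale workers are reflected in the same tick.
    let currentInFlight = self.scheduledRequests.length
    if currentInFlight >= self.capacityLimit { return }

    // Only now gate on pending work.
    if self.pendingRequestCount > 0 {
        self._runScheduler(managerRef)
    }
}
```

Previously the recovery call lived inside `_runScheduler(...)`, behind the `pendingRequestCount > 0` branch, which is what left failed workers stranded when the queue ran dry.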

Testing

  • flow test cadence/tests/evm_bridge_lifecycle_test.cdc cadence/tests/access_control_test.cdc cadence/tests/error_handling_test.cdc cadence/tests/validation_test.cdc

@liobrasil liobrasil requested a review from nvdtf March 11, 2026 23:41
let worker = self.workerCap.borrow()!

// Always clear failed/stale worker entries before capacity and backlog checks.
// Otherwise a reverted in-flight worker can remain stuck until new pending work arrives.
Member

This comment explains the change made here; if someone reads it on its own in the future, it won't make sense.

Member

// Check and recover panicked worker entries.

@claude

claude bot commented Mar 11, 2026

PR Review: fix(worker-ops): recover failed workers before scheduler gating

Summary: Moves _checkForFailedWorkerRequests from inside _runScheduler (only called when pendingRequestCount > 0) to the top of executeTransaction (called on every scheduler tick). This fixes a real bug where a panicked worker request could remain stuck in PROCESSING indefinitely if the request queue ran dry.


Correctness

The fix is correct. The root cause is accurately diagnosed: _runScheduler is only invoked inside the pendingRequestCount > 0 branch, so a failed worker with no queued successors had no way to be cleaned up.

The new placement is correct relative to the capacity calculation. currentInFlight is read after the recovery call completes, so any slots freed by recovering stale workers are immediately reflected in capacityLimit — a single scheduler tick can both clean up a failed worker and schedule new work if capacity allows.

Reference type

_checkForFailedWorkerRequests accepts &{FlowTransactionSchedulerUtils.Manager} (non-authorized). The manager in executeTransaction carries auth restrictions. Passing an authorized reference where a less-privileged one is expected is valid Cadence; no issue here.

Paused scheduler interaction

Recovery is correctly placed after the pause guard. A paused scheduler intentionally does nothing; that invariant is preserved.

Docstring update

The _runScheduler doc comment is correctly trimmed from 5 steps to 4 to reflect the removed call.

Comment quality

The inline comment at the new call site clearly explains both what and why:

// Always clear failed/stale worker entries before capacity and backlog checks.
// Otherwise a reverted in-flight worker can remain stuck until new pending work arrives.

Suggestions

Missing regression test for the fixed scenario (non-blocking)

The test suite (error_handling_test.cdc, evm_bridge_lifecycle_test.cdc) has no coverage for the specific scenario this PR fixes: a worker panic with no new requests in the queue. A test along these lines would lock in the fix:

  1. Submit a request, let it get scheduled, simulate a worker handler panic
  2. Ensure pending request count is 0 (no new requests queued)
  3. Tick the scheduler
  4. Assert: request status is FAILED (not still PROCESSING)
  5. Assert: scheduledRequests is empty

Without this, the bug could silently regress. The PR description notes the existing suite was run, but those tests do not cover this no-pending-work failure path.
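The steps above could be sketched with the Cadence `Test` framework. Helper names such as `submitRequest`, `tick`, `simulateWorkerPanic`, `pendingRequestCount`, `requestStatus`, and `scheduledRequestsCount` are placeholders for whatever harness the repo's existing `.cdc` tests use, not real APIs:

```cadence
import Test

access(all) fun testFailedWorkerRecoveredWithoutPendingWork() {
    // 1. Submit a request and let it get scheduled, then make the
    //    worker handler panic so the entry stays in-flight.
    let requestID = submitRequest()
    tick() // worker picks it up; status becomes PROCESSING
    simulateWorkerPanic(requestID: requestID)

    // 2. Ensure no new requests are queued.
    Test.assertEqual(0, pendingRequestCount())

    // 3. Tick the scheduler with an empty queue.
    tick()

    // 4. Recovery must have run even with no pending work.
    Test.assertEqual("FAILED", requestStatus(requestID: requestID))

    // 5. The stale in-flight entry must be cleared.
    Test.assertEqual(0, scheduledRequestsCount())
}
```

Before this PR, step 3 would have been a no-op (no pending work means `_runScheduler` never runs), so the assertions in steps 4 and 5 would fail.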


Verdict

The change is small, focused, and correctly fixes the described bug with no apparent side effects. Logic looks good; adding a regression test for the fixed scenario is the one gap worth addressing.

