fix(run-engine): sweeper LREMs the worker queue list when acking marked runs#3527
fix(run-engine): sweeper LREMs the worker queue list when acking marked runs#3527
Conversation
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🧰 Additional context used📓 Path-based instructions (10)**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{test,spec}.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.ts📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
Files:
{apps,internal-packages}/**/*.{ts,tsx,js}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.test.{ts,tsx,js}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.test.{ts,tsx}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
{package.json,**/*.{ts,tsx,js}}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.{ts,tsx,js}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.{ts,tsx,js,jsx,json,md,css,scss}📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (2)📚 Learning: 2026-03-22T13:26:12.060ZApplied to files:
📚 Learning: 2026-03-22T19:24:14.403ZApplied to files:
🔇 Additional comments (2)
WalkthroughThe concurrency sweeper was changed to remove the corresponding message entry from the worker queue list when acknowledging marked runs. A server-changes note documents this behavior. The code update flips Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Adds a failing test in concurrencySweeper.test.ts that demonstrates an inconsistency between the worker queue list and the message body store: - Fast-path enqueue RPUSHes the messageKey value onto the worker queue list and SADDs the run into currentConcurrency. - The sweeper marks the run as completed (via test callback) and processMarkedRun acks with removeFromWorkerQueue: false, which DELs the message body but skips the LREM on the worker queue list. - A subsequent blocking dequeue BLPOPs the stale messageKey, the GET returns nil, and the dequeue path emits "Failed to dequeue message from worker queue" with workerQueueLength: 0. The test asserts that the dequeue path does not log this error after the sweeper has acked the run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed runs processMarkedRun previously called acknowledgeMessage with removeFromWorkerQueue: false. The Lua script always DELs the message body but only LREMs the worker queue list when the flag is set, leaving a stale messageKey on the list whenever the swept run had been pushed to the worker queue list (fast-path enqueue or processQueueForWorkerQueue promotion) but never BLPOP'd. The next BLPOP returns the tombstone, GET on the messageKey returns nil, and the dequeue path logs "Failed to dequeue message from worker queue". Switch to removeFromWorkerQueue: true. Cost is one extra LREM per swept run (O(N) on the list); production telemetry shows sweeper acks at <0.33/sec hard upper bound on worker queue lists averaging 200-500 entries — under 0.05% Redis CPU even at peak. The repro test added in the previous commit flips to passing with this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
847a4ef to
eacd023
Compare
… logger Address CodeRabbit feedback on concurrencySweeper.test.ts: - Replace vi.spyOn(logger, "error") with a peekAllOnWorkerQueue assertion. The state-based check proves the same invariant (no tombstone exists, so no future BLPOP can fail) and matches the repo's "never mock anything" testing convention. - Update the stale inline comment that said the sweeper acks with removeFromWorkerQueue: false; the implementation now uses true. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The concurrency sweeper's
processMarkedRuncallsacknowledgeMessagewithremoveFromWorkerQueue: false. The Lua script alwaysDELs the message body but onlyLREMs the worker queue list when the flag is set. Result: any run that was pushed onto the worker queue list (fast-path enqueue orprocessQueueForWorkerQueuepromotion) but not yetBLPOP'd when the sweeper finds it leaves a stalemessageKeyvalue sitting on the list. The next workerBLPOPreturns the tombstone,GET messageKeyreturns nil, and the dequeue path logsFailed to dequeue message from worker queue.The original assumption was that sweeper-acked runs had already been BLPOP'd off the list (entry already gone, nothing to LREM). That holds when the worker BLPOPs and then dies, but not when fast-path enqueue or
processQueueForWorkerQueueadds the entry to the list and the run then ages past the sweeper's 10-minute completed-at threshold without ever being popped.Switch to
removeFromWorkerQueue: true. Cost is one extraLREMper swept run, O(N) over the worker queue list length. Sweeper acks are bounded by the cron schedule (every 5 minutes, max 100 marked runs per fire), so the added Redis work is negligible relative to baseline ack workload.Test
internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.tsadds a regression test that fast-path enqueues a message, runs the sweeper, then dequeues. Pre-fix the dequeue path logs the "Failed to dequeue message from worker queue" error; post-fix the worker queue list is left clean and the dequeue returns undefined silently.