feat(webapp,redis): handle UNBLOCKED during ElastiCache role change #3549
PR #3548's defaultReconnectOnError only matches READONLY and LOADING. The TRI-8873 test-cloud scale-up dry-run exposed a third reply-error pattern: UNBLOCKED.

When ElastiCache demotes a node from primary to replica, the (still) primary issues an UNBLOCKED reply to any in-flight blocking commands (BLPOP, BRPOP, BLMOVE, XREADGROUP ... BLOCK, etc.) on connections that the cutover is about to take down. ioredis surfaces this as a ReplyError with message:

UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)

The TRI-8873 scale-up triggered 65 of these in a single instant at the cutover, all on engine/v1/worker-actions/dequeue (the supervisor's BLPOP-on-the-run-queue path). Supervisors retried so customer impact was minimal, but it's a real gap in the mitigation.

Adding 'UNBLOCKED' to the matcher makes ioredis disconnect, reconnect, and re-issue the failed command against the new primary, the same disconnect-reconnect-retry pattern READONLY/LOADING already use.

Refs TRI-9217 TRI-8873 TRI-8868 TRI-8878.
Summary
When ElastiCache demotes a primary to replica (during a Multi-AZ failover or a vertical node-type change), the demoting primary issues an `UNBLOCKED` reply to any in-flight blocking commands (`BLPOP`, `BRPOP`, `BLMOVE`, `XREADGROUP ... BLOCK`, etc.) to clear them before the role flips. ioredis surfaces these as `ReplyError` to caller code.

The shared `defaultReconnectOnError` added in #3548 only matches `READONLY` and `LOADING`. This extends it to `UNBLOCKED` so the disconnect-reconnect-retry cycle handles BLPOP-shaped errors the same way the existing two cases handle non-blocking-command errors.

Fix
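The diff itself is not reproduced here, so the following is only a minimal sketch of what a matcher of this shape might look like. The prefix list name (`RECONNECT_ERROR_PREFIXES`) and the use of `startsWith` are assumptions for illustration; the real `defaultReconnectOnError` may match differently.

```typescript
// Hypothetical reconstruction: the actual helper from #3548/#3549 may differ.
// Reply-error prefixes that indicate the node's role or state has changed.
const RECONNECT_ERROR_PREFIXES = ["READONLY", "LOADING", "UNBLOCKED"] as const;

// Shaped like ioredis's `reconnectOnError` option: (err) => boolean | 1 | 2.
// A return value of 2 asks ioredis to reconnect and resend the failed command;
// false leaves the error to propagate to the caller as usual.
function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const matches = RECONNECT_ERROR_PREFIXES.some((prefix) =>
    err.message.startsWith(prefix)
  );
  return matches ? 2 : false;
}
```

A function like this would be passed as the `reconnectOnError` option when constructing the ioredis client.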
Returning `2` tells ioredis to disconnect, reconnect, and re-issue the command. For a BLPOP that means a fresh BLPOP against the new primary instead of the `UNBLOCKED` error escaping to the caller.

Test plan
`UNBLOCKED` errors surface to caller code during the cutover.