Skip to content

feat(webapp,redis): handle UNBLOCKED during ElastiCache role change#3549

Merged
ericallam merged 1 commit into
mainfrom
feature/tri-9217-featwebappredis-handle-unblocked-during-elasticache-scale-up
May 11, 2026
Merged

feat(webapp,redis): handle UNBLOCKED during ElastiCache role change#3549
ericallam merged 1 commit into
mainfrom
feature/tri-9217-featwebappredis-handle-unblocked-during-elasticache-scale-up

Conversation

@ericallam
Copy link
Copy Markdown
Member

@ericallam ericallam commented May 11, 2026

Summary

When ElastiCache demotes a primary to replica — during a Multi-AZ failover or a vertical node-type change — the demoting primary issues an UNBLOCKED reply to any in-flight blocking commands (BLPOP, BRPOP, BLMOVE, XREADGROUP ... BLOCK, etc.) to clear them before the role flips. ioredis surfaces these as ReplyError to caller code.

The shared defaultReconnectOnError added in #3548 only matches READONLY and LOADING. This extends it to UNBLOCKED so the disconnect-reconnect-retry cycle handles BLPOP-shaped errors the same way the existing two cases handle non-blocking-command errors.

Fix

export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (
    msg.startsWith("READONLY") ||
    msg.startsWith("LOADING") ||
    msg.startsWith("UNBLOCKED")
  ) {
    return 2;
  }
  return false;
}

Returning 2 tells ioredis to disconnect, reconnect, and re-issue the command. For a BLPOP that means a fresh BLPOP against the new primary instead of the UNBLOCKED error escaping to the caller.

Test plan

  • CI green
  • Trigger a Multi-AZ failover or a vertical scale event on an ElastiCache replication group whose clients are running blocking commands and confirm no UNBLOCKED errors surface to caller code during the cutover.

PR #3548's defaultReconnectOnError only matches READONLY and LOADING.
The TRI-8873 test-cloud scale-up dry-run exposed a third reply-error
pattern: UNBLOCKED.

When ElastiCache demotes a node from primary to replica, the (still)
primary issues an UNBLOCKED reply to any in-flight blocking commands
(BLPOP, BRPOP, BLMOVE, XREADGROUP ... BLOCK, etc.) on connections
that the cutover is about to take down. ioredis surfaces this as a
ReplyError with message:

  UNBLOCKED force unblock from blocking operation,
  instance state changed (master -> replica?)

The TRI-8873 scale-up triggered 65 of these in a single instant at
the cutover, all on engine/v1/worker-actions/dequeue (the supervisor's
BLPOP-on-the-run-queue path). Supervisors retried so customer impact
was minimal, but it's a real gap in the mitigation.

Adding 'UNBLOCKED' to the matcher makes ioredis disconnect, reconnect,
and re-issue the failed command against the new primary — same
disconnect-reconnect-retry pattern READONLY/LOADING already use.

Refs TRI-9217 TRI-8873 TRI-8868 TRI-8878.
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 11, 2026

⚠️ No Changeset found

Latest commit: 7e49f08

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

Review Change Stack

Walkthrough

This PR extends the ioredis reconnect-on-error hook to recognize UNBLOCKED responses alongside existing READONLY and LOADING prefixes. The defaultReconnectOnError function is updated to return 2 when any of these three error prefixes are detected, enabling blocking Redis commands like BLPOP to transparently recover when ElastiCache forces unblocking during node role transitions. A changelog entry documents this enhancement.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: extending error handling to include UNBLOCKED errors during ElastiCache role changes, matching the core objective of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description provides comprehensive context, including the problem (UNBLOCKED errors during ElastiCache failover), the fix implementation, and testing approach.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/tri-9217-featwebappredis-handle-unblocked-during-elasticache-scale-up

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericallam ericallam marked this pull request as ready for review May 11, 2026 09:21
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.

Open in Devin Review

@ericallam ericallam merged commit a5ba406 into main May 11, 2026
44 checks passed
@ericallam ericallam deleted the feature/tri-9217-featwebappredis-handle-unblocked-during-elasticache-scale-up branch May 11, 2026 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants