feat(webapp,redis): handle READONLY / LOADING during ElastiCache failover#3548
feat(webapp,redis): handle READONLY / LOADING during ElastiCache failover#3548
Conversation
|
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (9)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
WalkthroughThis PR defines and exports a new Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
During an ElastiCache role swap or node-type change, the TCP/TLS connection stays open but the server starts answering with READONLY (the client is talking to a node that became a replica) or LOADING (node still loading data from disk). Without an explicit hook, these errors surface to caller code as ReplyError instances — every write op on the affected connection fails until the cluster cuts over. Returning 2 from reconnectOnError tells ioredis to tear down the connection, reconnect, and re-issue the failed command. After reconnect, DNS / SG state routes the new socket to a writable node. The shared createRedisClient helper picks this up automatically. defaultReconnectOnError is exported for direct ioredis call sites that bypass createRedisClient.
The webapp constructs ioredis clients directly in several prod files that bypass the shared createRedisClient helper. Each of these needs the same reconnectOnError hook so an ElastiCache role swap or node-type change on the realtime / cache / socket-io / cluster Redis instances doesn't surface READONLY / LOADING reply errors to caller code. V1-only marqs files are intentionally not migrated.
0ed192d to
4df210c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/webapp/app/services/realtime/redisRealtimeStreams.server.ts`:
- Around line 39-40: The reconnectOnError property is currently placed before
the spread of ...this.options.redis which lets callers override or nullify the
default failover handler; move the reconnectOnError assignment to after the
spread and set it using a nullish-coalescing fallback (i.e., use the
caller-provided reconnectOnError if explicitly set, otherwise
defaultReconnectOnError) so that reconnectOnError, defaultReconnectOnError and
...this.options.redis are applied in the correct order and the failover handler
is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c4d97766-1dd3-468d-86fb-5ad3635c68f1
📒 Files selected for processing (11)
.server-changes/redis-reconnect-on-readonly-loading.mdapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/redis.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsapps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/v3/handleSocketIo.server.tsinternal-packages/redis/src/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: sdk-compat / Deno Runtime
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
- GitHub Check: sdk-compat / Cloudflare Workers
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: sdk-compat / Bun Runtime
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
- GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (10)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: Access environment variables through theenvexport ofenv.server.tsinstead of directly accessingprocess.env
Use subpath exports from@trigger.dev/corepackage instead of importing from the root@trigger.dev/corepathUse named constants for sentinel/placeholder values (e.g.
const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
apps/webapp/**/*.server.ts
📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
apps/webapp/**/*.server.ts: Never userequest.signalfor detecting client disconnects. UsegetRequestAbortSignal()fromapp/services/httpAsyncStorage.server.tsinstead, which is wired directly to Expressres.on('close')and fires reliably
Access environment variables viaenvexport fromapp/env.server.ts. Never useprocess.envdirectly
Always usefindFirstinstead offindUniquein Prisma queries.findUniquehas an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance).findFirstis never batched and avoids this entire class of issues
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
{apps,internal-packages}/**/*.{ts,tsx,js}
📄 CodeRabbit inference engine (CLAUDE.md)
Use
pnpm run typecheckto verify changes in apps and internal packages (apps/*,internal-packages/*) instead ofbuild, which proves almost nothing about correctness
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
{package.json,**/*.{ts,tsx,js}}
📄 CodeRabbit inference engine (CLAUDE.md)
Pin Zod to version 3.25.76 exactly across the entire monorepo - never use a different version or version range
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
**/*.{ts,tsx,js}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx,js}: Import from@trigger.dev/coreusing subpaths only, never the root export
Always import tasks from@trigger.dev/sdk, never from@trigger.dev/sdk/v3or deprecatedclient.defineJob
Add crumbs to code using//@Crumbscomments or `// `#region` `@crumbsblocks for debug tracing during development
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
**/*.{ts,tsx,js,jsx,json,md,css,scss}
📄 CodeRabbit inference engine (AGENTS.md)
Code formatting is enforced using Prettier. Run
pnpm run formatbefore committing
Files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
🧠 Learnings (6)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsinternal-packages/redis/src/index.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.
Applied to files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.
Applied to files:
apps/webapp/app/services/taskIdentifierCache.server.tsapps/webapp/app/services/autoIncrementCounter.server.tsapps/webapp/app/presenters/v3/DevPresence.server.tsapps/webapp/app/services/platformNotificationCounter.server.tsapps/webapp/app/v3/handleSocketIo.server.tsapps/webapp/app/services/sessionStreamWaitpointCache.server.tsapps/webapp/app/services/realtime/redisRealtimeStreams.server.tsapps/webapp/app/services/inputStreamWaitpointCache.server.tsapps/webapp/app/redis.server.ts
📚 Learning: 2026-02-06T19:53:38.843Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/presenters/v3/DeploymentListPresenter.server.ts:233-237
Timestamp: 2026-02-06T19:53:38.843Z
Learning: When constructing Vercel dashboard URLs from deployment IDs, always strip the dpl_ prefix from the ID. Implement this by transforming the ID with .replace(/^dpl_/, "") before concatenating into the URL: https://vercel.com/${teamSlug}/${projectName}/${cleanedDeploymentId}. Consider centralizing this logic in a small helper (e.g., getVercelDeploymentId(id) or a URL builder) and add tests to verify both prefixed and non-prefixed inputs.
Applied to files:
apps/webapp/app/presenters/v3/DevPresence.server.ts
📚 Learning: 2026-03-29T19:16:28.864Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3291
File: apps/webapp/app/v3/featureFlags.ts:53-65
Timestamp: 2026-03-29T19:16:28.864Z
Learning: When reviewing TypeScript code that uses Zod v3, treat `z.coerce.*()` schemas as their direct Zod type (e.g., `z.coerce.boolean()` returns a `ZodBoolean` with `_def.typeName === "ZodBoolean"`) rather than a `ZodEffects`. Only `.preprocess()`, `.refine()`/`.superRefine()`, and `.transform()` are expected to wrap schemas in `ZodEffects`. Therefore, in reviewers’ logic like `getFlagControlType`, do not flag/unblock failures that require unwrapping `ZodEffects` when the input schema is a `z.coerce.*` schema.
Applied to files:
apps/webapp/app/v3/handleSocketIo.server.ts
🔇 Additional comments (10)
internal-packages/redis/src/index.ts (1)
6-23: Reconnect-on-error helper is cleanly centralized and correctly wired.The mapping and default option integration are consistent with the failover goal and keep client behavior uniform across consumers.
Also applies to: 31-31
.server-changes/redis-reconnect-on-readonly-loading.md (1)
1-7: Change note accurately reflects behavior and scope.This doc update is aligned with the Redis reconnect handling introduced in code.
apps/webapp/app/services/taskIdentifierCache.server.ts (1)
2-2: Good consistency update for cache Redis initialization.Applying the shared reconnect policy here keeps behavior aligned with the rest of the Redis-backed services.
Also applies to: 57-57
apps/webapp/app/v3/handleSocketIo.server.ts (1)
18-18: Socket.IO Redis adapter path is now correctly aligned with reconnect policy.Nice to see the pub/sub transport included so failover behavior is consistent across realtime traffic too.
Also applies to: 97-97
apps/webapp/app/redis.server.ts (1)
2-2: Cluster and standalone paths are both covered correctly.Wiring the same reconnect hook into both constructors avoids behavioral drift between modes.
Also applies to: 46-46, 74-74
apps/webapp/app/services/autoIncrementCounter.server.ts (1)
2-2: Reconnect policy integration inAutoIncrementCounterlooks good.The change is focused and preserves existing transactional behavior while improving failover resilience.
Also applies to: 15-15
apps/webapp/app/presenters/v3/DevPresence.server.ts (1)
2-2: Dev presence client update is straightforward and consistent.This keeps reconnect behavior aligned without altering presence semantics.
Also applies to: 11-11
apps/webapp/app/services/sessionStreamWaitpointCache.server.ts (1)
2-2: Session stream waitpoint cache wiring is correct.Good consistency with the shared reconnect strategy across Redis-backed services.
Also applies to: 32-32
apps/webapp/app/services/platformNotificationCounter.server.ts (1)
2-2: Reconnect-on-error wiring looks good.Line 22 correctly applies the shared
defaultReconnectOnErrorhandler to this Redis client and matches the failover objective.Also applies to: 22-22
apps/webapp/app/services/inputStreamWaitpointCache.server.ts (1)
2-2: Looks good — cache client now has failover-aware reconnect behavior.The added
reconnectOnError: defaultReconnectOnErroris correctly applied for this Redis instance.Also applies to: 28-28
…3549) ## Summary When ElastiCache demotes a primary to replica — during a Multi-AZ failover or a vertical node-type change — the demoting primary issues an `UNBLOCKED` reply to any in-flight blocking commands (`BLPOP`, `BRPOP`, `BLMOVE`, `XREADGROUP ... BLOCK`, etc.) to clear them before the role flips. ioredis surfaces these as `ReplyError` to caller code. The shared `defaultReconnectOnError` added in #3548 only matches `READONLY` and `LOADING`. This extends it to `UNBLOCKED` so the disconnect-reconnect-retry cycle handles BLPOP-shaped errors the same way the existing two cases handle non-blocking-command errors. ## Fix ```ts export function defaultReconnectOnError(err: Error): boolean | 1 | 2 { const msg = err.message ?? ""; if ( msg.startsWith("READONLY") || msg.startsWith("LOADING") || msg.startsWith("UNBLOCKED") ) { return 2; } return false; } ``` Returning `2` tells ioredis to disconnect, reconnect, and re-issue the command. For a BLPOP that means a fresh BLPOP against the new primary instead of the `UNBLOCKED` error escaping to the caller. ## Test plan - [ ] CI green - [ ] Trigger a Multi-AZ failover or a vertical scale event on an ElastiCache replication group whose clients are running blocking commands and confirm no `UNBLOCKED` errors surface to caller code during the cutover.
Summary
During an ElastiCache role swap (failover) or node-type change (vertical scale), the ioredis TCP/TLS connection stays open but the server starts answering with
READONLY(the client is talking to a node that became a replica) orLOADING(node still loading data from disk). Without an explicit hook, those errors surface to caller code asReplyErrorinstances — every write op on the affected connection fails until the cluster fully cuts over.This PR adds
reconnectOnErrorto every prod ioredis client so the disconnect + reconnect + retry cycle absorbs these errors and caller code never sees them.Fix
Returning
2tells ioredis to disconnect, reconnect, and re-issue the failed command. After reconnect, DNS / SG state routes the new socket to a writable node.The helper lives in
@internal/redisand is wired into both the sharedcreateRedisClient(which covers RunQueue, schedule-engine, redis-worker, and every other internal-package consumer) and the directnew Redis(...)call sites in the webapp.V1-only marqs files are intentionally not migrated.
Test plan
pnpm run typecheck --filter webapppnpm run typecheck --filter @internal/run-engine