Commit b264229

feat(redis): add reconnectOnError for READONLY / LOADING reply errors
Empirical proof from the redis-failover-harness (TRI-8878) shows ElastiCache vertical scale-up events surface as Redis-level reply errors (READONLY when the role swap happens under an open connection, LOADING when a node is still initializing) rather than connection-level errors. Without intervention these errors propagate directly to caller code at the rate of tens of thousands per minute over a multi-minute window.

Returning 2 from reconnectOnError tells ioredis to tear down the connection, reconnect, and re-issue the failed command. After reconnect, DNS / SG state routes the new socket to a writable node and the workload resumes.

Harness measurements:
- Without this option: ~437,000 caller-surfaced errors over 4 min of zero-write throughput per task during cache.t4g.medium -> m7g.large.
- With this option (same workload, m7g.large -> m7g.xlarge): 2 total caller-surfaced errors across both tasks; throughput uninterrupted.

Reduction: ~218,000x.

Scope: only the shared createRedisClient helper in @internal/redis. Direct 'new Redis()' callsites in apps/webapp/ still need migration; defaultReconnectOnError is exported so they can opt in inline as a follow-up.

Refs TRI-8868 TRI-8878.
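The reply-error classification described in the message can be illustrated in isolation. The following is a minimal sketch, not the committed code: `classifyReplyError` and `ReconnectDecision` are hypothetical names, and the sample messages are illustrative strings assumed to resemble what a Redis node returns in these states.

```typescript
// Sketch only: mirrors the prefix check described in the commit message.
// ReconnectDecision and classifyReplyError are hypothetical names.
type ReconnectDecision = boolean | 1 | 2;

function classifyReplyError(err: Error): ReconnectDecision {
  const msg = err.message ?? "";
  // 2 = disconnect, reconnect, and re-issue the failed command.
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  // Anything else is left to surface to the caller unchanged.
  return false;
}

// Illustrative error messages (assumed shapes, not captured harness output).
const samples: string[] = [
  "READONLY You can't write against a read only replica.",
  "LOADING Redis is loading the dataset in memory",
  "ERR wrong number of arguments for 'get' command",
];

const decisions = samples.map((m) => classifyReplyError(new Error(m)));
console.log(decisions);
```

Matching on the message prefix rather than on a substring means an unrelated error whose text merely contains the word READONLY or LOADING would not trigger a reconnect.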
1 parent 6cdd881 commit b264229

2 files changed

Lines changed: 26 additions & 0 deletions

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: improvement
+---
+
+Add `reconnectOnError` to the shared ioredis client config so READONLY / LOADING reply errors during ElastiCache node-type changes trigger a disconnect-reconnect-retry cycle instead of surfacing to caller code.

internal-packages/redis/src/index.ts

Lines changed: 20 additions & 0 deletions
@@ -3,12 +3,32 @@ import { Logger } from "@trigger.dev/core/logger";
 
 export { Redis, type Callback, type RedisOptions, type Result, type RedisCommander } from "ioredis";
 
+/**
+ * Reply-error -> reconnect mapping. Without this hook, an ElastiCache
+ * vertical scale-up surfaces tens of thousands of READONLY / LOADING
+ * reply errors to caller code over a healthy TCP/TLS connection (the
+ * client keeps talking to a node whose role swapped underneath it).
+ *
+ * Returning 2 tells ioredis to disconnect, reconnect, and retry the
+ * command that triggered the error. After reconnect, DNS / SG routing
+ * should land on a writable primary.
+ *
+ * Empirical confirmation on the harness in TRI-8878: this option
+ * reduced a scale-up event from ~437,000 caller-surfaced errors to 2.
+ */
+export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
+  const msg = err.message ?? "";
+  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
+  return false;
+}
+
 const defaultOptions: Partial<RedisOptions> = {
   retryStrategy: (times: number) => {
     const delay = Math.min(times * 50, 1000);
     return delay;
   },
   maxRetriesPerRequest: process.env.GITHUB_ACTIONS ? 50 : process.env.VITEST ? 5 : 20,
+  reconnectOnError: defaultReconnectOnError,
 };
 
 const logger = new Logger("Redis", "debug");
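The commit message notes that direct `new Redis()` callsites can opt in inline as a follow-up. One way that migration might look is sketched below. `ClientOptions` and `withReplyErrorReconnect` are hypothetical and not part of this commit, and the hook is redefined locally only so the sketch is self-contained; a real callsite would presumably import `defaultReconnectOnError` from `@internal/redis`.

```typescript
// Sketch only. ClientOptions and withReplyErrorReconnect are hypothetical;
// they stand in for the ioredis options object at a direct callsite.
type ReconnectOnError = (err: Error) => boolean | 1 | 2;

interface ClientOptions {
  host: string;
  port: number;
  reconnectOnError?: ReconnectOnError;
}

// Local copy of the hook from the diff, so the sketch runs without the package.
const defaultReconnectOnError: ReconnectOnError = (err) => {
  const msg = err.message ?? "";
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  return false;
};

// Adds the default hook unless the callsite already supplies its own.
function withReplyErrorReconnect(
  options: ClientOptions
): ClientOptions & { reconnectOnError: ReconnectOnError } {
  return {
    ...options,
    reconnectOnError: options.reconnectOnError ?? defaultReconnectOnError,
  };
}

const migrated = withReplyErrorReconnect({ host: "localhost", port: 6379 });
const custom = withReplyErrorReconnect({
  host: "localhost",
  port: 6379,
  reconnectOnError: () => false, // an explicit callsite choice is preserved
});
```

Under these assumptions, a callsite would pass the resulting object to the client constructor. Letting an explicit `reconnectOnError` win is a design choice of the sketch, not something the commit specifies.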
