fix(replication): prevent WAL exhaustion from slow consumers#3357
The replication feed thread could block indefinitely when sending data to a slow replica. If the replica wasn't consuming data fast enough, the TCP send buffer would fill and the feed thread would block on write() with no timeout. During this time, WAL files would rotate and be pruned, leaving the replica's sequence unavailable when the thread eventually unblocked or the connection dropped.

This commit adds three mechanisms to address the issue:

1. Socket send timeout: New SockSendWithTimeout() function that uses poll() to wait for socket writability with a configurable timeout (default 30 seconds). This prevents indefinite blocking.
2. Replication lag detection: At the start of each loop iteration, check if the replica has fallen too far behind (configurable via max-replication-lag). If exceeded, disconnect the slow consumer before WAL is exhausted, allowing psync on reconnect. Disabled by default (0), set to a positive value to enable.
3. Exponential backoff on reconnection: When a replica is disconnected, it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before reconnecting. This prevents rapid reconnection loops for persistently slow replicas. The backoff resets on successful psync or fullsync.

New configuration options:

- max-replication-lag: Maximum sequence lag before disconnecting (default: 0 = disabled)
- replication-send-timeout-ms: Socket send timeout in ms (default: 30000)

Fixes apache#3356
@PragmaTwice @torwig @caipengbo This PR would be helpful for users to identify what's going wrong when the replication is broken.
PragmaTwice
left a comment
The code generally looks good to me.
@sryanyuan Would you like to review this PR, since you were working on #3340?
Adjust whitespace alignment of comments for max_replication_lag and replication_send_timeout_ms to satisfy clang-format-18 requirements.
I’ll take a look and share any feedback I can.
I’ve gone through the changes and here are some observations and suggestions:
Overall, the changes go in the right direction for detecting and handling lag earlier, but I think addressing connection timeout handling (both master and slave) and optimizing WAL transmission could make the solution more robust.
This PR also looks good to me. @ethervoid We need to fix this lint error in CI.
The TestClusterReset test was failing on macOS ARM because the slot migration completed before the test could observe the "start" state. Reduce migrate-speed from 128 to 64 and increase data size from 1024 to 2048 elements to ensure the migration takes long enough to observe intermediate states on fast hardware (~32 seconds vs ~8 seconds).
@git-hulk All the tests are working.
@sryanyuan Thank you for the thorough report 🙇. I'll comment on your feedback:
I think this is addressed by the changes in this PR. The SockSendWithTimeout method logs an error when the send fails, including on timeouts.
Yeah, agreed. This change fixes the symptom but doesn't fix the root cause, which could be outside the application's control.
I can work on another PR to include slave-side timeouts too.
Agreed. Sending the WAL batches compressed would be a good feature.



The replication feed thread could block indefinitely when sending data to a slow replica. If the replica wasn't consuming data fast enough, the TCP send buffer would fill and the feed thread would block on write() with no timeout. During this time, WAL files would rotate and be pruned, leaving the replica's sequence unavailable when the thread eventually unblocked or the connection dropped.
This commit adds three mechanisms to address the issue:
Socket send timeout: New SockSendWithTimeout() function that uses poll() to wait for socket writability with a configurable timeout (default 30 seconds). This prevents indefinite blocking.
Replication lag detection: At the start of each loop iteration, check if the replica has fallen too far behind (configurable via max-replication-lag, default 100M sequences). If exceeded, disconnect the slow consumer before WAL is exhausted, allowing psync on reconnect.
Exponential backoff on reconnection: When a replica is disconnected, it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before reconnecting. This prevents rapid reconnection loops for persistently slow replicas. The backoff resets on successful psync or fullsync.
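The reconnection backoff described above (1s, 2s, 4s... capped at 60s, reset on a successful sync) could look roughly like this. The class `ReconnectBackoff` and its methods are hypothetical names for illustration, not the PR's actual code.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper (illustration only): tracks the delay a replica waits
// before its next reconnection attempt.
class ReconnectBackoff {
 public:
  ReconnectBackoff(std::chrono::seconds initial, std::chrono::seconds max)
      : initial_(initial), max_(max), next_(initial) {}

  // Returns the delay to wait before the next attempt, then doubles the
  // stored delay, never exceeding the cap.
  std::chrono::seconds NextDelay() {
    std::chrono::seconds current = next_;
    next_ = std::min<std::chrono::seconds>(next_ * 2, max_);
    return current;
  }

  // Called after a successful psync or fullsync so the next failure starts
  // again from the initial delay.
  void Reset() { next_ = initial_; }

 private:
  std::chrono::seconds initial_;
  std::chrono::seconds max_;
  std::chrono::seconds next_;
};
```

With an initial delay of 1 second and a cap of 60 seconds, successive calls to `NextDelay()` yield 1, 2, 4, 8, 16, 32, 60, 60, and so on until `Reset()` is called.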
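New configuration options:
- max-replication-lag: Maximum sequence lag before disconnecting
- replication-send-timeout-ms: Socket send timeout in ms (default: 30000)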
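The check gated by max-replication-lag can be sketched as a small predicate evaluated at the start of each feed loop iteration. The function name and parameters below are hypothetical, and treating a value of 0 as "disabled" is an assumption taken from the commit message earlier in this conversation.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only, not the PR's code).
// latest_seq:        latest sequence number written on the master
// replica_next_seq:  next sequence number the replica still needs
// max_lag:           configured threshold; 0 is assumed to disable the check
// Returns true when the replica is too far behind and should be disconnected
// before the WAL it needs is pruned.
bool ExceedsMaxReplicationLag(uint64_t latest_seq, uint64_t replica_next_seq,
                              uint64_t max_lag) {
  if (max_lag == 0) return false;  // check disabled
  // Guard against unsigned underflow when the replica is fully caught up.
  if (replica_next_seq >= latest_seq) return false;
  return latest_seq - replica_next_seq > max_lag;
}
```

Disconnecting while the needed sequence is still in the WAL is what lets the replica resume with a psync on reconnect instead of requiring a full sync.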
Fixes #3356