fix(replication): prevent WAL exhaustion from slow consumers#3357

Merged
git-hulk merged 4 commits into apache:unstable from ethervoid:fix_replication_falling_behind_and_freeze
Feb 4, 2026
Conversation

@ethervoid
Contributor

The replication feed thread could block indefinitely when sending data to a slow replica. If the replica wasn't consuming data fast enough, the TCP send buffer would fill and the feed thread would block on write() with no timeout. During this time, WAL files would rotate and be pruned, leaving the replica's sequence unavailable when the thread eventually unblocked or the connection dropped.

This commit adds three mechanisms to address the issue:

  1. Socket send timeout: New SockSendWithTimeout() function that uses poll() to wait for socket writability with a configurable timeout (default 30 seconds). This prevents indefinite blocking.

  2. Replication lag detection: At the start of each loop iteration, check if the replica has fallen too far behind (configurable via max-replication-lag, default 100M sequences). If exceeded, disconnect the slow consumer before WAL is exhausted, allowing psync on reconnect.

  3. Exponential backoff on reconnection: When a replica is disconnected, it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before reconnecting. This prevents rapid reconnection loops for persistently slow replicas. The backoff resets on successful psync or fullsync.

New configuration options:

  • max-replication-lag: Maximum sequence lag before disconnecting (default: 100M)
  • replication-send-timeout-ms: Socket send timeout in ms (default: 30000)

Fixes #3356
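The poll()-based send in mechanism (1) can be sketched as follows. This is a minimal illustration of the technique, not the PR's actual SockSendWithTimeout; the name and signature here are assumptions:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical sketch of a poll()-based send with timeout (not the PR's
// exact implementation). Returns bytes written, or -1 on error; a timeout
// sets errno to ETIMEDOUT and returns -1 instead of blocking forever.
ssize_t SendWithTimeout(int fd, const char *buf, size_t len, int timeout_ms) {
  size_t sent = 0;
  while (sent < len) {
    pollfd pfd{fd, POLLOUT, 0};
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) {  // socket never became writable: slow consumer
      errno = ETIMEDOUT;
      return -1;
    }
    if (ready < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal, retry
      return -1;
    }
    ssize_t n = write(fd, buf + sent, len - sent);
    if (n < 0) {
      if (errno == EINTR || errno == EAGAIN) continue;
      return -1;
    }
    sent += static_cast<size_t>(n);
  }
  return static_cast<ssize_t>(sent);
}
```

The key difference from a plain blocking write() is the bounded wait for POLLOUT: when the replica's TCP receive window stays full past the timeout, the feed thread gets control back instead of hanging.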

@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch 7 times, most recently from c6c48b7 to 696eaf7 on January 30, 2026 21:42
@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch from 696eaf7 to fcac438 on February 2, 2026 09:34
The replication feed thread could block indefinitely when sending data
to a slow replica. If the replica wasn't consuming data fast enough,
the TCP send buffer would fill and the feed thread would block on
write() with no timeout. During this time, WAL files would rotate and
be pruned, leaving the replica's sequence unavailable when the thread
eventually unblocked or the connection dropped.

This commit adds three mechanisms to address the issue:

1. Socket send timeout: New SockSendWithTimeout() function that uses
   poll() to wait for socket writability with a configurable timeout
   (default 30 seconds). This prevents indefinite blocking.

2. Replication lag detection: At the start of each loop iteration,
   check if the replica has fallen too far behind (configurable via
   max-replication-lag). If exceeded, disconnect the slow consumer
   before WAL is exhausted, allowing psync on reconnect.
   Disabled by default (0), set to a positive value to enable.

3. Exponential backoff on reconnection: When a replica is disconnected,
   it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before
   reconnecting. This prevents rapid reconnection loops for persistently
   slow replicas. The backoff resets on successful psync or fullsync.

New configuration options:
- max-replication-lag: Maximum sequence lag before disconnecting (default: 0 = disabled)
- replication-send-timeout-ms: Socket send timeout in ms (default: 30000)

Fixes apache#3356
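The lag check in (2) and the backoff schedule in (3) reduce to a few lines; a hedged sketch with illustrative names, not the PR's actual identifiers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the lag check: disconnect when the replica has fallen more
// than max_lag sequences behind (0 = disabled). Illustrative only.
bool ShouldDisconnect(uint64_t master_seq, uint64_t replica_seq, uint64_t max_lag) {
  return max_lag > 0 && master_seq > replica_seq && master_seq - replica_seq > max_lag;
}

// Sketch of the reconnect backoff: 1s, 2s, 4s, ... capped at 60s.
uint64_t NextBackoffSecs(int attempt) {           // attempt: 0, 1, 2, ...
  uint64_t delay = 1ULL << std::min(attempt, 6);  // 2^attempt, clamped before the cap
  return std::min<uint64_t>(delay, 60);           // never wait more than 60s
}
```

A caller would reset the attempt counter to zero after a successful psync or fullsync, matching the behavior described above.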
@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch from fcac438 to c392ce7 on February 2, 2026 11:43
@git-hulk git-hulk requested a review from torwig February 3, 2026 02:08
@git-hulk
Member

git-hulk commented Feb 3, 2026

@PragmaTwice @torwig @caipengbo This PR would be helpful for users to identify what's going wrong when the replication is broken.

Member

@PragmaTwice PragmaTwice left a comment


The code generally looks good to me.

@PragmaTwice
Member

PragmaTwice commented Feb 3, 2026

@sryanyuan Would you like to give a review on this PR since you were working on #3340?

ethervoid and others added 2 commits February 3, 2026 11:28
Adjust whitespace alignment of comments for max_replication_lag and
replication_send_timeout_ms to satisfy clang-format-18 requirements.
@sryanyuan
Contributor

Would you like to give a review on this PR since you were working on #3340?

I’ll take a look and share any feedback I have.

@sryanyuan
Contributor

I’ve gone through the changes and here are some observations and suggestions:

  • Master not logging send errors

    Currently, the master does not log any send errors until WAL sequence continuation fails.

    This seems to be caused by slow network transmission — each send operation takes a long time to complete, which accumulates delay over time. Eventually, WAL entries are cleaned up before the slave can catch up.

  • Replication lag detection config

    The newly added configuration for replication lag detection to proactively disconnect a slave might help in some cases, but it may not fully solve the slow transmission problem.

  • Potential issue on the slave side

    One major risk I see is that on the slave side, a half-open connection can remain for a long time before triggering a timeout and reconnect, which eventually leads to continuation failure.

    Adding a read timeout on the slave side could help mitigate this scenario.

  • Master-side send timeout & efficiency improvements

    Adding a send timeout on the master side could also help in disconnecting half-open slave connections earlier.

    However, to truly improve transmission efficiency and avoid this situation, techniques such as compressing WAL logs before sending might be worth considering.

Overall, the changes go in the right direction for detecting and handling lag earlier, but I think addressing connection timeout handling (both master and slave) and optimizing WAL transmission could make the solution more robust.
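The slave-side read timeout suggested above could be implemented with SO_RCVTIMEO; a sketch assuming a plain POSIX socket (illustrative, not code from this PR):

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Sketch: give the replication socket a receive timeout so a half-open
// connection fails the next blocking read instead of hanging indefinitely.
bool SetRecvTimeout(int fd, int timeout_ms) {
  timeval tv{};
  tv.tv_sec = timeout_ms / 1000;
  tv.tv_usec = (timeout_ms % 1000) * 1000;
  return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}
```

With this set, a read() on a dead or stalled master returns -1 with errno EAGAIN/EWOULDBLOCK after the timeout, letting the slave tear down the connection and reconnect.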

@git-hulk
Member

git-hulk commented Feb 3, 2026

This PR also looks good to me. @ethervoid We need to fix this lint error in CI.

The TestClusterReset test was failing on macOS ARM because the slot
migration completed before the test could observe the "start" state.

Reduce migrate-speed from 128 to 64 and increase data size from 1024
to 2048 elements to ensure the migration takes long enough to observe
intermediate states on fast hardware (~32 seconds vs ~8 seconds).
@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch from 3c00661 to 6f2ed24 on February 3, 2026 12:58
@sonarqubecloud

sonarqubecloud bot commented Feb 3, 2026

@ethervoid
Contributor Author

@git-hulk All the tests are passing now

@sryanyuan Thank you for the thorough report 🙇 . I'll comment on your feedback

Master not logging send errors

Currently, the master does not log any send errors until WAL sequence continuation fails.

This seems to be caused by slow network transmission — each send operation takes a long time to complete, which accumulates delay over time. Eventually, WAL entries are cleaned up before the slave can catch up.

I think this is addressed by the changes in this PR. In the SockSendWithTimeout method, an error is logged when the send fails, including on timeouts.

Replication lag detection config

The newly added configuration for replication lag detection to proactively disconnect a slave might help in some cases, but it may not fully solve the slow transmission problem.

Yeah, agreed. This change fixes the symptom but doesn't fix the root cause, which could be outside the application's control.

Potential issue on the slave side

One major risk I see is that on the slave side, a half-open connection can remain for a long time before triggering a timeout and reconnect, which eventually leads to continuation failure.

I can work on another PR to include slave-side timeouts too

Adding a read timeout on the slave side could help mitigate this scenario.

Master-side send timeout & efficiency improvements

Adding a send timeout on the master side could also help in disconnecting half-open slave connections earlier.

However, to truly improve transmission efficiency and avoid this situation, techniques such as compressing WAL logs before sending might be worth considering.

Agreed. Sending the WAL batches compressed would be a good feature.

@git-hulk git-hulk merged commit 559a667 into apache:unstable Feb 4, 2026
39 checks passed


Development

Successfully merging this pull request may close these issues.

Replication feed thread blocks indefinitely on slow consumer, causing WAL exhaustion and forced fullsync
