fix(replication): prevent WAL exhaustion from slow consumers#3357

Merged
git-hulk merged 4 commits into apache:unstable from ethervoid:fix_replication_falling_behind_and_freeze
Feb 4, 2026
Conversation

@ethervoid
Contributor

The replication feed thread could block indefinitely when sending data to a slow replica. If the replica wasn't consuming data fast enough, the TCP send buffer would fill and the feed thread would block on write() with no timeout. During this time, WAL files would rotate and be pruned, leaving the replica's sequence unavailable when the thread eventually unblocked or the connection dropped.

This commit adds three mechanisms to address the issue:

  1. Socket send timeout: New SockSendWithTimeout() function that uses poll() to wait for socket writability with a configurable timeout (default 30 seconds). This prevents indefinite blocking.

  2. Replication lag detection: At the start of each loop iteration, check if the replica has fallen too far behind (configurable via max-replication-lag, default 100M sequences). If exceeded, disconnect the slow consumer before WAL is exhausted, allowing psync on reconnect.

  3. Exponential backoff on reconnection: When a replica is disconnected, it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before reconnecting. This prevents rapid reconnection loops for persistently slow replicas. The backoff resets on successful psync or fullsync.

New configuration options:

  • max-replication-lag: Maximum sequence lag before disconnecting (default: 100M)
  • replication-send-timeout-ms: Socket send timeout in ms (default: 30000)

Fixes #3356
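The poll()-based send in mechanism (1) can be sketched as follows. This is a minimal illustration of the technique, not the PR's actual SockSendWithTimeout; the name and signature here are assumptions:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical sketch of a poll()-based send with timeout (not the PR's
// exact implementation). Returns bytes written, or -1 on error; a timeout
// sets errno to ETIMEDOUT and returns -1 instead of blocking forever.
ssize_t SendWithTimeout(int fd, const char *buf, size_t len, int timeout_ms) {
  size_t sent = 0;
  while (sent < len) {
    pollfd pfd{fd, POLLOUT, 0};
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) {  // socket never became writable: slow consumer
      errno = ETIMEDOUT;
      return -1;
    }
    if (ready < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal, retry
      return -1;
    }
    ssize_t n = write(fd, buf + sent, len - sent);
    if (n < 0) {
      if (errno == EINTR || errno == EAGAIN) continue;
      return -1;
    }
    sent += static_cast<size_t>(n);
  }
  return static_cast<ssize_t>(sent);
}
```

The key difference from a plain blocking write() is the bounded wait for POLLOUT: when the replica's TCP receive window stays full past the timeout, the feed thread gets control back instead of hanging.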

@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch 7 times, most recently from c6c48b7 to 696eaf7 on January 30, 2026 21:42
@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch from 696eaf7 to fcac438 on February 2, 2026 09:34
The replication feed thread could block indefinitely when sending data
to a slow replica. If the replica wasn't consuming data fast enough,
the TCP send buffer would fill and the feed thread would block on
write() with no timeout. During this time, WAL files would rotate and
be pruned, leaving the replica's sequence unavailable when the thread
eventually unblocked or the connection dropped.

This commit adds three mechanisms to address the issue:

1. Socket send timeout: New SockSendWithTimeout() function that uses
   poll() to wait for socket writability with a configurable timeout
   (default 30 seconds). This prevents indefinite blocking.

2. Replication lag detection: At the start of each loop iteration,
   check if the replica has fallen too far behind (configurable via
   max-replication-lag). If exceeded, disconnect the slow consumer
   before WAL is exhausted, allowing psync on reconnect.
   Disabled by default (0), set to a positive value to enable.

3. Exponential backoff on reconnection: When a replica is disconnected,
   it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before
   reconnecting. This prevents rapid reconnection loops for persistently
   slow replicas. The backoff resets on successful psync or fullsync.

New configuration options:
- max-replication-lag: Maximum sequence lag before disconnecting (default: 0 = disabled)
- replication-send-timeout-ms: Socket send timeout in ms (default: 30000)

Fixes apache#3356
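The lag check in (2) and the backoff schedule in (3) reduce to a few lines; a hedged sketch with illustrative names, not the PR's actual identifiers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the lag check: disconnect when the replica has fallen more
// than max_lag sequences behind (0 = disabled). Illustrative only.
bool ShouldDisconnect(uint64_t master_seq, uint64_t replica_seq, uint64_t max_lag) {
  return max_lag > 0 && master_seq > replica_seq && master_seq - replica_seq > max_lag;
}

// Sketch of the reconnect backoff: 1s, 2s, 4s, ... capped at 60s.
uint64_t NextBackoffSecs(int attempt) {           // attempt: 0, 1, 2, ...
  uint64_t delay = 1ULL << std::min(attempt, 6);  // 2^attempt, clamped before the cap
  return std::min<uint64_t>(delay, 60);           // never wait more than 60s
}
```

A caller would reset the attempt counter to zero after a successful psync or fullsync, matching the behavior described above.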
@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch from fcac438 to c392ce7 on February 2, 2026 11:43
@git-hulk git-hulk requested a review from torwig February 3, 2026 02:08
@git-hulk
Member

git-hulk commented Feb 3, 2026

@PragmaTwice @torwig @caipengbo This PR would be helpful for users to identify what's going wrong when the replication is broken.

Member

@PragmaTwice PragmaTwice left a comment


The code generally looks good to me.

@PragmaTwice
Member

PragmaTwice commented Feb 3, 2026

@sryanyuan Would you like to give a review on this PR since you were working on #3340?

ethervoid and others added 2 commits February 3, 2026 11:28
Adjust whitespace alignment of comments for max_replication_lag and
replication_send_timeout_ms to satisfy clang-format-18 requirements.
@sryanyuan
Contributor

Would you like to give a review on this PR since you were working on #3340?

I’ll take a look and share any feedback I have.

@sryanyuan
Contributor

I’ve gone through the changes and here are some observations and suggestions:

  • Master not logging send errors

    Currently, the master does not log any send errors until WAL sequence continuation fails.

    This seems to be caused by slow network transmission — each send operation takes a long time to complete, which accumulates delay over time. Eventually, WAL entries are cleaned up before the slave can catch up.

  • Replication lag detection config

    The newly added configuration for replication lag detection to proactively disconnect a slave might help in some cases, but it may not fully solve the slow transmission problem.

  • Potential issue on the slave side

    One major risk I see is that on the slave side, a half-open connection can remain for a long time before triggering a timeout and reconnect, which eventually leads to continuation failure.

    Adding a read timeout on the slave side could help mitigate this scenario.

  • Master-side send timeout & efficiency improvements

    Adding a send timeout on the master side could also help in disconnecting half-open slave connections earlier.

    However, to truly improve transmission efficiency and avoid this situation, techniques such as compressing WAL logs before sending might be worth considering.

Overall, the changes go in the right direction for detecting and handling lag earlier, but I think addressing connection timeout handling (both master and slave) and optimizing WAL transmission could make the solution more robust.
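The slave-side read timeout suggested above could be implemented with SO_RCVTIMEO; a sketch assuming a plain POSIX socket (illustrative, not code from this PR):

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Sketch: give the replication socket a receive timeout so a half-open
// connection fails the next blocking read instead of hanging indefinitely.
bool SetRecvTimeout(int fd, int timeout_ms) {
  timeval tv{};
  tv.tv_sec = timeout_ms / 1000;
  tv.tv_usec = (timeout_ms % 1000) * 1000;
  return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}
```

With this set, a read() on a dead or stalled master returns -1 with errno EAGAIN/EWOULDBLOCK after the timeout, letting the slave tear down the connection and reconnect.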

@git-hulk
Member

git-hulk commented Feb 3, 2026

This PR also looks good to me. @ethervoid We need to fix this lint error in CI.

The TestClusterReset test was failing on macOS ARM because the slot
migration completed before the test could observe the "start" state.

Reduce migrate-speed from 128 to 64 and increase data size from 1024
to 2048 elements to ensure the migration takes long enough to observe
intermediate states on fast hardware (~32 seconds vs ~8 seconds).
@ethervoid ethervoid force-pushed the fix_replication_falling_behind_and_freeze branch from 3c00661 to 6f2ed24 on February 3, 2026 12:58
@sonarqubecloud

sonarqubecloud bot commented Feb 3, 2026

@ethervoid
Contributor Author

@git-hulk All the tests are passing now

@sryanyuan Thank you for the thorough report 🙇 . I'll comment on your feedback

Master not logging send errors

Currently, the master does not log any send errors until WAL sequence continuation fails.

This seems to be caused by slow network transmission — each send operation takes a long time to complete, which accumulates delay over time. Eventually, WAL entries are cleaned up before the slave can catch up.

I think this is addressed by the changes in this PR. In the SockSendWithTimeout method, an error is logged when the send fails, including on timeouts.

Replication lag detection config

The newly added configuration for replication lag detection to proactively disconnect a slave might help in some cases, but it may not fully solve the slow transmission problem.

Yeah, agreed. This change fixes the symptom but doesn't fix the root cause, which could be outside the application's control.

Potential issue on the slave side

One major risk I see is that on the slave side, a half-open connection can remain for a long time before triggering a timeout and reconnect, which eventually leads to continuation failure.

I can work on another PR to include slave-side timeouts too

Adding a read timeout on the slave side could help mitigate this scenario.

Master-side send timeout & efficiency improvements

Adding a send timeout on the master side could also help in disconnecting half-open slave connections earlier.

However, to truly improve transmission efficiency and avoid this situation, techniques such as compressing WAL logs before sending might be worth considering.

Agreed. Sending the WAL batches compressed would be a good feature.

@git-hulk git-hulk merged commit 559a667 into apache:unstable Feb 4, 2026
39 checks passed


Development

Successfully merging this pull request may close these issues.

Replication feed thread blocks indefinitely on slow consumer, causing WAL exhaustion and forced fullsync
