
WAL wrapper improvements #2934

Open
cody-littley wants to merge 17 commits into main from cody-littley/wal-improvements

Conversation

cody-littley commented Feb 19, 2026

Describe your changes and provide context

Ticket: https://linear.app/seilabs/issue/STO-361/improve-wal-library-correctness

This PR fixes several issues identified in this thread: https://sei-network-io.slack.com/archives/C08KPKN3SLT/p1771432603483039?thread_ts=1771432091.719279&cid=C08KPKN3SLT

  • Async vs. non-async write mode is now an explicit toggle in settings
    • Legacy callers that expect the write buffer size to control async behavior continue to get that behavior.
  • Added toggles for fsync mode and shallow copy mode.
  • Threading model altered. Instead of using mutexes, all work is performed on a background goroutine, and communication with this goroutine is mediated with channels.
    • Fixes identified thread safety issues.
    • This refactor is not the only way to fix the issues identified in the thread, so if people don't like this change I'm willing to roll back to the old mutex-based implementation.
    • My personal opinion: when it comes to reasoning about multi-threaded code, this pattern is easier to reason about than using mutexes. Since there is only one thread "doing things", it's trivial to convince yourself of its safety.
    • This threading change can sometimes lead to performance improvements. But for this specific case, I think it highly unlikely that this will have a performance impact. It's very unlikely that we will see mutex contention impact performance in this code, and so performance will likely be near identical to a mutex-based implementation.
  • Added behind-the-scenes batching. The top-level API remains identical, but if work builds up in the queue, the worker goroutine aggregates multiple writes into larger batches (see the sketch after this list).
    • Large impact for small writes in async mode: 2-3x throughput improvement for writes smaller than 10 KB.
    • Impact tapers off the larger the writes get. This is expected.
    • Extreme impact (orders of magnitude) when writing with fsync mode enabled, since fsync is super expensive and this lets us do fewer fsync operations.
      • For the consensus team's use case, this is unlikely to help much though, since they will be doing one WAL per lane and blocking on each write before proceeding to the next write.
      • If we move to all lanes in a single WAL, then consensus team will likely realize a large speedup from batching.
      • It's possible to make the consensus team's use case much faster, but to do so I'd need to completely change wal.go's API and semantics. I didn't think that appropriate in this PR.
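
To make the threading and batching bullets concrete, here is a minimal, self-contained sketch of a single-writer event loop with opportunistic batching. The names (writeRequest, writeChan, commit) are illustrative only and are not the identifiers used in wal.go; the real loop also handles truncation, pruning, and shutdown.

```go
package main

import (
	"context"
	"fmt"
)

// writeRequest pairs a payload with the channel its caller waits on.
type writeRequest struct {
	payload []byte
	errChan chan error
}

// mainLoop is the only goroutine that touches the underlying log, so no
// mutexes are needed. It greedily drains whatever has already queued up and
// commits it as a single batch (one batched write, and at most one fsync).
func mainLoop(ctx context.Context, writeChan <-chan *writeRequest, commit func([][]byte) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case first := <-writeChan:
			pending := []*writeRequest{first}
		drainLoop:
			for {
				select {
				case next := <-writeChan:
					pending = append(pending, next)
				default:
					break drainLoop
				}
			}
			batch := make([][]byte, len(pending))
			for i, req := range pending {
				batch[i] = req.payload
			}
			err := commit(batch)
			// Fan the single batch result out to every waiting caller.
			for _, req := range pending {
				req.errChan <- err
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	writeChan := make(chan *writeRequest, 1024)
	go mainLoop(ctx, writeChan, func(entries [][]byte) error {
		fmt.Printf("committed batch of %d entries\n", len(entries))
		return nil // a real implementation would write the batch to the log here
	})

	req := &writeRequest{payload: []byte("entry"), errChan: make(chan error, 1)}
	writeChan <- req
	if err := <-req.errChan; err != nil {
		fmt.Println("write failed:", err)
	}
}
```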

Testing performed to validate your change

I utilized @yzang2019's benchmark. My changes had no impact on performance in some cases, and had a strong positive impact in others.

[Screenshot: benchmark results, 2026-02-18]

github-actions bot commented Feb 19, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ✅ passed | ✅ passed | ✅ passed | Feb 19, 2026, 9:08 PM

codecov bot commented Feb 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.42%. Comparing base (7f57cdb) to head (87772b8).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files


@@             Coverage Diff             @@
##             main    #2934       +/-   ##
===========================================
+ Coverage   57.35%   68.42%   +11.07%     
===========================================
  Files        2095        5     -2090     
  Lines      172870      456   -172414     
===========================================
- Hits        99142      312    -98830     
+ Misses      64842      114    -64728     
+ Partials     8886       30     -8856     
Flag       | Coverage Δ
sei-chain  | ?
sei-cosmos | ?
sei-db     | 68.42% <ø> (ø)

Flags with carried forward coverage won't be shown.
see 2090 files with indirect coverage changes



// Legacy compatibility: originally, BufferSize > 0 implied async writes. Preserve that for
// callers who set BufferSize without explicitly setting AsyncWrites.
if config.BufferSize > 0 {
Contributor:

If we already have BufferSize to indicate async or not, why do we still need another config for AsyncWrites (true/false) here? Isn't that a redundant config?

Author:

Previously, we would only use the buffer when running in async mode. After my refactor, we always use the buffer regardless of whether we are running in async mode. This means we need to configure buffer size independently of async mode configuration.

My reasoning behind always using the buffer is twofold. It reduces complexity, since there is one fewer code pathway. Additionally, it allows for internal batching even when running in sync mode (although you need multiple writer goroutines to take advantage of batching in sync mode).

Author:

Let me know if you feel strongly about this. I can revert back to the old pattern of configuration without a huge rewrite.

Contributor:

Let me take a look at the new code path first to see how the buffer is being used now. Thanks for the explanation!

Author:

I already pushed a change to revert to the previous config pattern. With the old pattern you just lose the ability to configure the internal buffer size when running in sync mode. Perhaps not too much of a problem, since the internal buffer size is unlikely to be super important in sync mode.


bufferSize := config.BufferSize
if config.BufferSize == 0 {
bufferSize = 1024
Contributor:

Maybe have a const for these default values? Same for writeBatchSize.

Author:

Const added for both.
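
A sketch of what those named defaults might look like; the constant names and the batch-size value here are assumptions, not copied from the PR:

```go
// Illustrative defaults; the real names and values in the PR may differ.
const (
	defaultWriteBufferSize = 1024 // capacity of the request channel
	defaultWriteBatchSize  = 64   // max queued writes merged per batch (assumed value)
)

// applyDefault substitutes the fallback when the configured value is zero.
func applyDefault(configured, fallback int) int {
	if configured == 0 {
		return fallback
	}
	return configured
}
```

Usage would then read something like `bufferSize := applyDefault(config.BufferSize, defaultWriteBufferSize)`.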

PruneInterval time.Duration

// If true, the writes are asynchronous, and will return immediately without waiting for the write to be durable
AsyncWrites bool
Contributor:

I feel it could be confusing to have two configs controlling one feature; users won't know which one dominates without looking into the code. I'd suggest keeping only one here, with bufferSize > 0 indicating whether we are doing async writes or not.

Author:

Config reverted to a single setting, as suggested.

func (walLog *WAL[T]) mainLoop() {

pruneInterval := walLog.config.PruneInterval
if pruneInterval < time.Second {
yzang2019 (Contributor) commented Feb 19, 2026:

What about the case where we want to disable auto pruning? Previously the assumption was that pruneInterval <= 0 disables pruning, and we still need to support that case, right?

Author:

This is still supported, but I can see how it is confusing in its current form. Currently, prune() becomes a noop if it is disabled:

func (walLog *WAL[T]) prune() {
	keepRecent := walLog.config.KeepRecent
	if keepRecent <= 0 || walLog.config.PruneInterval <= 0 {
		// pruning is disabled
		return
	}

I've refactored the code to make it more obvious when pruning is disabled:

	var pruneChan <-chan time.Time
	if walLog.config.PruneInterval > 0 && walLog.config.KeepRecent > 0 {
		pruneTicker := time.NewTicker(walLog.config.PruneInterval)
		defer pruneTicker.Stop()
		pruneChan = pruneTicker.C
	}

// ...

		case <-pruneChan:
			walLog.prune()

pruneChan is nil if pruning is not configured. Receiving from a nil channel is legal in Go; that case simply never fires in the select.
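
As a small standalone illustration of that nil-channel behavior (not code from the PR):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Receiving from a nil channel blocks forever, so a nil case in a select
	// is simply never chosen. This is the trick used to "disable" pruning.
	var pruneChan <-chan time.Time // nil: pruning disabled
	work := make(chan string, 1)
	work <- "write"

	select {
	case <-pruneChan:
		fmt.Println("never reached while pruneChan is nil")
	case msg := <-work:
		fmt.Println("handled:", msg)
	}
}
```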

blindchaser (Contributor) commented:

I like the single-writer event loop design for writes/truncates.

one question, all reads (ReadAt, Replay, FirstOffset, LastOffset) are now also serialized through the mainLoop. In the old code these were direct calls to tidwall/wal. Have you considered keeping reads as direct calls? It would eliminate ~5 request/response types and simplify the code. I can see the argument for keeping them serialized (avoids read/truncate interleave), but given that reads are mostly only called at startup, curious about your thinking here.

cody-littley (Author) commented:

> one question, all reads (ReadAt, Replay, FirstOffset, LastOffset) are now also serialized through the mainLoop. In the old code these were direct calls to tidwall/wal. Have you considered keeping reads as direct calls? It would eliminate ~5 request/response types and simplify the code. I can see the argument for keeping them serialized (avoids read/truncate interleave), but given that reads are mostly only called at startup, curious about your thinking here.

Converted the read methods back into direct calls.

}

// drain pending work, then tear down
walLog.drain()
Contributor:

Shall we do this instead to be safer?
walLog.cancel() // 1. stop accepting new requests
walLog.drain() // 2. finish in-flight work already in the buffer

Contributor:

Between drain() returning and cancel() being called, the context is still active. A concurrent caller (e.g. Write()) can successfully send a request into the buffered channel because the context check in Write() passes, but mainLoop has already stopped reading, so that request is silently dropped and its errChan never receives a response.

Author:

Note that even if a caller manages to submit a message to the buffer after Close() is called but before the context is cancelled, they will not block forever waiting for a response. When waiting for a response, we abort early if the context is cancelled:

	select {
	case _, ok := <-walLog.ctx.Done():
		if !ok {
			return fmt.Errorf("WAL was closed after write was submitted but before write was finalized, " +
				"write may or may not be durable")
		}
	case err := <-req.errChan:

Author:

That being said, I don't think there is any harm in cancelling the context before draining additional work. Change made.
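
For reference, a standalone sketch of that ordering (cancel first, then drain); the names here (worker, requests, shutdown) are hypothetical and not the ones used in wal.go:

```go
package main

import (
	"context"
	"fmt"
)

// worker stands in for the WAL's background goroutine.
type worker struct {
	cancel   context.CancelFunc
	requests chan string
}

func (w *worker) shutdown() {
	// 1. Cancel first: submitters now see ctx.Done() and stop enqueueing.
	w.cancel()
	// 2. Drain whatever made it into the buffer before cancellation.
	for {
		select {
		case req := <-w.requests:
			fmt.Println("flushing:", req)
		default:
			return
		}
	}
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	w := &worker{cancel: cancel, requests: make(chan string, 16)}
	w.requests <- "pending write"
	w.shutdown()
}
```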

}

select {
case _, ok := <-walLog.ctx.Done():
Contributor:

I think we usually just do this:
case <-walLog.ctx.Done():
return err

Author:

I'm not sure I follow. In this specific case, there is no pre-existing err to return.

In order to make this code easier to read, I created some helper methods for inserting and removing things from channels. Eventually I'll probably extract these methods to a utility package, but for the time being I put them inside wal.go since there are no other callers.

// Push to a channel, returning an error if the context is cancelled before the value is pushed.
func interuptablePush[T any](ctx context.Context, ch chan T, value T) error {
	select {
	case <-ctx.Done():
		return fmt.Errorf("context cancelled: %w", ctx.Err())
	case ch <- value:
		return nil
	}
}

// Pull from a channel, returning an error if the context is cancelled before the value is pulled.
func interuptablePull[T any](ctx context.Context, ch <-chan T) (T, error) {
	var zero T
	select {
	case <-ctx.Done():
		return zero, fmt.Errorf("context cancelled: %w", ctx.Err())
	case value, ok := <-ch:
		if !ok {
			return zero, fmt.Errorf("channel closed")
		}
		return value, nil
	}
}

yzang2019 (Contributor) left a comment:

Overall LGTM!

asyncWriteErrCh: make(chan error, 1),

bufferSize := config.WriteBufferSize
if config.WriteBufferSize == 0 {
Contributor:

nit: should this be config.WriteBufferSize <= 0? I think memiavl sometimes passes -1 to mean sync mode. With == 0, negative values can still flow into make(chan, size) and panic.

Contributor:

Yeah, seems like it's breaking some unit tests.

Author:

Good catch, fixed
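
For clarity, a sketch of the guard under discussion, assuming a default constant (the name defaultWriteBufferSize is illustrative):

```go
// sanitizeBufferSize treats any non-positive WriteBufferSize as "use the
// default" (the reviewer notes memiavl may pass -1 to mean sync mode), so a
// negative value can never reach make(chan T, size) and panic.
func sanitizeBufferSize(configured int) int {
	const defaultWriteBufferSize = 1024 // illustrative default
	if configured <= 0 {
		return defaultWriteBufferSize
	}
	return configured
}
```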

bdchatham left a comment:

I'm aligned with the channel-based approach.


// Shut down the WAL. Sends a close request to the main loop so in-flight writes (and other work)
// can complete before teardown. Idempotent.
func (walLog *WAL[T]) Close() error {

Reviewer:

nit: this isn't really thread safe, although the current call pattern is safe. It could be more defensive/explicit to use sync.Once. Two racing threads could try to put multiple entries into the channel; one will hang.

cody-littley (Author) commented Feb 19, 2026:

I think this is thread safe even if you have concurrent Close() calls. The first caller will send a message to the background thread. Other callers may become temporarily blocked. But once the background thread cancels the context, all other callers to Close() will stop waiting.

I added a unit test to verify this behavior.

	// Calling close lots of times shouldn't cause any problems.
	for i := 0; i < 10; i++ {
		require.NoError(t, changelog.Close())
	}

Reviewer:

Calling it serially is definitely no problem.

The callout is that the thread safety of Close() implicitly depends on properly handling closeErrChan. If you protect it differently, you don't have this implicit dependency.

Not a critical problem but wanted to mention the alternative.
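
For reference, a standalone sketch of the sync.Once alternative mentioned above; all names here are hypothetical, and the teardown body is a stand-in for the real cancel/drain/close sequence:

```go
package main

import (
	"fmt"
	"sync"
)

// walSketch shows the sync.Once pattern: only the first Close performs the
// teardown, so concurrent or repeated calls never race on the close channel.
type walSketch struct {
	closeOnce sync.Once
	closeErr  error
}

func (w *walSketch) Close() error {
	w.closeOnce.Do(func() {
		// Stand-in for the real teardown: cancel the context, drain, close the log.
		w.closeErr = nil
	})
	return w.closeErr
}

func main() {
	w := &walSketch{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(w.Close()) // every caller gets the same result
		}()
	}
	wg.Wait()
}
```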

lastOffset++
}

if err := walLog.log.WriteBatch(batch); err != nil {

Reviewer:

Seems problematic to return the error to all error channels, since some of the writes may have succeeded. If the WAL implementation spans multiple segments, some sub-batches could be written successfully while only the last segment returns an error.

Author:

I'm not sure it is important to differentiate which writes became durable on disk and which ones didn't within a batch. In general, I'm not sure that this is a recoverable error if the WAL stops being able to accept new writes, since we're probably bumping into something like a full disk.

Reviewer:

Hmm, yeah. I don't have as much context as I'd like. I don't know entirely how the WAL is used, but my main concern is the possibility of writing something to it while reporting an error to the caller, and what that caller will do with that signal.
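
For concreteness, a sketch of the fan-out semantics being discussed; the types and names are illustrative, reusing the writeRequest shape from the earlier sketch:

```go
// writeRequest is the same illustrative shape as in the earlier sketch.
type writeRequest struct {
	payload []byte
	errChan chan error
}

// commitBatch persists many queued requests with a single batched write, then
// reports the same result (nil or error) to every caller in the batch, even
// if the underlying log persisted some sub-batches before failing.
func commitBatch(reqs []*writeRequest, writeBatch func([][]byte) error) {
	entries := make([][]byte, len(reqs))
	for i, r := range reqs {
		entries[i] = r.payload
	}
	err := writeBatch(entries) // e.g. fails partway through with a disk-full error
	for _, r := range reqs {
		r.errChan <- err // every waiter sees the same result
	}
}
```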

truncateChan: make(chan *truncateRequest, bufferSize),
}

go w.mainLoop()

Check notice (Code scanning / CodeQL): Spawning a Go routine may be a possible source of non-determinism.