Skip to content

Fix shutdown race: abort background tasks before closing durability#4581

Open
clockwork-labs-bot wants to merge 1 commit intomasterfrom
bot/fix-shutdown-task-race
Open

Fix shutdown race: abort background tasks before closing durability#4581
clockwork-labs-bot wants to merge 1 commit intomasterfrom
bot/fix-shutdown-task-race

Conversation

@clockwork-labs-bot
Copy link
Collaborator

Summary

Fixes a race condition where the view_cleanup_task can panic with "durability actor vanished" during database shutdown, crashing the server on Windows.

Root Cause

The shutdown sequence in HostController::exit_module was:

  1. module.exit().await
  2. db.shutdown().await — closes the durability channel
  3. Host::drop — aborts background tasks (view cleanup, metrics)

The view_cleanup_task runs with_auto_commit() on a loop, which calls request_durability(). If the task fires between steps 2 and 3, request_durability() panics because the durability channel is already closed.

Fix

Abort all background tasks before calling db.shutdown(), so they cannot race with durability channel closure:

  1. module.exit().await
  2. Abort background tasks (view cleanup, disk metrics, tx metrics)
  3. db.shutdown().await — closes the durability channel

The tasks are still aborted again in Host::drop (no-op since already aborted).

Testing

This fixes flaky test_all_templates failures on Windows CI, such as:
https://github.com/clockworklabs/SpacetimeDB/actions/runs/22745918903/job/65969841716?pr=4376

The failure pattern: server panics at durability.rs:96 ("durability actor vanished"), server process dies, all subsequent template tests get connection refused.

The view_cleanup_task runs with_auto_commit() on a loop, which calls
request_durability(). If db.shutdown() closes the durability channel
before the task is aborted (in Host::drop), a request_durability()
call panics with 'durability actor vanished'. On Windows, this can
crash the server process.

Fix: abort all background tasks (view_cleanup, disk_metrics,
tx_metrics) before calling db.shutdown(), so they cannot race with
durability channel closure.

Fixes flaky test_all_templates failures on Windows CI.
@bfops bfops requested a review from kim March 6, 2026 22:16
Copy link
Collaborator

@bfops bfops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me 🤷

Copy link
Contributor

@kim kim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nonsense -- the panic doesn't crash the server, it only panics a tokio task. Like most of these bot analyses, the premise is just completely wrong.

That said, if we prefer the stack trace to go away for noisiness reasons, the right way to do that is to drop the host before shutting down the database.

It will not make the test failure go away.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants