
Sync stream before destroying GPU representation in convert_to #103

Merged
rapids-bot[bot] merged 5 commits into NVIDIA:main from mbrobbel:sync-gpu-to-host-conversion
Apr 14, 2026

Conversation

@mbrobbel (Member) commented Apr 13, 2026

Summary

  • Conversions from GPU in data_batch::convert_to may enqueue async operations on the provided CUDA stream that read from the source GPU memory. Previously, the source GPU representation was destroyed immediately after reassigning _data, risking use-after-free or leaking buffers into process-lifetime static storage.
  • Synchronize the stream before the old GPU representation is destroyed to ensure async operations complete.
  • Use unique_lock instead of lock_guard to release the mutex before synchronizing, avoiding blocking other threads during the stream sync.
  • The sync condition triggers for any conversion from a GPU source, not just GPU→HOST — this covers future converters (e.g. GPU→DISK) and custom converters that may enqueue async reads without synchronizing internally.

Test plan

  • New test: stream is synchronized before old GPU representation is destroyed
    • Passes with fix, fails without fix (synced_before_destroy = false)
  • New test: mutex is released before stream sync, allowing concurrent access
    • Passes with early unlock, fails when sync is inside the lock (accessed_during_sync = false)
    • Uses cudaLaunchHostFunc callback (50 ms sleep) for a deterministic window independent of GPU speed
  • Without any fix: first test fails (no sync), second test passes (no lock contention since no sync)
  • Verify full test suite passes

Local test results

| Configuration | Test 1 (sync before destroy) | Test 2 (mutex released during sync) |
| --- | --- | --- |
| With fix (early unlock + sync) | ✅ Pass | ✅ Pass |
| No fix (original code) | ❌ Fail — no sync at all | ✅ Pass — no lock contention since no sync |
| Sync inside lock (naive fix) | ✅ Pass — sync happens | ❌ Fail — mutex held during 50 ms sync |

Only the correct implementation (early unlock + sync) passes both tests.

Raw test output

With fix (early unlock + sync):

```
Filters: [convert_to]
===============================================================================
All tests passed (3 assertions in 2 test cases)
```

Without fix (original code, no sync at all):

```
/home/matthijs/sirius/cucascade/test/data/test_data_batch.cpp:1753: FAILED:
  REQUIRE( observer.synced_before_destroy )
with expansion:
  false

test cases: 2 | 1 passed | 1 failed
assertions: 3 | 2 passed | 1 failed
```

With sync inside lock (correct sync, but naive lock_guard):

```
/home/matthijs/sirius/cucascade/test/data/test_data_batch.cpp:1849: FAILED:
  REQUIRE( accessed_during_sync )
with expansion:
  false

test cases: 2 | 1 passed | 1 failed
assertions: 2 | 1 passed | 1 failed
```

How the tests work

Test 1 — sync before destroy: Registers a custom converter that enqueues a 4 MB async cudaMemcpyDeviceToHost into pinned memory and records a CUDA event, but does not synchronize. A custom observed_gpu_representation checks cudaEventQuery in its destructor. Without the sync, the event is still pending at destruction time.

Test 2 — mutex released before sync: Same converter setup, plus a cudaLaunchHostFunc callback that sleeps for 50 ms on the stream (creating a deterministic delay independent of GPU speed). A second thread calls get_state() (which acquires the mutex) and checks the event. With the early unlock, get_state() returns immediately while the 50 ms callback is still running. With the sync inside the lock, get_state() is blocked for the full 50 ms.

🤖 Created by Claude

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

GPU->HOST conversions may enqueue async copies on the provided stream.
Synchronize the stream before the source GPU representation is destroyed
to avoid reading freed memory or leaking buffers into process-lifetime
static storage.

Uses unique_lock to release the mutex before synchronizing, so other
threads aren't blocked during the stream sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mbrobbel and others added 2 commits April 13, 2026 12:35
Adds a test that verifies convert_to synchronizes the stream before
destroying the old GPU representation during GPU-to-HOST conversion.

The test uses a custom converter that deliberately does not synchronize
after async D2H copies, and an observed GPU representation whose
destructor checks whether the CUDA event (recorded after the async copy)
was already complete at destruction time.

Without the stream sync in convert_to, the old representation is
destroyed while the async copy is still in-flight, and the event query
in the destructor reports incomplete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that convert_to releases the mutex before synchronizing the
stream on GPU-to-HOST conversions, allowing other threads to access the
batch during the sync.

Uses a cudaLaunchHostFunc callback that sleeps for 50 ms to create a
deterministic window. A second thread calls get_state() (which takes
the mutex) and checks whether the CUDA event is still pending, proving
the mutex was released before the stream sync completed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dhruv9vats (Member) left a comment


Thanks for catching this!

Sync the stream whenever the old representation is on GPU, not just for
GPU-to-HOST conversions. Future converters (e.g. GPU-to-DISK) or custom
converters may also enqueue async reads from GPU source memory.

Updated test names and comments to reflect the broader condition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```cpp
_data = std::move(new_representation);

bool needs_sync =
  old_representation != nullptr && old_representation->get_current_tier() == memory::Tier::GPU;
```
@dhruv9vats (Member) commented Apr 13, 2026


Do we not want to cover the to-GPU memory case too?

Conversions to GPU (e.g. HOST→GPU) also enqueue async copies that read
from the source buffer. Sync the stream whenever either side is GPU to
prevent use-after-free when the old representation is destroyed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@felipeblazing (Contributor)

/merge

@rapids-bot rapids-bot bot merged commit 5e3a637 into NVIDIA:main Apr 14, 2026
11 checks passed
@mbrobbel mbrobbel deleted the sync-gpu-to-host-conversion branch April 14, 2026 20:45