
Sync stream before destroying GPU representation in convert_to #103

Merged
rapids-bot[bot] merged 5 commits into NVIDIA:main from mbrobbel:sync-gpu-to-host-conversion
Apr 14, 2026

Conversation

@mbrobbel (Member) commented Apr 13, 2026

Summary

  • Conversions from GPU in data_batch::convert_to may enqueue async operations on the provided CUDA stream that read from the source GPU memory. Previously, the source GPU representation was destroyed immediately after reassigning _data, risking use-after-free or leaking buffers into process-lifetime static storage.
  • Synchronize the stream before the old GPU representation is destroyed to ensure async operations complete.
  • Use unique_lock instead of lock_guard to release the mutex before synchronizing, avoiding blocking other threads during the stream sync.
  • The sync condition triggers for any conversion from a GPU source, not just GPU→HOST — this covers future converters (e.g. GPU→DISK) and custom converters that may enqueue async reads without synchronizing internally.

Test plan

  • New test: stream is synchronized before old GPU representation is destroyed
    • Passes with fix, fails without fix (synced_before_destroy = false)
  • New test: mutex is released before stream sync, allowing concurrent access
    • Passes with early unlock, fails when sync is inside the lock (accessed_during_sync = false)
    • Uses cudaLaunchHostFunc callback (50 ms sleep) for a deterministic window independent of GPU speed
  • Without any fix: first test fails (no sync), second test passes (no lock contention since no sync)
  • Verify full test suite passes

Local test results

| Configuration | Test 1 (sync before destroy) | Test 2 (mutex released during sync) |
| --- | --- | --- |
| With fix (early unlock + sync) | ✅ Pass | ✅ Pass |
| No fix (original code) | ❌ Fail — no sync at all | ✅ Pass — no lock contention since no sync |
| Sync inside lock (naive fix) | ✅ Pass — sync happens | ❌ Fail — mutex held during 50 ms sync |

Only the correct implementation (early unlock + sync) passes both tests.

Raw test output

With fix (early unlock + sync):

```
Filters: [convert_to]
===============================================================================
All tests passed (3 assertions in 2 test cases)
```

Without fix (original code, no sync at all):

```
/home/matthijs/sirius/cucascade/test/data/test_data_batch.cpp:1753: FAILED:
  REQUIRE( observer.synced_before_destroy )
with expansion:
  false

test cases: 2 | 1 passed | 1 failed
assertions: 3 | 2 passed | 1 failed
```

With sync inside lock (correct sync, but naive lock_guard):

```
/home/matthijs/sirius/cucascade/test/data/test_data_batch.cpp:1849: FAILED:
  REQUIRE( accessed_during_sync )
with expansion:
  false

test cases: 2 | 1 passed | 1 failed
assertions: 2 | 1 passed | 1 failed
```

How the tests work

Test 1 — sync before destroy: Registers a custom converter that enqueues a 4 MB async cudaMemcpyDeviceToHost into pinned memory and records a CUDA event, but does not synchronize. A custom observed_gpu_representation checks cudaEventQuery in its destructor. Without the sync, the event is still pending at destruction time.

Test 2 — mutex released before sync: Same converter setup, plus a cudaLaunchHostFunc callback that sleeps for 50 ms on the stream (creating a deterministic delay independent of GPU speed). A second thread calls get_state() (which acquires the mutex) and checks the event. With the early unlock, get_state() returns immediately while the 50 ms callback is still running. With the sync inside the lock, get_state() is blocked for the full 50 ms.

🤖 Created by Claude

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

GPU->HOST conversions may enqueue async copies on the provided stream.
Synchronize the stream before the source GPU representation is destroyed
to avoid reading freed memory or leaking buffers into process-lifetime
static storage.

Uses unique_lock to release the mutex before synchronizing, so other
threads aren't blocked during the stream sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mbrobbel and others added 2 commits April 13, 2026 12:35
Adds a test that verifies convert_to synchronizes the stream before
destroying the old GPU representation during GPU-to-HOST conversion.

The test uses a custom converter that deliberately does not synchronize
after async D2H copies, and an observed GPU representation whose
destructor checks whether the CUDA event (recorded after the async copy)
was already complete at destruction time.

Without the stream sync in convert_to, the old representation is
destroyed while the async copy is still in-flight, and the event query
in the destructor reports incomplete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that convert_to releases the mutex before synchronizing the
stream on GPU-to-HOST conversions, allowing other threads to access the
batch during the sync.

Uses a cudaLaunchHostFunc callback that sleeps for 50 ms to create a
deterministic window. A second thread calls get_state() (which takes
the mutex) and checks whether the CUDA event is still pending, proving
the mutex was released before the stream sync completed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dhruv9vats (Member) left a comment


Thanks for catching this!

Sync the stream whenever the old representation is on GPU, not just for
GPU-to-HOST conversions. Future converters (e.g. GPU-to-DISK) or custom
converters may also enqueue async reads from GPU source memory.

Updated test names and comments to reflect the broader condition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```cpp
_data = std::move(new_representation);

bool needs_sync =
  old_representation != nullptr && old_representation->get_current_tier() == memory::Tier::GPU;
```
@dhruv9vats (Member) commented Apr 13, 2026


Do we not want to cover the to-GPU memory case too?

Conversions to GPU (e.g. HOST→GPU) also enqueue async copies that read
from the source buffer. Sync the stream whenever either side is GPU to
prevent use-after-free when the old representation is destroyed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@felipeblazing (Contributor)

/merge

@rapids-bot rapids-bot bot merged commit 5e3a637 into NVIDIA:main Apr 14, 2026
11 checks passed
@mbrobbel mbrobbel deleted the sync-gpu-to-host-conversion branch April 14, 2026 20:45