Sync stream before destroying GPU representation in convert_to#103
Merged
rapids-bot[bot] merged 5 commits into NVIDIA:main on Apr 14, 2026
Conversation
GPU->HOST conversions may enqueue async copies on the provided stream. Synchronize the stream before the source GPU representation is destroyed to avoid reading freed memory or leaking buffers into process-lifetime static storage. Uses unique_lock to release the mutex before synchronizing, so other threads aren't blocked during the stream sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6b423cc to cd6a553
Adds a test that verifies convert_to synchronizes the stream before destroying the old GPU representation during GPU-to-HOST conversion. The test uses a custom converter that deliberately does not synchronize after async D2H copies, and an observed GPU representation whose destructor checks whether the CUDA event (recorded after the async copy) was already complete at destruction time. Without the stream sync in convert_to, the old representation is destroyed while the async copy is still in-flight, and the event query in the destructor reports incomplete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that convert_to releases the mutex before synchronizing the stream on GPU-to-HOST conversions, allowing other threads to access the batch during the sync. Uses a cudaLaunchHostFunc callback that sleeps for 50 ms to create a deterministic window. A second thread calls get_state() (which takes the mutex) and checks whether the CUDA event is still pending, proving the mutex was released before the stream sync completed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sync the stream whenever the old representation is on GPU, not just for GPU-to-HOST conversions. Future converters (e.g. GPU-to-DISK) or custom converters may also enqueue async reads from GPU source memory. Updated test names and comments to reflect the broader condition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dhruv9vats
reviewed
Apr 14, 2026
```cpp
_data = std::move(new_representation);

bool needs_sync =
    old_representation != nullptr && old_representation->get_current_tier() == memory::Tier::GPU;
```
Member
do we not want to cover the to GPU memory case too?
Conversions to GPU (e.g. HOST→GPU) also enqueue async copies that read from the source buffer. Sync the stream whenever either side is GPU to prevent use-after-free when the old representation is destroyed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
felipeblazing
approved these changes
Apr 14, 2026
Contributor
/merge
Summary
`data_batch::convert_to` may enqueue async operations on the provided CUDA stream that read from the source GPU memory. Previously, the source GPU representation was destroyed immediately after reassigning `_data`, risking use-after-free or leaking buffers into process-lifetime static storage. The fix synchronizes the stream first, and uses `unique_lock` instead of `lock_guard` to release the mutex before synchronizing, avoiding blocking other threads during the stream sync.

Test plan
- A test that `convert_to` syncs the stream before destroying the old GPU representation (fails without the fix: `synced_before_destroy = false`)
- A test that the mutex is released before the sync, so other threads can access the batch (fails with the sync inside the lock: `accessed_during_sync = false`)
- Both use a `cudaLaunchHostFunc` callback (50 ms sleep) for a deterministic window independent of GPU speed

Local test results
Only the correct implementation (early unlock + sync) passes both tests.
Raw test output
With fix (early unlock + sync):
Without fix (original code, no sync at all):
With sync inside lock (correct sync, but naive lock_guard):
How the tests work
Test 1 — sync before destroy: Registers a custom converter that enqueues a 4 MB async `cudaMemcpyDeviceToHost` into pinned memory and records a CUDA event, but does not synchronize. A custom `observed_gpu_representation` checks `cudaEventQuery` in its destructor. Without the sync, the event is still pending at destruction time.

Test 2 — mutex released before sync: Same converter setup, plus a `cudaLaunchHostFunc` callback that sleeps for 50 ms on the stream (creating a deterministic delay independent of GPU speed). A second thread calls `get_state()` (which acquires the mutex) and checks the event. With the early unlock, `get_state()` returns immediately while the 50 ms callback is still running. With the sync inside the lock, `get_state()` is blocked for the full 50 ms.

🤖 Created by Claude
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>