Add cuda_buffer_backend and torch_buffer_backend for rosidl::Buffer#1
Conversation
```cpp
{
  (void)endpoint_info;
  (void)existing_endpoints;
  (void)endpoint_supported_backends;
```
Check whether the backend exists in `endpoint_supported_backends`?
Thanks for the catch. `on_discovering_endpoint` now checks whether `torch` is present in `endpoint_supported_backends` and returns false if not, which lets the RMW fall back to the default CPU serialization path for that endpoint.
```cmake
find_package(cuda_buffer_backend_msgs REQUIRED)
find_package(rmw REQUIRED)
find_package(rcutils REQUIRED)
find_package(CUDAToolkit REQUIRED)
```
We need to include this dependency in the package.xml; the requirement is probably nvcc?
Is this key enough: https://github.com/ros/rosdistro/blob/master/rosdep/base.yaml#L8367C1-L8367C12 ?
Thanks for the pointer -- added `nvidia-cuda` to the dependency list, which should provide everything `find_package(CUDAToolkit)` needs.
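For reference, the resulting package.xml entry would look something like this (assuming the `nvidia-cuda` rosdep key from the linked base.yaml):

```xml
<!-- resolves via rosdep to the system CUDA toolkit packages -->
<depend>nvidia-cuda</depend>
```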
```cmake
set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_CXX_COMPILER}")
endif()

find_package(Torch REQUIRED)
```
Not sure about this one. Is there an Ubuntu package for this one?
```shell
wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
unzip libtorch-shared-with-deps-latest.zip
```

Then I used `-DCMAKE_PREFIX_PATH=<path to pytorch>`.
I finally used version 11.8, but nvcc is 12.0 in Ubuntu:
https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.7.1%2Bcu118.zip
Maybe we should consider whether it's worth adding a vendor package to install libtorch.
Agreed, that would be very helpful. A `libtorch_vendor` package would simplify the setup and remove the `CMAKE_PREFIX_PATH` hack.
Using the cu118 build with a CUDA 12.x runtime should be fine; a newer driver is backward-compatible with older CUDA applications.
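For what it's worth, a `libtorch_vendor` could follow the usual ROS vendor-package pattern. A rough sketch (the package layout is an assumption; the URL is the cu118 build linked above):

```cmake
cmake_minimum_required(VERSION 3.16)
project(libtorch_vendor)

include(ExternalProject)
# Download and unpack the prebuilt libtorch archive at build time;
# a binary distribution needs no configure/build step.
ExternalProject_Add(libtorch
  URL https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.7.1%2Bcu118.zip
  CONFIGURE_COMMAND ""
  BUILD_COMMAND ""
  INSTALL_COMMAND ""
)
# A real vendor package would also install the extracted tree and export
# Torch_DIR so downstream packages can find_package(Torch) without
# CMAKE_PREFIX_PATH hacks.
```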
ahcorde left a comment
I also detected some linter failures, and `test_cuda_image_cpu_fallback_fastrtps_launch` is not passing, with this error:
```
6: FAIL: test_cpu_fallback_paths (cuda_buffer_backend.TestCudaImageCpuFallbackFastRTPS.test_cpu_fallback_paths)
6: Test all CPU fallback paths and normal IPC simultaneously over FastRTPS.
6: ----------------------------------------------------------------------
6: Traceback (most recent call last):
6:   File "/tmp/ws/src/rosidl_buffer_backends/cuda_buffer_backend/cuda_buffer_backend/test/test_cuda_image_cpu_fallback_fastrtps_launch.py", line 203, in test_cpu_fallback_paths
6:     self.assertTrue(
6: AssertionError: False is not true : Cross-device fallback validation failed (expected backend="cpu")
```
Thanks for the review and feedback! I've fixed the linter errors.
ahcorde left a comment
I'm getting some errors when compiling the code:

- I have to set the g++ and gcc version on torch_buffer:

```cmake
set(CMAKE_C_COMPILER gcc-12)
set(CMAKE_CXX_COMPILER g++-12)
```

- I'm getting this link error:
```
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaGraphAddDependencies_v2@libcudart.so.12'
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaStreamGetCaptureInfo_v3@libcudart.so.12'
/usr/bin/ld: /home/ahcorde/buffer_backends/install/opt/libtorch_vendor/lib/libtorch_cuda.so: undefined reference to `cudaStreamUpdateCaptureDependencies_v2@libcudart.so.12'
```
Thanks for testing! Both issues should be fixed now. The g++-12 pin was the CUDA toolkit's host_config.h rejecting a newer host GCC (CUDA 11.8 accepts up to GCC 11; CUDA 12.0–12.3 accepts up to GCC 12).
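The GCC ceiling could also be handled in CMake rather than pinning the compilers globally. A sketch (an assumption, not necessarily the PR's actual fix; version bounds per host_config.h):

```cmake
# Pick a CUDA host compiler the installed toolkit will accept, instead of
# forcing CMAKE_C_COMPILER/CMAKE_CXX_COMPILER for the whole package.
find_package(CUDAToolkit REQUIRED)
if(CUDAToolkit_VERSION VERSION_LESS 12.0)
  set(CMAKE_CUDA_HOST_COMPILER g++-11)  # CUDA 11.8 rejects GCC > 11
else()
  set(CMAKE_CUDA_HOST_COMPILER g++-12)  # CUDA 12.0-12.3 rejects GCC > 12
endif()
# Must be set before the CUDA language is enabled.
enable_language(CUDA)
```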
Description
This pull request adds CUDA and PyTorch buffer backend implementations for the rosidl::Buffer, enabling zero-copy GPU memory sharing between ROS 2 publishers and subscribers.
CUDA buffer backend: Enables zero-copy GPU data transport with a fully asynchronous pipeline; data can stay on the GPU across ROS nodes.
- `allocate_msg` allocates from a CUDA Virtual Memory Management (VMM) based IPC memory pool; each block carries a pre-exported POSIX FD for zero-overhead IPC reuse.
- `from_buffer` returns a `WriteHandle`/`ReadHandle` that manages GPU stream ordering via CUDA events (no `cudaStreamSynchronize` in the pipeline).
- On transmit, the plugin checks locality via a shared-memory endpoint registry: for same-host, same-GPU peers, it sends the block's FD over a Unix socket along with an IPC event handle for cross-process GPU sync; otherwise it falls back to CPU serialization.
- On receive, the block is imported and mapped (cached per source block), with a shared-memory refcount and UID validation to prevent stale reuse.
- A background recycler thread handles event synchronization and block reclamation off the callback thread.

Torch buffer backend: A device-agnostic layer on top of device buffer backends (e.g. cuda_buffer_backend) that lets users work with `torch::Tensor` directly.
- `allocate_msg` creates a `TorchBufferImpl` wrapping a buffer with tensor metadata (shape, strides, dtype); the device is auto-detected at compile time, and if no accelerated buffer backend is installed it falls back to CPU.
- `from_buffer` returns a `torch::Tensor` view backed by the device buffer's handle (write or read, captured in the tensor deleter for event lifetime safety).
- `to_buffer` copies a pre-existing torch tensor into the allocated buffer.
- On transmit, the `TorchBufferDescriptor` carries tensor metadata alongside a nested `device_data` field that the RMW serializes via whichever device backend plugin is registered.

This pull request consists of the following key components:
- `cuda_buffer`: Core CUDA buffer library providing a VMM-backed CUDA IPC memory pool, a host endpoint manager for locality discovery over shared memory, and user-facing `allocate_msg`/`from_buffer`/`to_buffer` APIs with RAII CUDA-event-based GPU synchronization (`ReadHandle`/`WriteHandle`).
- `cuda_buffer_backend`: `BufferBackend` plugin registered via pluginlib. Handles endpoint discovery, `CudaBufferDescriptor` serialization with VMM IPC handles, IPC refcount lifecycle, and automatic CPU fallback when CUDA IPC is unavailable.
- `cuda_buffer_backend_msgs`: ROS 2 message definition for `CudaBufferDescriptor`.
- `torch_buffer`: PyTorch buffer library wrapping device buffers with tensor metadata (shape, strides, dtype). Provides `allocate_msg`/`from_buffer`/`to_buffer` APIs that auto-detect the device backend at compile time.
- `torch_buffer_backend`: `BufferBackend` plugin for PyTorch tensors. Handles `TorchBufferDescriptor` serialization with nested device buffer delegation.
- `torch_buffer_backend_msgs`: ROS 2 message definition for `TorchBufferDescriptor`.

Is this a user-facing behavior change?
No.
Did you use Generative AI?
Yes. Claude (claude-4.6-opus) via Cursor was used to assist with creating an initial prototype version of the changes contained in this PR.
Additional Information
This PR is part of the broader ROS 2 native buffer feature introduced in this post.