A high-level description of cuCascade's design, core components, and how they work together.
- Overview
- System Architecture
- Memory Tier System
- Core Components
- End-to-End Data Flow
- Thread Safety Model
- Design Patterns
- Key Source Files
cuCascade is a GPU memory management library for data-intensive applications that need to process more data than fits in GPU memory. It solves a fundamental problem: GPU memory is fast but limited, while host memory and disk are larger but slower.
Rather than failing when GPU memory runs out, cuCascade provides:
- Tiered memory -- seamlessly allocate across GPU, pinned host, and disk storage
- Reservation-based allocation -- prevent GPU oversubscription by reserving memory upfront
- Automatic data movement -- move data between tiers based on configurable pressure thresholds
- Hardware-aware placement -- discover NUMA topology and place memory optimally
- Safe concurrent access -- state machine and RAII handles protect data during multi-threaded processing
graph TB
subgraph "Application Layer"
APP[Application / Query Engine]
end
subgraph "Data Module"
DRM[data_repository_manager]
DR[data_repository]
DB[data_batch]
RC[representation_converter_registry]
GPU_REP[gpu_table_representation]
HOST_REP[host_data_representation / host_data_packed_representation]
end
subgraph "Memory Module"
MRM[memory_reservation_manager]
MS_GPU[memory_space GPU]
MS_HOST[memory_space HOST]
MS_DISK[memory_space DISK]
RARA[reservation_aware_resource_adaptor]
FSHMR[fixed_size_host_memory_resource]
DAL[disk_access_limiter]
TD[topology_discovery]
RMC[reservation_manager_configurator]
end
subgraph "Hardware"
GPU_MEM[GPU Memory]
HOST_MEM[Pinned Host Memory]
DISK_MEM[Disk Storage]
end
APP --> DRM
APP --> MRM
DRM --> DR
DR --> DB
DB --> GPU_REP
DB --> HOST_REP
DB --> RC
MRM --> MS_GPU
MRM --> MS_HOST
MRM --> MS_DISK
TD --> RMC
RMC --> MRM
MS_GPU --> RARA
MS_HOST --> FSHMR
MS_DISK --> DAL
RARA --> GPU_MEM
FSHMR --> HOST_MEM
DAL --> DISK_MEM
RC --> MS_GPU
RC --> MS_HOST
cuCascade organizes memory into three tiers, each with different performance and capacity characteristics:
| Tier | Backing | Speed | Capacity | Allocator |
|---|---|---|---|---|
| GPU | Device VRAM | Fastest | Limited (8-80 GB typical) | reservation_aware_resource_adaptor |
| HOST | Pinned CPU RAM | Medium | Large (64-512 GB typical) | fixed_size_host_memory_resource |
| DISK | NVMe/SSD | Slowest | Very large (TB+) | disk_access_limiter |
Each memory space is identified by a memory_space_id -- a (Tier, device_id) pair. For example, GPU device 0 is (GPU, 0) and NUMA node 1's host memory is (HOST, 1).
Data flows downward (GPU -> HOST -> DISK) when memory pressure is high (downgrade), and upward (DISK -> HOST -> GPU) when data is needed for processing (upgrade).
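The identifier pair and the two movement directions can be illustrated with a small standalone sketch. The names `tier`, `space_id`, `next_lower_tier`, and `next_higher_tier` are hypothetical stand-ins chosen for this example; the real `Tier` enum and `memory_space_id` are defined in include/cucascade/memory/common.hpp and may differ.

```cpp
#include <cstdint>
#include <optional>
#include <tuple>

// Hypothetical stand-ins for illustration only; not the cuCascade definitions.
enum class tier : std::uint8_t { gpu = 0, host = 1, disk = 2 };

struct space_id {
  tier t;
  int device_id;  // GPU ordinal, NUMA node, or disk index, depending on the tier

  friend bool operator==(const space_id& a, const space_id& b) {
    return std::tie(a.t, a.device_id) == std::tie(b.t, b.device_id);
  }
};

// Downgrade direction: GPU -> HOST -> DISK. DISK has no slower tier below it.
inline std::optional<tier> next_lower_tier(tier t) {
  switch (t) {
    case tier::gpu: return tier::host;
    case tier::host: return tier::disk;
    case tier::disk: return std::nullopt;
  }
  return std::nullopt;
}

// Upgrade direction: DISK -> HOST -> GPU. GPU has no faster tier above it.
inline std::optional<tier> next_higher_tier(tier t) {
  switch (t) {
    case tier::disk: return tier::host;
    case tier::host: return tier::gpu;
    case tier::gpu: return std::nullopt;
  }
  return std::nullopt;
}
```

Returning `std::optional` makes the end of the chain explicit, so a caller cannot accidentally downgrade past DISK or upgrade past GPU.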
File: include/cucascade/memory/memory_reservation_manager.hpp
The memory_reservation_manager is the central coordinator for all memory operations. It owns all memory_space instances and provides a strategy-based interface for requesting memory reservations.
// Request 1GB on any available GPU
auto reservation = manager.request_reservation(
any_memory_space_in_tier(Tier::GPU), 1ULL << 30);
// Request on a specific host NUMA node
auto host_res = manager.request_reservation(
specific_memory_space(Tier::HOST, 0), 2ULL << 30);
Reservation request strategies (Strategy pattern):
| Strategy | Behavior |
|---|---|
| any_memory_space_in_tier(tier) | Any space in the given tier |
| any_memory_space_in_tier_with_preference(tier, device) | Preferred device, falls back to others |
| specific_memory_space(tier, device) | Exact memory space, no fallback |
| any_memory_space_in_tiers(tiers) | Try tiers in preference order |
| any_memory_space_to_downgrade(src, target_tier) | Find a target for spilling data down |
| any_memory_space_to_upgrade(src, target_tier) | Find a target for promoting data up |
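To make the Strategy pattern concrete, here is a minimal standalone sketch of three of the behaviors listed above. It does not use the cuCascade classes: `space`, `strategy`, `any_in_tier`, `any_in_tier_with_preference`, `specific`, and `try_reserve` are hypothetical names, and modeling a strategy as "an ordered list of candidate spaces" is an assumption made for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

enum class tier { gpu, host, disk };

struct space {
  tier t;
  int device_id;
  std::size_t free_bytes;
};

// A strategy maps the known spaces to an ordered list of candidate indices.
using strategy = std::function<std::vector<std::size_t>(const std::vector<space>&)>;

// Any space in the given tier, in declaration order.
strategy any_in_tier(tier t) {
  return [t](const std::vector<space>& spaces) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < spaces.size(); ++i)
      if (spaces[i].t == t) out.push_back(i);
    return out;
  };
}

// Preferred device first, then every other space in the same tier.
strategy any_in_tier_with_preference(tier t, int preferred) {
  return [=](const std::vector<space>& spaces) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < spaces.size(); ++i)
      if (spaces[i].t == t && spaces[i].device_id == preferred) out.push_back(i);
    for (std::size_t i = 0; i < spaces.size(); ++i)
      if (spaces[i].t == t && spaces[i].device_id != preferred) out.push_back(i);
    return out;
  };
}

// Exactly one space, no fallback.
strategy specific(tier t, int device) {
  return [=](const std::vector<space>& spaces) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < spaces.size(); ++i)
      if (spaces[i].t == t && spaces[i].device_id == device) out.push_back(i);
    return out;
  };
}

// Takes `bytes` from the first candidate that has room; returns its index.
std::optional<std::size_t> try_reserve(std::vector<space>& spaces,
                                       const strategy& pick,
                                       std::size_t bytes) {
  for (std::size_t i : pick(spaces)) {
    if (spaces[i].free_bytes >= bytes) {
      spaces[i].free_bytes -= bytes;
      return i;
    }
  }
  return std::nullopt;  // no candidate can satisfy the request
}
```

Because the selection logic lives entirely in the strategy object, `try_reserve` stays unchanged when a new placement rule is added.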
If no space can satisfy a request, the manager blocks on a condition variable until memory is freed.
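The blocking behavior can be sketched with a plain mutex and condition variable. This is an illustration of the general technique, not the manager's code: `blocking_budget`, `acquire`, and `release` are hypothetical names, and a single byte budget stands in for the set of memory spaces.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical single-budget model of "wait until memory is freed".
class blocking_budget {
 public:
  explicit blocking_budget(std::size_t capacity) : free_(capacity) {}

  // Blocks until `bytes` are available, then takes them.
  // Note: a request larger than the total capacity would wait forever here.
  void acquire(std::size_t bytes) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] { return free_ >= bytes; });
    free_ -= bytes;
  }

  // Returns bytes to the budget and wakes every waiter so each can re-check.
  void release(std::size_t bytes) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_ += bytes;
    }
    cv_.notify_all();
  }

  std::size_t free_bytes() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return free_;
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t free_;
};
```

`notify_all` is used rather than `notify_one` because waiters may be asking for different sizes, and the one that happens to wake first is not necessarily the one that now fits.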
File: include/cucascade/memory/memory_space.hpp
A memory_space represents a single memory location (e.g., one GPU, one NUMA node, one disk mount). It holds:
- A tier-specific allocator (the actual memory resource)
- A reservation allocator that tracks and enforces reservations
- A stream pool for CUDA stream management
- A notification channel for signaling when memory is freed
- Downgrade thresholds controlling when data should be moved to a slower tier
memory_space (GPU, device 0)
├── allocator: rmm::cuda_async_memory_resource
├── reservation_allocator: reservation_aware_resource_adaptor
├── stream_pool: exclusive_stream_pool (16 streams)
├── notification_channel: shared_ptr<notification_channel>
├── capacity: 8 GB
├── reservation_limit: 6.8 GB (85%)
├── downgrade_trigger: 6.8 GB (85%)
└── downgrade_stop: 5.2 GB (65%)
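The two downgrade thresholds suggest a hysteresis band: start spilling at the trigger level and keep going until usage falls to the stop level. The sketch below works through that reading; whether cuCascade implements exactly this is an assumption, and `downgrade_thresholds` and `downgrade_controller` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical threshold holder; percentages are integers to avoid rounding.
struct downgrade_thresholds {
  std::uint64_t capacity_bytes;
  std::uint64_t trigger_percent;  // begin downgrading at or above this level
  std::uint64_t stop_percent;     // stop once usage is at or below this level

  std::uint64_t trigger_bytes() const { return capacity_bytes / 100 * trigger_percent; }
  std::uint64_t stop_bytes() const { return capacity_bytes / 100 * stop_percent; }
};

// Tracks whether downgrading is currently active (simple hysteresis).
class downgrade_controller {
 public:
  explicit downgrade_controller(downgrade_thresholds t) : t_(t) {}

  // Feed the current usage; returns true while data should keep moving down.
  bool update(std::uint64_t used_bytes) {
    if (!active_ && used_bytes >= t_.trigger_bytes()) {
      active_ = true;
    } else if (active_ && used_bytes <= t_.stop_bytes()) {
      active_ = false;
    }
    return active_;
  }

 private:
  downgrade_thresholds t_;
  bool active_ = false;
};
```

With the figures shown above (8 GB capacity, 85% and 65%), downgrading would begin at 6.8 GB and continue until usage drops to 5.2 GB. The gap between the two levels keeps the system from oscillating around a single threshold.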
File: include/cucascade/memory/memory_reservation.hpp
Reservations guarantee that a certain amount of memory is available before any allocation happens. This prevents GPU oversubscription in multi-tenant scenarios.
sequenceDiagram
participant App
participant Manager as memory_reservation_manager
participant Space as memory_space
participant Adaptor as reservation_aware_resource_adaptor
App->>Manager: request_reservation(strategy, size)
Manager->>Space: make_reservation(size)
Space->>Adaptor: reserve(size)
Adaptor-->>Space: device_reserved_arena
Space-->>Manager: reservation
Manager-->>App: unique_ptr<reservation>
Note over App: Use reservation for allocations
App->>App: reservation goes out of scope
Note over Adaptor: Arena destructor frees reservation
Note over Space: notification_channel notifies waiters
Reservation limit policies control what happens when an allocation exceeds its stream's reservation:
| Policy | Behavior |
|---|---|
| fail_reservation_limit_policy | Throws rmm::out_of_memory (default) |
| ignore_reservation_limit_policy | Allows over-reservation (soft limits) |
| increase_reservation_limit_policy | Auto-grows reservation by 1.25x padding |
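The three policies can be compared side by side in a small accounting sketch. It is not the library's implementation: `limit_policy`, `stream_reservation`, and `account_allocation` are hypothetical, `std::bad_alloc` stands in for `rmm::out_of_memory`, and "1.25x padding" is interpreted here as growing the reservation to 1.25 times the required size, which is an assumption.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-ins for the three policy classes.
enum class limit_policy { fail, ignore, increase };

struct stream_reservation {
  std::size_t reserved;
  std::size_t used = 0;
};

// Records an allocation of `bytes` against `r`, applying `policy` when the
// allocation would exceed the reserved amount.
void account_allocation(stream_reservation& r, std::size_t bytes, limit_policy policy) {
  const std::size_t needed = r.used + bytes;
  if (needed > r.reserved) {
    switch (policy) {
      case limit_policy::fail:
        // Stand-in for rmm::out_of_memory; the reservation is left untouched.
        throw std::bad_alloc();
      case limit_policy::ignore:
        // Soft limit: allow usage to exceed the reservation.
        break;
      case limit_policy::increase:
        // Assumed reading of "1.25x padding": required size plus 25%.
        r.reserved = needed + needed / 4;
        break;
    }
  }
  r.used = needed;
}
```

A real adaptor would also have to obtain the extra bytes from the owning memory space before growing; that step is omitted here.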
File: include/cucascade/data/data_batch.hpp
A data_batch is the fundamental unit of data in cuCascade. It wraps a tier-specific data representation (GPU table or host table) and manages concurrent access through a state machine.
stateDiagram-v2
direction LR
idle --> task_created : try_to_create_task()
idle --> in_transit : try_to_lock_for_in_transit()
task_created --> processing : try_to_lock_for_processing()
task_created --> in_transit : try_to_lock_for_in_transit()
task_created --> idle : try_to_cancel_task()
processing --> idle : handle destruction
processing --> task_created : try_to_create_task()
in_transit --> idle : try_to_release_in_transit()
in_transit --> task_created : try_to_release_in_transit(task_created)
States:
- idle -- no pending work, available for scheduling or tier movement
- task_created -- a task has been registered but processing hasn't started
- processing -- one or more RAII data_batch_processing_handles are active
- in_transit -- locked for movement between memory tiers (no concurrent access)
The data_batch_processing_handle uses RAII to ensure the processing count is always correctly decremented, even on exceptions.
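A reduced version of the state machine and its RAII handle might look like the sketch below. It covers only a subset of the transitions in the diagram (no task cancellation, and in-transit always releases to idle), and the classes `batch` and `processing_handle` are hypothetical simplifications rather than the real `data_batch` API.

```cpp
#include <cstddef>
#include <mutex>

enum class batch_state { idle, task_created, processing, in_transit };

class batch;

// RAII handle: gives back its share of the processing count on destruction.
class processing_handle {
 public:
  processing_handle() = default;
  explicit processing_handle(batch* owner) : owner_(owner) {}
  processing_handle(processing_handle&& other) noexcept : owner_(other.owner_) {
    other.owner_ = nullptr;
  }
  processing_handle(const processing_handle&) = delete;
  processing_handle& operator=(const processing_handle&) = delete;
  ~processing_handle();

  bool valid() const { return owner_ != nullptr; }

 private:
  batch* owner_ = nullptr;
};

class batch {
 public:
  bool try_to_create_task() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (state_ != batch_state::idle && state_ != batch_state::processing) return false;
    state_ = batch_state::task_created;
    return true;
  }

  // Returns an invalid handle if the batch is not in task_created.
  processing_handle try_to_lock_for_processing() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (state_ != batch_state::task_created) return processing_handle{};
    state_ = batch_state::processing;
    ++processing_count_;
    return processing_handle{this};
  }

  // Tier movement requires that no processing handle is alive.
  bool try_to_lock_for_in_transit() {
    std::lock_guard<std::mutex> lock(mutex_);
    const bool movable =
        state_ == batch_state::idle ||
        (state_ == batch_state::task_created && processing_count_ == 0);
    if (!movable) return false;
    state_ = batch_state::in_transit;
    return true;
  }

  bool try_to_release_in_transit() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (state_ != batch_state::in_transit) return false;
    state_ = batch_state::idle;
    return true;
  }

  batch_state state() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return state_;
  }

 private:
  friend class processing_handle;

  void release_processing() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--processing_count_ == 0 && state_ == batch_state::processing) {
      state_ = batch_state::idle;
    }
  }

  mutable std::mutex mutex_;
  batch_state state_ = batch_state::idle;
  std::size_t processing_count_ = 0;
};

inline processing_handle::~processing_handle() {
  if (owner_ != nullptr) owner_->release_processing();
}
```

Because the decrement happens in the handle's destructor, the batch returns to idle even when the processing code exits early through an exception.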
File: include/cucascade/data/data_repository.hpp
A data_repository is partitioned storage for data batches. It provides blocking pop operations that wait until a batch reaches the requested state.
// Pop a batch that can transition to task_created (blocks if none ready)
auto batch = repository.pop_data_batch(batch_state::task_created);
// Pop a specific batch by ID
auto batch_by_id = repository.pop_data_batch_by_id(42, batch_state::in_transit);
The data_repository_manager coordinates multiple repositories across operators/pipelines and provides atomic batch ID generation.
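The two mechanisms described here, a blocking pop and lock-free ID generation, can be sketched without any cuCascade types. `batch_id_generator` and `simple_repository` are hypothetical names; the sketch has a single partition, stores bare IDs instead of batches, and ignores batch state.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical stand-in for atomic batch ID generation.
class batch_id_generator {
 public:
  // fetch_add returns the previous value, so the first ID handed out is 0.
  std::uint64_t next() { return next_id_.fetch_add(1, std::memory_order_relaxed); }

 private:
  std::atomic<std::uint64_t> next_id_{0};
};

// Hypothetical single-partition repository that stores ready batch IDs.
class simple_repository {
 public:
  void add(std::uint64_t batch_id) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ready_.push_back(batch_id);
    }
    cv_.notify_one();
  }

  // Blocks until at least one batch is available, then removes and returns it.
  std::uint64_t pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !ready_.empty(); });
    return take_front();
  }

  // Non-blocking variant: returns std::nullopt when nothing is ready.
  std::optional<std::uint64_t> try_pop() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (ready_.empty()) return std::nullopt;
    return take_front();
  }

 private:
  // Caller must hold mutex_ and have checked that ready_ is not empty.
  std::uint64_t take_front() {
    const std::uint64_t id = ready_.front();
    ready_.pop_front();
    return id;
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::uint64_t> ready_;
};
```

Relaxed ordering is enough for the ID counter because only uniqueness matters; the IDs do not publish any other data.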
File: include/cucascade/data/representation_converter.hpp
The representation_converter_registry provides type-indexed conversion between data representations. Converters are registered as functions keyed by (source_type, target_type).
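A type-indexed registry of this kind can be sketched with `std::type_index` keys. The classes below (`representation`, `gpu_rep`, `host_rep`, `converter_registry`) are hypothetical placeholders with made-up members, not the cuCascade types, and the sketch omits the mutex that a shared registry would need.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Hypothetical base class and two placeholder representations.
struct representation {
  virtual ~representation() = default;
};
struct gpu_rep : representation {
  explicit gpu_rep(int r) : rows(r) {}
  int rows;
};
struct host_rep : representation {
  explicit host_rep(int r) : rows(r) {}
  int rows;
};

class converter_registry {
 public:
  // Registers fn, which must accept `const Source&` and return
  // std::unique_ptr<Target>, under the key (Source, Target).
  template <typename Source, typename Target, typename Fn>
  void register_converter(Fn fn) {
    converters_[key(typeid(Source), typeid(Target))] =
        [fn = std::move(fn)](const representation& src) -> std::unique_ptr<representation> {
      return fn(static_cast<const Source&>(src));
    };
  }

  // Looks up a converter by the dynamic type of `src` and the requested Target.
  template <typename Target>
  std::unique_ptr<Target> convert(const representation& src) const {
    const auto it = converters_.find(key(typeid(src), typeid(Target)));
    if (it == converters_.end()) throw std::out_of_range("no converter registered");
    std::unique_ptr<representation> out = it->second(src);
    // Safe: the stored converter for this key always produces a Target.
    return std::unique_ptr<Target>(static_cast<Target*>(out.release()));
  }

 private:
  using key_type = std::pair<std::type_index, std::type_index>;
  using erased_converter =
      std::function<std::unique_ptr<representation>(const representation&)>;

  static key_type key(std::type_index source, std::type_index target) {
    return {source, target};
  }

  std::map<key_type, erased_converter> converters_;
};
```

Registration would then read `registry.register_converter<gpu_rep, host_rep>(fn)`, and callers only name the target: `registry.convert<host_rep>(source)`. The source type is recovered at run time through `typeid` on the polymorphic reference.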
Built-in converters:
| Source | Target | Method |
|---|---|---|
| GPU table | Host (direct) | cudaMemcpyBatchAsync (D2H) — copies column buffers directly; ~99% PCIe bandwidth |
| Host (direct) | GPU table | cudaMemcpyBatchAsync (H2D) — reconstructs cudf::column tree from metadata |
| GPU table | Host (packed) | cudf::pack() -> cudaMemcpyAsync (D2H) -> multi-block host allocation |
| Host (packed) | GPU table | cudaMemcpyAsync (H2D) -> cudf::unpack() on device |
| GPU table | GPU table | cudf::pack() -> cudaMemcpyPeerAsync -> cudf::unpack() (cross-device) |
| Host (packed) | Host (packed) | Block-by-block std::memcpy (cross-NUMA) |
"Host (direct)" = host_data_representation — preferred, no intermediate GPU allocation.
"Host (packed)" = host_data_packed_representation — uses cudf's pack/unpack serialization.
File: include/cucascade/memory/topology_discovery.hpp
The topology_discovery class detects the hardware layout at runtime using NVML (loaded dynamically via dlopen) and the /sys filesystem. It discovers:
- GPU devices with PCIe bus IDs and UUIDs
- NUMA node assignments for each GPU
- CPU core affinity per GPU
- Network devices (InfiniBand/RoCE NICs) with NUMA affinity
- Storage devices (NVMe, SATA SSD/HDD) with NUMA affinity
- PCIe path types between devices (PIX, PXB, PHB, NODE, SYS)
This information feeds into reservation_manager_configurator to automatically bind host memory spaces to the correct NUMA nodes for each GPU.
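As an illustration of the /sys side of discovery, the sketch below reads a PCI device's NUMA node from the Linux sysfs attribute `/sys/bus/pci/devices/<bus id>/numa_node`. The helper names are hypothetical, and this is not taken from topology_discovery; it only shows one plausible way such a lookup can be done.

```cpp
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Builds the sysfs path for a PCI bus ID such as "0000:3b:00.0".
std::string numa_node_path(const std::string& pci_bus_id) {
  return "/sys/bus/pci/devices/" + pci_bus_id + "/numa_node";
}

// Parses the contents of a numa_node file. The kernel writes -1 when the
// device has no NUMA affinity, which is reported here as std::nullopt.
std::optional<int> parse_numa_node(const std::string& contents) {
  std::istringstream in(contents);
  int node = 0;
  if (!(in >> node) || node < 0) return std::nullopt;
  return node;
}

// Reads and parses the attribute; std::nullopt if the file cannot be opened.
std::optional<int> read_numa_node(const std::string& pci_bus_id) {
  std::ifstream file(numa_node_path(pci_bus_id));
  if (!file) return std::nullopt;
  std::stringstream buffer;
  buffer << file.rdbuf();
  return parse_numa_node(buffer.str());
}
```

Separating parsing from file access keeps the logic testable on machines that have no GPU or no /sys filesystem.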
A typical lifecycle of data through cuCascade:
1. INGESTION
Create data representation (e.g., gpu_table_representation wrapping a cuDF table)
-> Wrap in data_batch with unique ID from data_repository_manager
-> Add to data_repository
2. TASK SCHEDULING
batch.try_to_create_task() [idle -> task_created]
repository.pop_data_batch(task_created) [retrieve batch for processing]
3. PROCESSING
batch.try_to_lock_for_processing() [task_created -> processing]
-> Returns data_batch_processing_handle (RAII)
-> Access data via batch.get_data()
-> Handle destructs on scope exit [processing -> idle]
4. MEMORY PRESSURE (downgrade)
memory_space.should_downgrade_memory() [threshold exceeded]
batch.try_to_lock_for_in_transit() [idle -> in_transit]
converter_registry.convert<host_data_representation>(...)
batch.set_data(new_representation) [data now on HOST]
batch.try_to_release_in_transit() [in_transit -> idle]
5. DATA NEEDED (upgrade)
Same flow as downgrade but in reverse tier direction
6. CLEANUP
manager.clear_all_repositories() [verify all batches consumed]
cuCascade uses a strict lock hierarchy to prevent deadlocks:
Level 1: atomic<uint64_t> (batch ID generation -- lock-free)
|
Level 2: data_batch._mutex (protects state machine)
|
Level 3: idata_repository._mutex (protects batch storage)
|
Level 4: data_repository_manager._mutex (protects repository map)
|
Level 5: representation_converter_registry._mutex (protects converter map)
|
Level 6: memory_reservation_manager._wait_mutex (protects reservation waiting)
Key synchronization primitives:
- std::mutex -- guards state transitions, storage, and configuration
- std::condition_variable -- blocks repository pops and reservation requests until satisfied
- std::atomic -- lock-free counters for batch IDs, allocated bytes, and peak tracking
- atomic_bounded_counter -- CAS-based bounded arithmetic for reservation enforcement
- notification_channel -- signals waiting threads when reservations are released
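CAS-based bounded arithmetic can be sketched as a compare-exchange loop that refuses any update that would cross the limit. `bounded_counter` below is a hypothetical illustration of the technique; the real `atomic_bounded_counter` in include/cucascade/utils/atomics.hpp may have a different interface.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical counter that never exceeds `limit` and never goes below zero.
class bounded_counter {
 public:
  explicit bounded_counter(std::uint64_t limit) : limit_(limit) {}

  // Adds `delta` only if the result stays within the limit.
  bool try_add(std::uint64_t delta) {
    std::uint64_t current = value_.load(std::memory_order_relaxed);
    do {
      // current <= limit_ always holds, so the subtraction cannot wrap.
      if (delta > limit_ - current) return false;
    } while (!value_.compare_exchange_weak(current, current + delta,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed));
    return true;
  }

  // Subtracts `delta`, clamping at zero instead of wrapping around.
  void sub(std::uint64_t delta) {
    std::uint64_t current = value_.load(std::memory_order_relaxed);
    std::uint64_t next = 0;
    do {
      next = current > delta ? current - delta : 0;
    } while (!value_.compare_exchange_weak(current, next,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed));
  }

  std::uint64_t value() const { return value_.load(std::memory_order_acquire); }

 private:
  const std::uint64_t limit_;
  std::atomic<std::uint64_t> value_{0};
};
```

On a failed exchange, `compare_exchange_weak` reloads `current`, so the bound is re-checked against the latest value on every retry. That re-check is what lets concurrent reservers share a limit without a mutex.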
| Pattern | Where Used |
|---|---|
| Strategy | reservation_request_strategy subclasses for memory selection |
| Builder | reservation_manager_configurator for fluent system configuration |
| RAII | data_batch_processing_handle, borrowed_stream, multiple_blocks_allocation, notify_on_exit |
| Factory | DeviceMemoryResourceFactoryFn for tier-specific allocator creation |
| Adapter | reservation_aware_resource_adaptor wraps RMM resources with tracking |
| State Machine | data_batch with explicit states and guarded transitions |
| Observer | notification_channel / event_notifier for memory release signaling |
| Type-Indexed Registry | representation_converter_registry keyed by (source_type, target_type) |
| Variant | memory_space_config for tier-specific configuration, reserving_adaptor_type for allocators |
| File | Purpose |
|---|---|
| include/cucascade/memory/common.hpp | Tier enum, memory_space_id, factory functions |
| include/cucascade/memory/config.hpp | Configuration structs for GPU, HOST, DISK |
| include/cucascade/memory/memory_reservation_manager.hpp | Central reservation coordinator and strategies |
| include/cucascade/memory/memory_space.hpp | Per-location memory container |
| include/cucascade/memory/memory_reservation.hpp | Reservation objects, arenas, and limit policies |
| include/cucascade/memory/reservation_aware_resource_adaptor.hpp | GPU allocator with per-stream tracking |
| include/cucascade/memory/fixed_size_host_memory_resource.hpp | Block-based pinned host allocator |
| include/cucascade/memory/disk_access_limiter.hpp | Disk tier reservation tracker |
| include/cucascade/memory/topology_discovery.hpp | NVML-based hardware topology detection |
| include/cucascade/memory/reservation_manager_configurator.hpp | Builder for system configuration |
| include/cucascade/memory/notification_channel.hpp | Cross-reservation signaling |
| include/cucascade/memory/stream_pool.hpp | CUDA stream pool with RAII borrowing |
| include/cucascade/memory/oom_handling_policy.hpp | OOM handling strategies |
| include/cucascade/memory/error.hpp | Custom error types and exceptions |
| include/cucascade/memory/numa_region_pinned_host_allocator.hpp | NUMA-aware pinned host allocation |
| include/cucascade/memory/host_table.hpp | host_table_allocation + column_metadata for direct-copy host representations |
| include/cucascade/memory/host_table_packed.hpp | host_table_packed_allocation for packed (cudf::pack) host representations |
| include/cucascade/memory/null_device_memory_resource.hpp | No-op resource for disk tier |
| File | Purpose |
|---|---|
| include/cucascade/data/common.hpp | idata_representation interface |
| include/cucascade/data/data_batch.hpp | Batch lifecycle, state machine, processing handles |
| include/cucascade/data/data_repository.hpp | Partitioned batch storage with blocking pops |
| include/cucascade/data/data_repository_manager.hpp | Multi-pipeline repository coordination |
| include/cucascade/data/representation_converter.hpp | Type-indexed converter registry |
| include/cucascade/data/gpu_data_representation.hpp | GPU-resident cuDF table wrapper |
| include/cucascade/data/cpu_data_representation.hpp | host_data_representation (direct buffer copy) and host_data_packed_representation (cudf::pack) |
| include/cucascade/utils/atomics.hpp | atomic_peak_tracker, atomic_bounded_counter |
| include/cucascade/utils/overloaded.hpp | Variant visitor helper |
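The variant visitor helper mentioned in the last row is usually the well-known C++17 `overloaded` idiom. The sketch below shows that idiom applied to a tier-specific configuration variant. The config structs and their fields are hypothetical and do not mirror `memory_space_config`, and the helper in overloaded.hpp may be written differently.

```cpp
#include <cstdint>
#include <string>
#include <variant>

// The classic C++17 "overloaded" idiom: inherit the call operators of
// several lambdas so std::visit can pick one per alternative.
template <class... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical per-tier configuration structs, for illustration only.
struct gpu_config {
  int device_id;
  std::uint64_t capacity_bytes;
};
struct host_config {
  int numa_node;
  std::uint64_t capacity_bytes;
};
struct disk_config {
  std::string mount_path;
  std::uint64_t capacity_bytes;
};

using space_config = std::variant<gpu_config, host_config, disk_config>;

// Dispatches on the active alternative without any manual type switch.
std::string describe(const space_config& cfg) {
  return std::visit(
      overloaded{
          [](const gpu_config& c) { return "GPU device " + std::to_string(c.device_id); },
          [](const host_config& c) { return "HOST NUMA node " + std::to_string(c.numa_node); },
          [](const disk_config& c) { return "DISK at " + c.mount_path; },
      },
      cfg);
}
```

If a new alternative is added to the variant and no lambda handles it, `std::visit` fails to compile, which is the main advantage over a hand-written tag check.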