
perf: Reduce per-call overhead in representation converters #94

Draft
mbrobbel wants to merge 1 commit into NVIDIA:main from mbrobbel:perf

Conversation

@mbrobbel
Member

Summary

  • Use shared_mutex in converter registry so concurrent convert/has_converter calls no longer serialize on an exclusive lock
  • Remove redundant stream.synchronize() calls in cross-GPU and packed HOST→GPU converters where stream ordering already guarantees correctness
  • Eliminate unnecessary metadata vector copy in packed HOST→GPU converter by reading source metadata directly
  • Skip cudaGetDevice/cudaSetDevice round-trip in single-GPU scenarios for both packed and fast HOST→GPU converters
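The locking change in the first bullet can be sketched roughly as below. This is a minimal illustration, not the project's actual registry: the class name `ConverterRegistry`, the key type, and the placeholder converter signature are all invented for the sketch.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical registry sketch: lookups take a shared lock, so concurrent
// convert()/has_converter() calls proceed in parallel; only registration
// takes the exclusive lock.
class ConverterRegistry {
 public:
  using Converter = std::function<int(int)>;  // placeholder signature

  void register_converter(std::string key, Converter fn) {
    std::unique_lock lock(mutex_);  // exclusive: writers still serialize
    converters_[std::move(key)] = std::move(fn);
  }

  bool has_converter(const std::string& key) const {
    std::shared_lock lock(mutex_);  // shared: many readers at once
    return converters_.count(key) != 0;
  }

  int convert(const std::string& key, int value) const {
    std::shared_lock lock(mutex_);  // lookup is read-only, so shared too
    return converters_.at(key)(value);
  }

 private:
  mutable std::shared_mutex mutex_;
  std::map<std::string, Converter> converters_;
};
```

Since registration typically happens once at startup while lookups happen on every conversion, the read-mostly workload is exactly where `std::shared_mutex` pays off over a plain `std::mutex`.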

Benchmark results

Packed HOST→GPU path (back-to-back A/B, 5 repetitions):

| Size | Threads | Baseline | Optimized | Change |
| ---- | ------- | -------- | --------- | ------ |
| 64K  | 1   | 0.047 ms | 0.040 ms | 15% faster |
| 256K | 1   | 0.061 ms | 0.047 ms | 23% faster |
| 1M   | 1   | 0.109 ms | 0.088 ms | 19% faster |
| 64K  | 4   | 0.087 ms | 0.079 ms | 9% faster |
| 256K | 4   | 0.101 ms | 0.090 ms | 11% faster |
| 1M   | 4   | 0.219 ms | 0.203 ms | 7% faster |
| 512M | any | ~20 ms   | ~20 ms   | same (PCIe bound) |

All other paths (GpuToHost packed, GpuToHostFast, HostFastToGpu) verified with reversed-order A/B tests showing no regressions.

Test plan

  • Existing unit tests pass (ctest)
  • Multi-threaded benchmark scenarios show no regressions
  • Consider adding a concurrent registry stress test (codex review warning)
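One possible shape for that stress test, sketched against a bare `shared_mutex`-protected map rather than the real registry class (the function and key names are stand-ins): writer threads register converters while reader threads perform lookups, and the test passes if the final state is consistent and nothing races or crashes under a sanitizer.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stress-test sketch: hammers a shared_mutex-protected map
// the way concurrent register/has_converter calls would. Returns the
// final registry size, which should equal the number of writer threads.
int run_stress(int readers, int writers, int iterations) {
  std::shared_mutex mutex;
  std::map<std::string, int> registry;

  std::vector<std::thread> threads;
  for (int w = 0; w < writers; ++w) {
    threads.emplace_back([&, w] {
      for (int i = 0; i < iterations; ++i) {
        std::unique_lock lock(mutex);  // exclusive, like registration
        registry["conv" + std::to_string(w)] = i;
      }
    });
  }
  for (int r = 0; r < readers; ++r) {
    threads.emplace_back([&] {
      for (int i = 0; i < iterations; ++i) {
        std::shared_lock lock(mutex);  // shared, like has_converter
        (void)registry.count("conv0");
      }
    });
  }
  for (auto& t : threads) t.join();
  return static_cast<int>(registry.size());
}
```

Running such a test under ThreadSanitizer would catch any lock-discipline mistakes that a plain pass/fail assertion misses.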

🤖 Created by Claude

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Use shared_mutex in converter registry so concurrent convert/has_converter
  calls no longer serialize on an exclusive lock
- Remove redundant stream.synchronize() before cudf::pack in cross-GPU converter
- Eliminate unnecessary metadata vector copy in packed HOST→GPU converter by
  reading source metadata directly (source outlives the unpack call)
- Remove redundant stream.synchronize() before cudf::unpack in packed HOST→GPU
  converter — stream ordering between H2D copies and table construction is
  guaranteed since both use the same stream
- Skip cudaGetDevice/cudaSetDevice round-trip in single-GPU scenarios for both
  packed and fast HOST→GPU converters
- Move BatchCopyAccumulator definition earlier in the file for reuse
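The single-GPU fast path in the last bullet but one can be illustrated with mocked device calls. Everything here is hypothetical scaffolding: `get_device`/`set_device` stand in for `cudaGetDevice`/`cudaSetDevice`, and the call counter exists only so the sketch can show how many API round-trips each variant makes.

```cpp
// Mock device state standing in for the CUDA runtime (sketch only).
static int g_current_device = 0;
static int g_device_api_calls = 0;

int get_device() { ++g_device_api_calls; return g_current_device; }
void set_device(int d) { ++g_device_api_calls; g_current_device = d; }

// Before: always save and restore the device, even with one GPU
// (three runtime calls per conversion).
void convert_before(int target_device) {
  int previous = get_device();
  set_device(target_device);
  // ... enqueue H2D copies and table construction ...
  set_device(previous);
}

// After: with a single GPU the target device is necessarily already
// current, so the get/set round-trip is skipped entirely.
void convert_after(int target_device, int device_count) {
  if (device_count > 1) {
    int previous = get_device();
    set_device(target_device);
    // ... enqueue H2D copies and table construction ...
    set_device(previous);
  } else {
    // ... enqueue H2D copies on the only device; zero runtime calls ...
    (void)target_device;
  }
}
```

Per the benchmark numbers above, shaving a few runtime calls matters only for small transfers, where fixed per-call overhead dominates the copy itself.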

Benchmarked improvement on packed HOST→GPU path (back-to-back A/B, 5 reps):
  256K/1T: 0.056ms → 0.044ms (21% faster)
  1M/1T:   0.095ms → 0.083ms (13% faster)
  64K/4T:  0.092ms → 0.077ms (16% faster)
  4M/4T:   0.726ms → 0.682ms (6% faster)
Large transfers (512M+) are PCIe bandwidth-limited and unchanged.

Created by Claude.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@mbrobbel
Member Author

/ok to test b42aba2

