
perf: Reduce per-call overhead in representation converters #94

Draft
mbrobbel wants to merge 1 commit into NVIDIA:main from mbrobbel:perf

Conversation

@mbrobbel
Member

Summary

  • Use shared_mutex in converter registry so concurrent convert/has_converter calls no longer serialize on an exclusive lock
  • Remove redundant stream.synchronize() calls in cross-GPU and packed HOST→GPU converters where stream ordering already guarantees correctness
  • Eliminate unnecessary metadata vector copy in packed HOST→GPU converter by reading source metadata directly
  • Skip cudaGetDevice/cudaSetDevice round-trip in single-GPU scenarios for both packed and fast HOST→GPU converters
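The locking change in the first bullet can be sketched roughly as below. This is a minimal illustration, not the project's actual registry: the class name `ConverterRegistry`, the key type, and the placeholder converter signature are all invented for the sketch.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical registry sketch: lookups take a shared lock, so concurrent
// convert()/has_converter() calls proceed in parallel; only registration
// takes the exclusive lock.
class ConverterRegistry {
 public:
  using Converter = std::function<int(int)>;  // placeholder signature

  void register_converter(std::string key, Converter fn) {
    std::unique_lock lock(mutex_);  // exclusive: writers still serialize
    converters_[std::move(key)] = std::move(fn);
  }

  bool has_converter(const std::string& key) const {
    std::shared_lock lock(mutex_);  // shared: many readers at once
    return converters_.count(key) != 0;
  }

  int convert(const std::string& key, int value) const {
    std::shared_lock lock(mutex_);  // lookup is read-only, so shared too
    return converters_.at(key)(value);
  }

 private:
  mutable std::shared_mutex mutex_;
  std::map<std::string, Converter> converters_;
};
```

Since registration typically happens once at startup while lookups happen on every conversion, the read-mostly workload is exactly where `std::shared_mutex` pays off over a plain `std::mutex`.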

Benchmark results

Packed HOST→GPU path (back-to-back A/B, 5 repetitions):

| Size | Threads | Baseline | Optimized | Change |
| ---- | ------- | -------- | --------- | ------ |
| 64K  | 1   | 0.047 ms | 0.040 ms | 15% faster |
| 256K | 1   | 0.061 ms | 0.047 ms | 23% faster |
| 1M   | 1   | 0.109 ms | 0.088 ms | 19% faster |
| 64K  | 4   | 0.087 ms | 0.079 ms | 9% faster |
| 256K | 4   | 0.101 ms | 0.090 ms | 11% faster |
| 1M   | 4   | 0.219 ms | 0.203 ms | 7% faster |
| 512M | any | ~20 ms   | ~20 ms   | same (PCIe bound) |

All other paths (GpuToHost packed, GpuToHostFast, HostFastToGpu) verified with reversed-order A/B tests showing no regressions.

Test plan

  • Existing unit tests pass (ctest)
  • Multi-threaded benchmark scenarios show no regressions
  • Consider adding a concurrent registry stress test (codex review warning)
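One possible shape for that stress test, sketched against a bare `shared_mutex`-protected map rather than the real registry class (the function and key names are stand-ins): writer threads register converters while reader threads perform lookups, and the test passes if the final state is consistent and nothing races or crashes under a sanitizer.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stress-test sketch: hammers a shared_mutex-protected map
// the way concurrent register/has_converter calls would. Returns the
// final registry size, which should equal the number of writer threads.
int run_stress(int readers, int writers, int iterations) {
  std::shared_mutex mutex;
  std::map<std::string, int> registry;

  std::vector<std::thread> threads;
  for (int w = 0; w < writers; ++w) {
    threads.emplace_back([&, w] {
      for (int i = 0; i < iterations; ++i) {
        std::unique_lock lock(mutex);  // exclusive, like registration
        registry["conv" + std::to_string(w)] = i;
      }
    });
  }
  for (int r = 0; r < readers; ++r) {
    threads.emplace_back([&] {
      for (int i = 0; i < iterations; ++i) {
        std::shared_lock lock(mutex);  // shared, like has_converter
        (void)registry.count("conv0");
      }
    });
  }
  for (auto& t : threads) t.join();
  return static_cast<int>(registry.size());
}
```

Running such a test under ThreadSanitizer would catch any lock-discipline mistakes that a plain pass/fail assertion misses.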

🤖 Created by Claude

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Use shared_mutex in converter registry so concurrent convert/has_converter
  calls no longer serialize on an exclusive lock
- Remove redundant stream.synchronize() before cudf::pack in cross-GPU converter
- Eliminate unnecessary metadata vector copy in packed HOST→GPU converter by
  reading source metadata directly (source outlives the unpack call)
- Remove redundant stream.synchronize() before cudf::unpack in packed HOST→GPU
  converter — stream ordering between H2D copies and table construction is
  guaranteed since both use the same stream
- Skip cudaGetDevice/cudaSetDevice round-trip in single-GPU scenarios for both
  packed and fast HOST→GPU converters
- Move BatchCopyAccumulator definition earlier in the file for reuse
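The single-GPU fast path in the last bullet but one can be illustrated with mocked device calls. Everything here is hypothetical scaffolding: `get_device`/`set_device` stand in for `cudaGetDevice`/`cudaSetDevice`, and the call counter exists only so the sketch can show how many API round-trips each variant makes.

```cpp
// Mock device state standing in for the CUDA runtime (sketch only).
static int g_current_device = 0;
static int g_device_api_calls = 0;

int get_device() { ++g_device_api_calls; return g_current_device; }
void set_device(int d) { ++g_device_api_calls; g_current_device = d; }

// Before: always save and restore the device, even with one GPU
// (three runtime calls per conversion).
void convert_before(int target_device) {
  int previous = get_device();
  set_device(target_device);
  // ... enqueue H2D copies and table construction ...
  set_device(previous);
}

// After: with a single GPU the target device is necessarily already
// current, so the get/set round-trip is skipped entirely.
void convert_after(int target_device, int device_count) {
  if (device_count > 1) {
    int previous = get_device();
    set_device(target_device);
    // ... enqueue H2D copies and table construction ...
    set_device(previous);
  } else {
    // ... enqueue H2D copies on the only device; zero runtime calls ...
    (void)target_device;
  }
}
```

Per the benchmark numbers above, shaving a few runtime calls matters only for small transfers, where fixed per-call overhead dominates the copy itself.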

Benchmarked improvement on packed HOST→GPU path (back-to-back A/B, 5 reps):
  256K/1T: 0.056ms → 0.044ms (21% faster)
  1M/1T:   0.095ms → 0.083ms (13% faster)
  64K/4T:  0.092ms → 0.077ms (16% faster)
  4M/4T:   0.726ms → 0.682ms (6% faster)
Large transfers (512M+) are PCIe bandwidth-limited and unchanged.

Created by Claude.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@mbrobbel
Member Author

/ok to test b42aba2

