Faster shuffle batched random (based on: Nevin Brackett-Rozinsky, Daniel Lemire work)#6119
Conversation
Implements the algorithm from Brackett-Rozinsky & Lemire's paper "Batched Ranged Random Integer Generation" to reduce RNG calls in std::shuffle and ranges::shuffle for 64-bit URNGs like mt19937_64. The batched approach extracts multiple bounded random integers from a single 64-bit random word, using only multiplication (no division) in the common case. This reduces the number of RNG calls by approximately half for arrays with fewer than 2^32 elements. Resolves microsoft#5736
Move _Has_full_64bit_range check out of _Batched_rng_from_urng class to reduce template instantiation overhead. Use is_same_v and _Max_limit as suggested by reviewer.
…n/faster_shuffle_batched_random
Fix two bugs in the batched shuffle implementation from PR microsoft#5932: 1. `_Unsigned128` brace-initialization zeroed the high word of the multiplication result (the random index), making all indices 0 and shuffle deterministic. Replace with separate `uint64_t` variables for high and low words in both `_Single_bounded` and `_Batch_2`. 2. Loop advancement in `_Random_shuffle_batched` and `_Shuffle_unchecked_batched` had extra `++_UTarget` increments causing elements to be skipped. Restructure to `for` loops matching the original `_Random_shuffle1` pattern. Add regression test for microsoftGH-6112 and shuffle quality tests adapted from Lemire's cpp_batched_random test suite (uniformity, coverage, pair distribution at start/end positions). Co-Authored-By: Claude <noreply@anthropic.com>
74fda0f to
86c0c0a
Compare
I'm sorry, but this is not acceptable from a licensing perspective, and we have to be really strict about this. If his code isn't licensed under Apache 2 + LLVM (or Boost), we can't accept derived works from it. (And if it was, you'd still need to preserve his copyright.) |
This was his own suggestion after checking the issue that happened with the previous version of this PR. I've sent him this pull request so he can confirm (and maybe add a license to his batched random codebase). |
|
If the upstream codebase is updated to Apache 2 + LLVM Exception, and the derived work here retains his copyright notice, then that will be acceptable. We cannot accept the MIT license. |
Done. (I did suggest using my tests. Why not? ) |
|
Thanks Daniel! 😻 I know you're cool, we just have to be consistently strict about licensing and derived works, so that the codebase is never in question. I'll review this hopefully in the near future (still trying to get through my post- |
|
@StephanTLavavej I fully understand. In any case, the base library is fully available and I have verified that it provides significant betters (under Visual Studio): https://github.com/lemire/cpp_batched_random.git The reason why batching helps is also rather intuitive. :-) |
Align the batched random shuffle with Daniel Lemire's reference implementation (github.com/lemire/cpp_batched_random). The previous version only used batch-of-2 and did not use the threshold constants in loop conditions. This adopts the full cascade (batch 1-6) with backward Fisher-Yates iteration, matching Lemire's design exactly.
|
Here are some numbers from a quick benchmark on compiler-explorer, using the current implementation. Speed-up is consistent across different sizes. I was missing a few constants from Daniel's original implementation, but that is solved now. |
|
Here are the speedups we are talking about... So it is considerable. More than 5x. Even accounting for the unavoidable variations and margins of errors, this is quite good. |
There was a problem hiding this comment.
Pull request overview
Reintroduces and fixes the batched random integer generation optimization for std::shuffle / std::ranges::shuffle (based on Brackett‑Rozinsky & Lemire), addressing the regressions described in #6112 and adding regression/quality tests to prevent recurrence.
Changes:
- Add a batched shuffle implementation for full-range 64-bit URNGs and dispatch to it from
std::shuffleandstd::ranges::shuffle. - Implement a
_Batched_rng_from_urnghelper that extracts multiple bounded indices from one 64-bit random word using 128-bit multiplication and rejection sampling. - Add comprehensive regression and distribution-quality tests for both
std::shuffleandranges::shufflebatched paths.
Show a summary per file
| File | Description |
|---|---|
stl/inc/algorithm |
Adds batched RNG + batched shuffle implementations and dispatches to them for full-range 64-bit URNGs in both std::shuffle and std::ranges::shuffle. |
tests/std/tests/P0896R4_ranges_alg_shuffle/test.cpp |
Adds regression and statistical/coverage tests that exercise both the batched (mt19937_64) and non-batched paths. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (1)
stl/inc/algorithm:6724
- Same self-swap issue as std::shuffle’s batched path: iter_swap is unconditionally invoked even when the selected offset refers to the same element being placed. The non-batched implementation skips this to avoid self-move-assignment. Add a guard to avoid calling iter_swap when the two iterators are equal (and apply consistently to the other iter_swap calls in this batched loop).
for (; _Remaining > (uint64_t{1} << 30); --_Remaining) {
_BatchedRng._Batch(_Offsets, _Remaining, 1, _Remaining);
_RANGES iter_swap(
_First + static_cast<_Diff>(_Remaining - 1), _First + static_cast<_Diff>(_Offsets[0]));
}
- Files reviewed: 2/2 changed files
- Comments generated: 1
Addresses Copilot review on PR microsoft#6119: the batched shuffle paths unconditionally invoked swap()/iter_swap() even when the offset selected the same element, which can call self-move-assignment on user-defined types. The non-batched _Random_shuffle1 already guards against this; mirror that for the batched _Random_shuffle_batched and _Shuffle_unchecked_batched.
Summary
Fixes 2 bugs that caused
shuffle()to be critically damaged by #5932 (reverted in #6113) and adds comprehensive regression tests. Resolves #6112Bug 1: _Unsigned128 initialization overwrites the random index
The core of the algorithm proposed in https://arxiv.org/abs/2408.06213 is the 128-bit multiplication
bound * random_word, which produces a high word (the random index) and a low word (the leftover for rejection sampling). The reverted code used a self-referential initialization pattern to capture both words:_Unsigned128 _Product{ _Base128::_UMul128(random_word, bound, _Product._Word[1])}; auto _Leftover = _Product._Word[0]; // ... use _Product._Word[1] as the random index ..._UMul128returns the low 64 bits and writes the high 64 bits (the random index) to the out-parameter_Product._Word[1]. But the_Base128constructor(__msvc_int128.hpp:284)uses aggregate initialization:constexpr _Base128(const _Ty _Val) noexcept : _Word{static_cast<uint64_t>(_Val)} {}_Word{val}initializes_Word[0] = valand zero-initializes_Word[1], overwriting the high word that_UMul128just wrote. The random index was always 0, every element got swapped with position 0, making theshuffle()behavior deterministic.Repro here: https://godbolt.org/z/4YojEf3zf
Fix
Replace the _Unsigned128 pattern with separate uint64_t variables:
Bug 2: Loop double-advancement skips elements
The batched shuffle loops advanced the iterator at both the top and bottom of each iteration:
The original non_batched
_Random_shuffle1uses a single-advanceforloop:for (; ++_UTarget != _ULast; ++_Target_index) { ... }Fix
Restructure both _Random_shuffle_batched and _Shuffle_unchecked_batched to match the original for loop pattern. The batch-of-2 path does one extra ++_UTarget / ++_Target_index inside its body (for the second element), then continues to let the for header handle the next advance.
New tests added
<algorithm>:shuffle()was critically damaged by the attempt to implement Batched Ranged Random Integer Generation #6112: two consecutive shuffles with the same (non-reseeded) mt19937_64 must produce different results. Tests both std::shuffle and ranges::shuffle.I shamelessly copied the last 4 tests from Daniel's batched_random repo test suite