Faster shuffle batched random (based on: Nevin Brackett-Rozinsky, Daniel Lemire work) by FranciscoThiesen · Pull Request #6119 · microsoft/STL

FranciscoThiesen · 2026-02-28T01:03:16Z

Summary

Fixes 2 bugs that caused shuffle() to be critically damaged by #5932 (reverted in #6113) and adds comprehensive regression tests. Resolves #6112

Bug 1: _Unsigned128 initialization overwrites the random index

The core of the algorithm proposed in https://arxiv.org/abs/2408.06213 is the 128-bit multiplication bound * random_word, which produces a high word (the random index) and a low word (the leftover for rejection sampling). The reverted code used a self-referential initialization pattern to capture both words:

_Unsigned128 _Product{
    _Base128::_UMul128(random_word, bound, _Product._Word[1])};
auto _Leftover = _Product._Word[0];
// ... use _Product._Word[1] as the random index ...

_UMul128 returns the low 64 bits and writes the high 64 bits (the random index) to the out-parameter _Product._Word[1]. But the _Base128 constructor (__msvc_int128.hpp:284) uses aggregate initialization:

constexpr _Base128(const _Ty _Val) noexcept : _Word{static_cast<uint64_t>(_Val)} {}

_Word{val} initializes _Word[0] = val and zero-initializes _Word[1], overwriting the high word that _UMul128 just wrote. The random index was always 0, every element got swapped with position 0, making the shuffle() behavior deterministic.

Repro here: https://godbolt.org/z/4YojEf3zf

Fix

Replace the _Unsigned128 pattern with separate uint64_t variables:

uint64_t _High;
uint64_t _Leftover = _Base128::_UMul128(random_word, bound, _High);
// _High is now the correct random index

Bug 2: Loop double-advancement skips elements

The batched shuffle loops advanced the iterator at both the top and bottom of each iteration:

while (_UTarget != _ULast) {
    ++_UTarget;                    // advance here...
    if (_UTarget == _ULast) break;
    // ... swap logic (batch or single) ...
    ++_UTarget;                    // ...and again here
    ++_Target_index;
}

The original non_batched _Random_shuffle1 uses a single-advance for loop:

for (; ++_UTarget != _ULast; ++_Target_index) { ... }

Fix

Restructure both _Random_shuffle_batched and _Shuffle_unchecked_batched to match the original for loop pattern. The batch-of-2 path does one extra ++_UTarget / ++_Target_index inside its body (for the second element), then continues to let the for header handle the next advance.

New tests added

test_gh_6112 — Regression test from <algorithm>: shuffle() was critically damaged by the attempt to implement Batched Ranged Random Integer Generation #6112: two consecutive shuffles with the same (non-reseeded) mt19937_64 must produce different results. Tests both std::shuffle and ranges::shuffle.
test_everyone_can_move_everywhere — Every value must appear at every position over 4096 trials (array size 64). Tests both std::shuffle and ranges::shuffle.
test_uniformity — Distribution of values across positions must be roughly uniform (relative gap < 0.6, 16384 trials, size 32).
test_any_possible_pair_at_start — Every ordered pair (i, j) with i != j must appear at positions [0], [1].
test_any_possible_pair_at_end — Every ordered pair must appear at positions [size-2], [size-1].

I shamelessly copied the last 4 tests from Daniel's batched_random repo test suite

Implements the algorithm from Brackett-Rozinsky & Lemire's paper "Batched Ranged Random Integer Generation" to reduce RNG calls in std::shuffle and ranges::shuffle for 64-bit URNGs like mt19937_64. The batched approach extracts multiple bounded random integers from a single 64-bit random word, using only multiplication (no division) in the common case. This reduces the number of RNG calls by approximately half for arrays with fewer than 2^32 elements. Resolves microsoft#5736

Move _Has_full_64bit_range check out of _Batched_rng_from_urng class to reduce template instantiation overhead. Use is_same_v and _Max_limit as suggested by reviewer.

…n/faster_shuffle_batched_random

Fix two bugs in the batched shuffle implementation from PR microsoft#5932: 1. `_Unsigned128` brace-initialization zeroed the high word of the multiplication result (the random index), making all indices 0 and shuffle deterministic. Replace with separate `uint64_t` variables for high and low words in both `_Single_bounded` and `_Batch_2`. 2. Loop advancement in `_Random_shuffle_batched` and `_Shuffle_unchecked_batched` had extra `++_UTarget` increments causing elements to be skipped. Restructure to `for` loops matching the original `_Random_shuffle1` pattern. Add regression test for microsoftGH-6112 and shuffle quality tests adapted from Lemire's cpp_batched_random test suite (uniformity, coverage, pair distribution at start/end positions). Co-Authored-By: Claude <noreply@anthropic.com>

StephanTLavavej · 2026-02-28T01:56:51Z

I shamelessly copied the last 4 tests from Daniel's batched_random repo test suite

I'm sorry, but this is not acceptable from a licensing perspective, and we have to be really strict about this. If his code isn't licensed under Apache 2 + LLVM (or Boost), we can't accept derived works from it. (And if it was, you'd still need to preserve his copyright.)

FranciscoThiesen · 2026-02-28T04:04:29Z

I shamelessly copied the last 4 tests from Daniel's batched_random repo test suite

I'm sorry, but this is not acceptable from a licensing perspective, and we have to be really strict about this. If his code isn't licensed under Apache 2 + LLVM (or Boost), we can't accept derived works from it. (And if it was, you'd still need to preserve his copyright.)

This was his own suggestion after checking the issue that happened with the previous version of this PR. I've sent him this pull request so he can confirm (and maybe add a license to his batched random codebase).

StephanTLavavej · 2026-02-28T12:08:11Z

If the upstream codebase is updated to Apache 2 + LLVM Exception, and the derived work here retains his copyright notice, then that will be acceptable. We cannot accept the MIT license.

lemire · 2026-02-28T13:40:03Z

@StephanTLavavej

If the upstream codebase is updated to Apache 2 + LLVM Exception,

Done.

(I did suggest using my tests. Why not? )

StephanTLavavej · 2026-02-28T14:03:37Z

Thanks Daniel! 😻 I know you're cool, we just have to be consistently strict about licensing and derived works, so that the codebase is never in question.

I'll review this hopefully in the near future (still trying to get through my post-<flat_meow> backlog).

lemire · 2026-02-28T19:57:21Z

@StephanTLavavej I fully understand. In any case, the base library is fully available and I have verified that it provides significant betters (under Visual Studio): https://github.com/lemire/cpp_batched_random.git The reason why batching helps is also rather intuitive. :-)

Align the batched random shuffle with Daniel Lemire's reference implementation (github.com/lemire/cpp_batched_random). The previous version only used batch-of-2 and did not use the threshold constants in loop conditions. This adopts the full cascade (batch 1-6) with backward Fisher-Yates iteration, matching Lemire's design exactly.

FranciscoThiesen · 2026-03-02T17:37:04Z

Here are some numbers from a quick benchmark on compiler-explorer, using the current implementation. Speed-up is consistent across different sizes.

I was missing a few constants from Daniel's original implementation, but that is solved now.

@lemire @StephanTLavavej

lemire · 2026-03-02T19:18:56Z

Here are the speedups we are talking about...

Size 256 (1953 reps):
  std::shuffle:     5.6775 ms
  batched::shuffle: 1.0903 ms
  Speedup: 5.20728x

Size 1024 (488 reps):
  std::shuffle:     5.5502 ms
  batched::shuffle: 0.8665 ms
  Speedup: 6.40531x

Size 4096 (122 reps):
  std::shuffle:     5.4195 ms
  batched::shuffle: 0.7209 ms
  Speedup: 7.51769x

Size 16384 (30 reps):
  std::shuffle:     5.3811 ms
  batched::shuffle: 0.7493 ms
  Speedup: 7.1815x

Size 65536 (7 reps):
  std::shuffle:     4.897 ms
  batched::shuffle: 0.8032 ms
  Speedup: 6.09686x

So it is considerable. More than 5x. Even accounting for the unavoidable variations and margins of errors, this is quite good.

…d_random

Copilot

Pull request overview

Reintroduces and fixes the batched random integer generation optimization for std::shuffle / std::ranges::shuffle (based on Brackett‑Rozinsky & Lemire), addressing the regressions described in #6112 and adding regression/quality tests to prevent recurrence.

Changes:

Add a batched shuffle implementation for full-range 64-bit URNGs and dispatch to it from std::shuffle and std::ranges::shuffle.
Implement a _Batched_rng_from_urng helper that extracts multiple bounded indices from one 64-bit random word using 128-bit multiplication and rejection sampling.
Add comprehensive regression and distribution-quality tests for both std::shuffle and ranges::shuffle batched paths.

Show a summary per file

File	Description
`stl/inc/algorithm`	Adds batched RNG + batched shuffle implementations and dispatches to them for full-range 64-bit URNGs in both `std::shuffle` and `std::ranges::shuffle`.
`tests/std/tests/P0896R4_ranges_alg_shuffle/test.cpp`	Adds regression and statistical/coverage tests that exercise both the batched (mt19937_64) and non-batched paths.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

stl/inc/algorithm:6724

Same self-swap issue as std::shuffle’s batched path: iter_swap is unconditionally invoked even when the selected offset refers to the same element being placed. The non-batched implementation skips this to avoid self-move-assignment. Add a guard to avoid calling iter_swap when the two iterators are equal (and apply consistently to the other iter_swap calls in this batched loop).

            for (; _Remaining > (uint64_t{1} << 30); --_Remaining) {
                _BatchedRng._Batch(_Offsets, _Remaining, 1, _Remaining);
                _RANGES iter_swap(
                    _First + static_cast<_Diff>(_Remaining - 1), _First + static_cast<_Diff>(_Offsets[0]));
            }

Files reviewed: 2/2 changed files
Comments generated: 1

Addresses Copilot review on PR microsoft#6119: the batched shuffle paths unconditionally invoked swap()/iter_swap() even when the offset selected the same element, which can call self-move-assignment on user-defined types. The non-batched _Random_shuffle1 already guards against this; mirror that for the batched _Random_shuffle_batched and _Shuffle_unchecked_batched.

FranciscoThiesen added 4 commits December 2, 2025 13:23

Address review feedback: extract URNG range check to variable template

3df8856

Move _Has_full_64bit_range check out of _Batched_rng_from_urng class to reduce template instantiation overhead. Use is_same_v and _Max_limit as suggested by reviewer.

Apply clang-format

8aa184a

Merge remote-tracking branch 'origin/main' into FranciscoGeimanThiese…

9f73ca6

…n/faster_shuffle_batched_random

FranciscoThiesen requested a review from a team as a code owner February 28, 2026 01:03

github-project-automation Bot added this to STL Code Reviews Feb 28, 2026

github-project-automation Bot moved this to Initial Review in STL Code Reviews Feb 28, 2026

FranciscoThiesen force-pushed the FranciscoGeimanThiesen/faster_shuffle_batched_random branch from 74fda0f to 86c0c0a Compare February 28, 2026 01:17

FranciscoThiesen changed the title ~~Francisco geiman thiesen/faster shuffle batched random~~ Faster shuffle batched random (based on: Nevin Brackett-Rozinsky, Daniel Lemire work) Feb 28, 2026

StephanTLavavej added the performance Must go faster label Feb 28, 2026

FranciscoThiesen added 2 commits March 18, 2026 15:11

Merge branch 'main' into FranciscoGeimanThiesen/faster_shuffle_batche…

973665d

…d_random

fixing merge

4c3d262

lemire mentioned this pull request Apr 16, 2026

<algorithm>: Investigate further optimizations for shuffle() and sample() #5736

Open

StephanTLavavej requested a review from Copilot April 29, 2026 06:36

Copilot started reviewing on behalf of StephanTLavavej April 29, 2026 06:37 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Comment thread stl/inc/algorithm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster shuffle batched random (based on: Nevin Brackett-Rozinsky, Daniel Lemire work)#6119

Faster shuffle batched random (based on: Nevin Brackett-Rozinsky, Daniel Lemire work)#6119
FranciscoThiesen wants to merge 9 commits into
microsoft:mainfrom
FranciscoThiesen:FranciscoGeimanThiesen/faster_shuffle_batched_random

FranciscoThiesen commented Feb 28, 2026

Uh oh!

StephanTLavavej commented Feb 28, 2026

Uh oh!

FranciscoThiesen commented Feb 28, 2026

Uh oh!

StephanTLavavej commented Feb 28, 2026

Uh oh!

lemire commented Feb 28, 2026

Uh oh!

StephanTLavavej commented Feb 28, 2026

Uh oh!

lemire commented Feb 28, 2026

Uh oh!

FranciscoThiesen commented Mar 2, 2026

Uh oh!

lemire commented Mar 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

FranciscoThiesen commented Feb 28, 2026

Summary

Bug 1: _Unsigned128 initialization overwrites the random index

Fix

Bug 2: Loop double-advancement skips elements

Fix

New tests added

Uh oh!

StephanTLavavej commented Feb 28, 2026

Uh oh!

FranciscoThiesen commented Feb 28, 2026

Uh oh!

StephanTLavavej commented Feb 28, 2026

Uh oh!

lemire commented Feb 28, 2026

Uh oh!

StephanTLavavej commented Feb 28, 2026

Uh oh!

lemire commented Feb 28, 2026

Uh oh!

FranciscoThiesen commented Mar 2, 2026

Uh oh!

lemire commented Mar 2, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants