Add SVE implementation of `replace` by hazzlim · Pull Request #6195 · microsoft/STL

hazzlim · 2026-03-31T22:18:44Z

This PR adds an SVE implementation of replace. This algorithm was previously not vectorized using Neon, due to the absence of masked stores in the instruction set. See #4433 for why this is an issue.
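To illustrate why masked stores matter here, the following is a minimal sketch of a predicated load/compare/store loop for 1-byte elements. It is not the code from this PR: `replace_u8_sketch` is a hypothetical name, the SVE path is only compiled when the compiler defines `__ARM_FEATURE_SVE`, and a plain scalar loop stands in everywhere else.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical helper for illustration only; not the PR's __std_replace_1.
void replace_u8_sketch(std::uint8_t* first, std::uint8_t* last, std::uint8_t old_val, std::uint8_t new_val) {
    assert(first <= last);
    const std::size_t n = static_cast<std::size_t>(last - first);
#if defined(__ARM_FEATURE_SVE)
    const svuint8_t new_vec = svdup_n_u8(new_val);
    for (std::size_t i = 0; i < n; i += svcntb()) {
        // Predicate covering lanes [i, n): inactive lanes past the end are never touched.
        const svbool_t in_range =
            svwhilelt_b8(static_cast<std::uint64_t>(i), static_cast<std::uint64_t>(n));
        const svuint8_t data = svld1_u8(in_range, first + i);
        // Lanes that are in range and equal to old_val.
        const svbool_t match = svcmpeq_n_u8(in_range, data, old_val);
        // Masked store: only matching lanes are written back.
        svst1_u8(match, first + i, new_vec);
    }
#else
    // Scalar fallback so the sketch also builds without SVE.
    for (std::size_t i = 0; i < n; ++i) {
        if (first[i] == old_val) {
            first[i] = new_val;
        }
    }
#endif
}
```

Without a masked store, a vector implementation has to blend the old and new values and write every lane back, including lanes that did not match, which is the concern referenced above.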

Benchmark results ⏲️

Results are speedup values relative to the existing C code as a baseline - higher is better. Benchmark results were obtained running on a Neoverse N2 machine.

Benchmark	MSVC Speedup	Clang Speedup
`r<std::uint8_t>`	17.03	7.024
`r<std::uint16_t>`	10.17	3.767
`r<std::uint32_t>`	4.592	2.109
`r<std::uint64_t>`	2.475	1.23

StephanTLavavej · 2026-04-16T05:12:42Z

Turns out I can't merge this until the MSVC-internal checked-in compiler is updated. The current 14.50 compiler can't understand the new <arm_sve.h> and ICEs quite horribly, and I don't see any possible workaround. (vector_algorithms.cpp has to be built by the checked-in compiler.) The good news is that we should only have to wait a month.

Copilot

Pull request overview

Adds an SVE-backed implementation of std::replace for ARM64/ARM64EC, enabling vectorization for smaller element sizes where masked stores are available on SVE.

Changes:

Enable _VECTORIZED_REPLACE on ARM64/ARM64EC and add 1- and 2-byte replace entry points.
Implement SVE-based masked-load/compare/masked-store __std_replace_{1,2,4,8} in vector_algorithms.cpp.
Extend replace benchmarks to include uint8_t and uint16_t.

File	Description
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp	Adjusts which vector algorithm tests are run under the ARM64EC “call all x64” configuration.
stl/src/vector_algorithms.cpp	Adds SVE include and introduces SVE-based `replace` implementations for 1/2/4/8 byte elements on ARM64/ARM64EC.
stl/inc/xutility	Enables replace vectorization for ARM64/ARM64EC and introduces `_VECTORIZED_REPLACE_1_2`.
stl/inc/algorithm	Declares new `__std_replace_1/2` and updates dispatch/safety logic to allow 1/2-byte vectorized replace on ARM.
benchmarks/src/replace.cpp	Adds `replace` benchmarks for `uint8_t` and `uint16_t`.

Copilot's findings

Files reviewed: 5/5 changed files
Comments generated: 2

Copilot · 2026-04-29T19:36:41Z

 #if _VECTORIZED_REPLACE
+#if _VECTORIZED_REPLACE_1_2
+template <class _Iter>
+constexpr bool _Have_masked_op_for_iter = true;
+#else // ^^^ _VECTORIZED_REPLACE_1_2 / !_VECTORIZED_REPLACE_1_2 vvv
+template <class _Iter>
+constexpr bool _Have_masked_op_for_iter = sizeof(_Iter_value_t<_Iter>) >= 4; // avx masked op compatible size
+#endif // ^^^ !_VECTORIZED_REPLACE_1_2 ^^^


This change enables vectorized replace for 1- and 2-byte element types on ARM64/ARM64EC (_VECTORIZED_REPLACE_1_2 / _Have_masked_op_for_iter = true). The existing vector-algorithm tests for replace() appear to only exercise the vectorized path when sizeof(T) >= 4, so the newly-enabled 1/2-byte vectorized code path may not be covered. Please extend/update the vector algorithm tests to validate replace() for char/unsigned char/short (and update any test assumptions/comments about only 4/8-byte vectorization).
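One way the requested coverage could look is sketched below, assuming a simple differential check against a scalar loop. `replace_matches_scalar` is a hypothetical helper, not part of the existing VSO_0000000_vector_algorithms test, and the values and lengths are arbitrary choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical test helper: compares std::replace with a plain scalar loop
// for every length in [0, max_len], so full vectors and tails are both hit.
template <class T>
bool replace_matches_scalar(const std::size_t max_len) {
    for (std::size_t len = 0; len <= max_len; ++len) {
        std::vector<T> input(len);
        for (std::size_t i = 0; i < len; ++i) {
            input[i] = static_cast<T>(i % 3); // values 0, 1, 2 so the target recurs often
        }

        // Reference result from a loop that cannot be vectorized by the library.
        std::vector<T> expected = input;
        for (T& elem : expected) {
            if (elem == T{1}) {
                elem = T{7};
            }
        }

        std::vector<T> actual = input;
        std::replace(actual.begin(), actual.end(), T{1}, T{7});

        if (actual != expected) {
            return false;
        }
    }
    return true;
}
```

Calling it as `replace_matches_scalar<char>(300)`, `replace_matches_scalar<unsigned char>(300)`, and `replace_matches_scalar<short>(300)` would cover the element types named in the comment.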

Copilot · 2026-04-29T19:36:42Z

+                const size_t _Sve_vl        = svcntb();
+                const size_t _Size_bytes    = _Byte_length(_First, _Last);
+                const size_t _Full_vl_bytes = _Size_bytes & ~size_t{_Sve_vl - 1};
+


svcntb() returns the SVE vector length in bytes, which is only guaranteed to be a multiple of 16 (not necessarily a power of two). Computing _Full_vl_bytes with & ~(_Sve_vl - 1) and _Tail_length with & (_Sve_vl - 1) is therefore incorrect for VL values like 48, and can make _Stop_at not aligned to the loop step, leading to an infinite loop and/or out-of-bounds accesses. Use modulo/division instead (e.g., _Full_vl_bytes = _Size_bytes - (_Size_bytes % _Sve_vl) and _Tail_length = _Size_bytes % _Sve_vl).

hazzlim · 2026-04-30

The SVE vector length is (now) architecturally defined to be a power of two in the range [128, 2048] bits:

"The architecturally defined SVL set is all powers of two from 128 to 2048 bits inclusive" (ARM ARM B1.4.2)
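The arithmetic behind this exchange can be checked in isolation. The sketch below uses hypothetical helper names (`round_down_mask`, `round_down_mod`, `mask_and_mod_agree`) and plain integers rather than `svcntb()`. It shows that the mask form and the modulo form agree for every power-of-two vector length from 16 to 256 bytes (128 to 2048 bits), and that they would disagree only for a length such as 48 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Round size_bytes down to a multiple of vl using a bit mask.
// Only correct when vl is a power of two.
constexpr std::size_t round_down_mask(const std::size_t size_bytes, const std::size_t vl) {
    return size_bytes & ~(vl - 1);
}

// Round size_bytes down to a multiple of vl using modulo; correct for any nonzero vl.
constexpr std::size_t round_down_mod(const std::size_t size_bytes, const std::size_t vl) {
    return size_bytes - size_bytes % vl;
}

// Power-of-two vector length: both forms give the same result.
static_assert(round_down_mask(100, 16) == 96 && round_down_mod(100, 16) == 96);

// A 48-byte length (not a power of two) is where the mask form breaks:
// it yields 64, which is not even a multiple of 48.
static_assert(round_down_mask(100, 48) == 64 && round_down_mod(100, 48) == 96);

// Exhaustive check over a range of sizes for one vector length.
constexpr bool mask_and_mod_agree(const std::size_t vl) {
    for (std::size_t size = 0; size <= 4096; ++size) {
        if (round_down_mask(size, vl) != round_down_mod(size, vl)) {
            return false;
        }
    }
    return true;
}
```

Given the power-of-two guarantee quoted above, the mask form in the PR and the modulo form suggested by the review compute the same values.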

hazzlim added 2 commits March 31, 2026 13:26

Add replace benchmarks for all types

ae6c3ed

Add SVE implementation of replace

f3b8b1c

hazzlim requested a review from a team as a code owner March 31, 2026 22:18

github-project-automation Bot added this to STL Code Reviews Mar 31, 2026

github-project-automation Bot moved this to Initial Review in STL Code Reviews Mar 31, 2026

AlexGuteniev reviewed Mar 31, 2026

Comment thread stl/inc/algorithm Outdated

StephanTLavavej added the performance (Must go faster), ARM64 (Related to the ARM64 architecture), and ARM64EC (I can't believe it's not x64!) labels Mar 31, 2026

StephanTLavavej and others added 4 commits April 2, 2026 09:28

Merge branch 'main' into replace-sve-pr

37a368b

Add braces, fix endif comments, clang-format.

7437ad0

Further reduce test coverage for ARM64EC fallbacks.

2adfc55

Add _VECTORIZED_REPLACE_1_2 macro

9c45110

StephanTLavavej self-assigned this Apr 2, 2026

StephanTLavavej added 2 commits April 3, 2026 08:03

Remove const to match declarations.

9b3bdd5

Add const.

ae5d1f3

StephanTLavavej reviewed Apr 3, 2026

Comment thread stl/src/vector_algorithms.cpp Outdated

Comment thread stl/src/vector_algorithms.cpp Outdated

Comment thread stl/src/vector_algorithms.cpp Outdated

StephanTLavavej approved these changes Apr 3, 2026

StephanTLavavej removed their assignment Apr 3, 2026

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 3, 2026

StephanTLavavej mentioned this pull request Apr 7, 2026

Extend vectorized algorithms to ARM64/ARM64EC #813

Closed

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 15, 2026

This comment was marked as outdated.

StephanTLavavej moved this from Merging to Blocked in STL Code Reviews Apr 16, 2026

StephanTLavavej added the blocked (Something is preventing work on this) label Apr 16, 2026

StephanTLavavej requested a review from Copilot April 29, 2026 19:18

Copilot started reviewing on behalf of StephanTLavavej April 29, 2026 19:19

Copilot AI reviewed Apr 29, 2026

hazzlim wants to merge 8 commits into microsoft:main from hazzlim:replace-sve-pr
