Skip to content

Add SVE implementation of replace#6195

Open
hazzlim wants to merge 8 commits into
microsoft:mainfrom
hazzlim:replace-sve-pr
Open

Add SVE implementation of replace#6195
hazzlim wants to merge 8 commits into
microsoft:mainfrom
hazzlim:replace-sve-pr

Conversation

@hazzlim
Copy link
Copy Markdown
Contributor

@hazzlim hazzlim commented Mar 31, 2026

This PR adds an SVE implementation of replace. This algorithm was previously not vectorized using Neon, due to the absence of masked stores in the instruction set. See #4433 for why this is an issue.

Benchmark results ⏲️

Results are speedup values relative to the existing C code as a baseline - higher is better. Benchmark results were obtained running on a Neoverse N2 machine.

  MSVC Speedup Clang Speedup
r<std::uint8_t> 17.03 7.024
r<std::uint16_t> 10.17 3.767
r<std::uint32_t> 4.592 2.109
r<std::uint64_t> 2.475 1.23

@hazzlim hazzlim requested a review from a team as a code owner March 31, 2026 22:18
@github-project-automation github-project-automation Bot moved this to Initial Review in STL Code Reviews Mar 31, 2026
Comment thread stl/inc/algorithm Outdated
@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture ARM64EC I can't believe it's not x64! labels Mar 31, 2026
@StephanTLavavej StephanTLavavej self-assigned this Apr 2, 2026
Comment thread stl/src/vector_algorithms.cpp Outdated
Comment thread stl/src/vector_algorithms.cpp Outdated
Comment thread stl/src/vector_algorithms.cpp Outdated
@StephanTLavavej StephanTLavavej removed their assignment Apr 3, 2026
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 3, 2026
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 15, 2026
@StephanTLavavej

This comment was marked as outdated.

@StephanTLavavej StephanTLavavej moved this from Merging to Blocked in STL Code Reviews Apr 16, 2026
@StephanTLavavej StephanTLavavej added the blocked Something is preventing work on this label Apr 16, 2026
@StephanTLavavej
Copy link
Copy Markdown
Member

Turns out I can't merge this until the MSVC-internal checked-in compiler is updated. The current 14.50 compiler can't understand the new <arm_sve.h> and ICEs quite horribly, and I don't see any possible workaround. (vector_algorithms.cpp has to be built by the checked-in compiler.) The good news is that we should only have to wait a month.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an SVE-backed implementation of std::replace for ARM64/ARM64EC, enabling vectorization for smaller element sizes where masked stores are available on SVE.

Changes:

  • Enable _VECTORIZED_REPLACE on ARM64/ARM64EC and add 1- and 2-byte replace entry points.
  • Implement SVE-based masked-load/compare/masked-store __std_replace_{1,2,4,8} in vector_algorithms.cpp.
  • Extend replace benchmarks to include uint8_t and uint16_t.
Show a summary per file
File Description
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp Adjusts which vector algorithm tests are run under the ARM64EC “call all x64” configuration.
stl/src/vector_algorithms.cpp Adds SVE include and introduces SVE-based replace implementations for 1/2/4/8 byte elements on ARM64/ARM64EC.
stl/inc/xutility Enables replace vectorization for ARM64/ARM64EC and introduces _VECTORIZED_REPLACE_1_2.
stl/inc/algorithm Declares new __std_replace_1/2 and updates dispatch/safety logic to allow 1/2-byte vectorized replace on ARM.
benchmarks/src/replace.cpp Adds replace benchmarks for uint8_t and uint16_t.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 5/5 changed files
  • Comments generated: 2

Comment thread stl/inc/algorithm
Comment on lines 511 to +518
#if _VECTORIZED_REPLACE
#if _VECTORIZED_REPLACE_1_2
template <class _Iter>
constexpr bool _Have_masked_op_for_iter = true;
#else // ^^^ _VECTORIZED_REPLACE_1_2 / !_VECTORIZED_REPLACE_1_2 vvv
template <class _Iter>
constexpr bool _Have_masked_op_for_iter = sizeof(_Iter_value_t<_Iter>) >= 4; // avx masked op compatible size
#endif // ^^^ !_VECTORIZED_REPLACE_1_2 ^^^
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change enables vectorized replace for 1- and 2-byte element types on ARM64/ARM64EC (_VECTORIZED_REPLACE_1_2 / _Have_masked_op_for_iter = true). The existing vector-algorithm tests for replace() appear to only exercise the vectorized path when sizeof(T) >= 4, so the newly-enabled 1/2-byte vectorized code path may not be covered. Please extend/update the vector algorithm tests to validate replace() for char/unsigned char/short (and update any test assumptions/comments about only 4/8-byte vectorization).

Copilot uses AI. Check for mistakes.
Comment on lines +9688 to +9691
const size_t _Sve_vl = svcntb();
const size_t _Size_bytes = _Byte_length(_First, _Last);
const size_t _Full_vl_bytes = _Size_bytes & ~size_t{_Sve_vl - 1};

Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

svcntb() returns the SVE vector length in bytes, which is only guaranteed to be a multiple of 16 (not necessarily a power of two). Computing _Full_vl_bytes with & ~(_Sve_vl - 1) and _Tail_length with & (_Sve_vl - 1) is therefore incorrect for VL values like 48, and can make _Stop_at not aligned to the loop step, leading to an infinite loop and/or out-of-bounds accesses. Use modulo/division instead (e.g., _Full_vl_bytes = _Size_bytes - (_Size_bytes % _Sve_vl) and _Tail_length = _Size_bytes % _Sve_vl).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SVE vector length is (now) architecturally defined to be a power of 2 in range [128, 2048]

The architecturally defined SVL set is all powers of two from 128 to 2048 bits inclusive
(ARM ARM B1.4.2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ARM64EC I can't believe it's not x64! ARM64 Related to the ARM64 architecture blocked Something is preventing work on this performance Must go faster

Projects

Status: Blocked

Development

Successfully merging this pull request may close these issues.

4 participants