Add SVE implementation of replace#6195
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
|
Turns out I can't merge this until the MSVC-internal checked-in compiler is updated. The current 14.50 compiler can't understand the new |
There was a problem hiding this comment.
Pull request overview
Adds an SVE-backed implementation of std::replace for ARM64/ARM64EC, enabling vectorization for smaller element sizes where masked stores are available on SVE.
Changes:
- Enable
_VECTORIZED_REPLACEon ARM64/ARM64EC and add 1- and 2-bytereplaceentry points. - Implement SVE-based masked-load/compare/masked-store
__std_replace_{1,2,4,8}invector_algorithms.cpp. - Extend replace benchmarks to include
uint8_tanduint16_t.
Show a summary per file
| File | Description |
|---|---|
| tests/std/tests/VSO_0000000_vector_algorithms/test.cpp | Adjusts which vector algorithm tests are run under the ARM64EC “call all x64” configuration. |
| stl/src/vector_algorithms.cpp | Adds SVE include and introduces SVE-based replace implementations for 1/2/4/8 byte elements on ARM64/ARM64EC. |
| stl/inc/xutility | Enables replace vectorization for ARM64/ARM64EC and introduces _VECTORIZED_REPLACE_1_2. |
| stl/inc/algorithm | Declares new __std_replace_1/2 and updates dispatch/safety logic to allow 1/2-byte vectorized replace on ARM. |
| benchmarks/src/replace.cpp | Adds replace benchmarks for uint8_t and uint16_t. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 5/5 changed files
- Comments generated: 2
| #if _VECTORIZED_REPLACE | ||
| #if _VECTORIZED_REPLACE_1_2 | ||
| template <class _Iter> | ||
| constexpr bool _Have_masked_op_for_iter = true; | ||
| #else // ^^^ _VECTORIZED_REPLACE_1_2 / !_VECTORIZED_REPLACE_1_2 vvv | ||
| template <class _Iter> | ||
| constexpr bool _Have_masked_op_for_iter = sizeof(_Iter_value_t<_Iter>) >= 4; // avx masked op compatible size | ||
| #endif // ^^^ !_VECTORIZED_REPLACE_1_2 ^^^ |
There was a problem hiding this comment.
This change enables vectorized replace for 1- and 2-byte element types on ARM64/ARM64EC (_VECTORIZED_REPLACE_1_2 / _Have_masked_op_for_iter = true). The existing vector-algorithm tests for replace() appear to only exercise the vectorized path when sizeof(T) >= 4, so the newly-enabled 1/2-byte vectorized code path may not be covered. Please extend/update the vector algorithm tests to validate replace() for char/unsigned char/short (and update any test assumptions/comments about only 4/8-byte vectorization).
| const size_t _Sve_vl = svcntb(); | ||
| const size_t _Size_bytes = _Byte_length(_First, _Last); | ||
| const size_t _Full_vl_bytes = _Size_bytes & ~size_t{_Sve_vl - 1}; | ||
|
|
There was a problem hiding this comment.
svcntb() returns the SVE vector length in bytes, which is only guaranteed to be a multiple of 16 (not necessarily a power of two). Computing _Full_vl_bytes with & ~(_Sve_vl - 1) and _Tail_length with & (_Sve_vl - 1) is therefore incorrect for VL values like 48, and can make _Stop_at not aligned to the loop step, leading to an infinite loop and/or out-of-bounds accesses. Use modulo/division instead (e.g., _Full_vl_bytes = _Size_bytes - (_Size_bytes % _Sve_vl) and _Tail_length = _Size_bytes % _Sve_vl).
There was a problem hiding this comment.
SVE vector length is (now) architecturally defined to be a power of 2 in range [128, 2048]
The architecturally defined SVL set is all powers of two from 128 to 2048 bits inclusive
(ARM ARM B1.4.2)
This PR adds an SVE implementation of
replace. This algorithm was previously not vectorized using Neon, due to the absence of masked stores in the instruction set. See #4433 for why this is an issue.Benchmark results ⏲️
Results are speedup values relative to the existing C code as a baseline - higher is better. Benchmark results were obtained running on a Neoverse N2 machine.
r<std::uint8_t>r<std::uint16_t>r<std::uint32_t>r<std::uint64_t>