从SIMD向量中提取设置的字节位置 [英] Extract set bytes position from SIMD vector
问题描述
我使用SIMD指令运行一系列计算.这些指令返回一个16字节的向量,结果为compare
,每个字节为0x00
或0xff
:
I run a bench of computations using SIMD intructions. These instructions return a vector of 16 bytes as result, named compare
, with each byte being 0x00
or 0xff
:
0 1 2 3 4 5 6 7 15 16
compare : 0x00 0x00 0x00 0x00 0xff 0x00 0x00 0x00 ... 0xff 0x00
将字节设置为0xff
意味着我需要运行函数do_operation(i)
,而我是字节的位置.
Bytes set to 0xff
mean I need to run the function do_operation(i)
with i being the position of the byte.
例如,上面的compare
向量表示,我需要运行以下操作序列:
For instance, the above compare
vector mean, I need to run this sequence of operations :
do_operation(4);
do_operation(15);
这是到目前为止我想到的最快的解决方案:
Here is the fastest solution I came up with until now :
for(...) {
//
// SIMD computations
//
__m128i compare = ... // Result of SIMD computations
// Extract high and low quadwords for compare vector
std::uint64_t cmp_low = (_mm_cvtsi128_si64(compare));
std::uint64_t cmp_high = (_mm_extract_epi64(compare, 1));
// Process low quadword
if (cmp_low) {
const std::uint64_t low_possible_positions = 0x0706050403020100;
const std::uint64_t match_positions = _pext_u64(
low_possible_positions, cmp_low);
const int match_count = _popcnt64(cmp_low) / 8;
const std::uint8_t* match_pos_array =
reinterpret_cast<const std::uint8_t*>(&match_positions);
for (int i = 0; i < match_count; ++i) {
do_operation(i);
}
}
// Process high quadword (similarly)
if (cmp_high) {
const std::uint64_t high_possible_positions = 0x0f0e0d0c0b0a0908;
const std::uint64_t match_positions = _pext_u64(
high_possible_positions, cmp_high);
const int match_count = _popcnt64(cmp_high) / 8;
const std::uint8_t* match_pos_array =
reinterpret_cast<const std::uint8_t*>(&match_positions);
for(int i = 0; i < match_count; ++i) {
do_operation(i);
}
}
}
我首先提取128位向量的第一和第二64位整数(cmp_low
和cmp_high
).然后,我使用popcount
计算设置为0xff
的字节数(设置为1的位数除以8).最后,我使用pext
来获取没有零的位置,如下所示:
I start with extracting the first and second 64 bits integers of the 128 bits vector (cmp_low
and cmp_high
). Then I use popcount
to compute the number of bytes set to 0xff
(number of bits set to 1 divided by 8). Finally, I use pext
to get positions, without zeros, like this :
0x0706050403020100
0x000000ff00ff0000
|
PEXT
|
0x0000000000000402
我想找到一个更快的解决方案,以提取compare
向量中设置为0xff
的字节的位置.更准确地说,compare
向量中的0xff
通常仅将 0、1或2个字节设置为0xff
,我想使用此信息来避免出现某些分支.
I would like to find a faster solution to extract the positions of the bytes set to 0xff
in the compare
vector. More precisely, the are very often only 0, 1 or 2 bytes set to 0xff
in the compare
vector and I would like to use this information to avoid some branches.
推荐答案
以下是如何减少测试数量的简要概述:
Here's a quick outline of how you could reduce the number of tests:
-
首先使用一个函数将128bit整数的每个字节的所有lsb或msb投影为16bit值(例如,在X86 cpus上有一个 SSE2 汇编指令:
pmovmskb
,使用在Intel和MS编译器上受支持_mm_movemask_pi8
内部函数,而gcc也具有内部函数:__builtin_ia32_ppmovmskb128
,);
First use a function to project all the lsb or msb of each byte of your 128bit integer into a 16bit value (for instance, there's a SSE2 assembly instruction for that on X86 cpus:
pmovmskb
, which is supported on Intel and MS compilers with the_mm_movemask_pi8
intrinsic, and gcc has also an intrinsic:__builtin_ia32_ppmovmskb128
, );
然后将该值分成4个半字节;
Then split that value in 4 nibbles;
定义函数来处理半字节的每个可能值(从0到15)并将它们放入数组中;
define functions to handle each possible values of a nibble (from 0 to 15) and put these in an array;
最后调用每个半字节所索引的函数(带有额外的参数以指示16位中的哪个半字节).
Finally call the function indexed by each nibble (with extra parameters to indicate which nibble in the 16bits it is).
这篇关于从SIMD向量中提取设置的字节位置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!