将 32 位解包为 32 字节 SIMD 向量的最快方法 [英] Fastest way to unpack 32 bits to a 32 byte SIMD vector
问题描述
将 32 位存储在内存中的 uint32_t
中,将每一位解包到 AVX 寄存器的单独字节元素的最快方法是什么?这些位可以位于各自字节内的任何位置.
Having 32 bits stored in a uint32_t
in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte.
澄清一下,我的意思是位 0 到字节 0,位 1 到字节 1.显然字节中的所有其他位都为零.目前我能做的最好的是 2 PSHUFB
并且每个位置都有一个掩码寄存器.
to clarify, I mean bit 0 goes to byte 0, bit 1 to byte 1. Obviously all other bits within the byte on zero. Best I could at the moment is 2 PSHUFB
and having a mask register for each position.
如果uint32_t
是位图,那么对应的向量元素应该是0或非0.(也就是说,我们可以得到一个带有 vpcmpeqb
的向量掩码,针对全零向量).
If the uint32_t
is a bitmap, then the corresponding vector elements should be 0 or non-0. (i.e. so we could get a vector mask with a vpcmpeqb
against a vector of all-zero).
https://software.intel.com/en-us/forums/话题/283382
推荐答案
将 32 位整数 x
的 32 位广播"到 256 位 YMM 寄存器的 32 字节 z
或两个 128 位 XMM 寄存器 z_low
和 z_high
的 16 字节,您可以执行以下操作.
To "broadcast" the 32 bits of a 32-bit integer x
to 32 bytes of a 256-bit YMM register z
or 16 bytes of a two 128-bit XMM registers z_low
and z_high
you can do the following.
使用 AVX2:
__m256i y = _mm256_set1_epi32(x);
__m256i z = _mm256_shuffle_epi8(y,mask1);
z = _mm256_and_si256(z,mask2);
如果没有 AVX2,最好使用 SSE:
Without AVX2 it's best to do this with SSE:
__m128i y = _mm_set1_epi32(x);
__m128i z_low = _mm_shuffle_epi8(y,mask_low);
__m128i z_high = _mm_shuffle_epi8(y,mask_high);
z_low = _mm_and_si128(z_low ,mask2);
z_high = _mm_and_si128(z_high,mask2);
掩码和工作示例如下所示.如果您打算多次这样做,您可能应该在主循环之外定义掩码.
The masks and a working example are shown below. If you plan to do this several times you should probably define the masks outside of the main loop.
#include <immintrin.h>
#include <stdio.h>
int main() {
int x = 0x87654321;
static const char mask1a[32] = {
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x01, 0x01, 0x01, 0x01,
0x01, 0x01, 0x01, 0x01,
0x02, 0x02, 0x02, 0x02,
0x02, 0x02, 0x02, 0x02,
0x03, 0x03, 0x03, 0x03,
0x03, 0x03, 0x03, 0x03
};
static const char mask2a[32] = {
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
};
char out[32];
#if defined ( __AVX2__ )
__m256i mask2 = _mm256_loadu_si256((__m256i*)mask2a);
__m256i mask1 = _mm256_loadu_si256((__m256i*)mask1a);
__m256i y = _mm256_set1_epi32(x);
__m256i z = _mm256_shuffle_epi8(y,mask1);
z = _mm256_and_si256(z,mask2);
_mm256_storeu_si256((__m256i*)out,z);
#else
__m128i mask2 = _mm_loadu_si128((__m128i*)mask2a);
__m128i mask_low = _mm_loadu_si128((__m128i*)&mask1a[ 0]);
__m128i mask_high = _mm_loadu_si128((__m128i*)&mask1a[16]);
__m128i y = _mm_set1_epi32(x);
__m128i z_low = _mm_shuffle_epi8(y,mask_low);
__m128i z_high = _mm_shuffle_epi8(y,mask_high);
z_low = _mm_and_si128(z_low,mask2);
z_high = _mm_and_si128(z_high,mask2);
_mm_storeu_si128((__m128i*)&out[ 0],z_low);
_mm_storeu_si128((__m128i*)&out[16],z_high);
#endif
for(int i=0; i<8; i++) {
for(int j=0; j<4; j++) {
printf("%x ", out[4*i+j]);
}printf("
");
} printf("
");
}
<小时>
要在每个向量元素中获得 0 或 -1:
它需要一个额外的步骤 _mm256_cmpeq_epi8
针对全零.任何非零变为 0,零变为 -1.如果我们不想要这种反转,请使用 andnot
而不是 and
.它反转它的第一个操作数.
To get 0 or -1 in each vector element:
It takes one extra step _mm256_cmpeq_epi8
against all-zeros. Any non-zero turns into 0, and zero turns into -1. If we don't want this inversion, use andnot
instead of and
. It inverts its first operand.
__m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x); // we only use the low 32bits of each lane, but this is fine with AVX2
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);
// this is the extra step: compare each byte == 0 to produce 0 or -1
return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
// alternative: compare against the AND mask to get 0 or -1,
// avoiding the need for a vector zero constant.
}
另见是否有与 intel avx2 中的 movemask 指令相反的指令? 用于其他元素大小.
Also see is there an inverse instruction to the movemask instruction in intel avx2? for other element sizes.
这篇关于将 32 位解包为 32 字节 SIMD 向量的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!