Is there an inverse instruction to the movemask instruction in Intel AVX2?

Problem Description

The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8 or all 32 bits depending on the input vector element type) is the most significant bit of the corresponding vector element.

I would like to do the inverse: take an int32 (where only the 4, 8 or 32 least significant bits are meaningful), and get a __m256i where the most significant bit of each int8, int32 or int64 sized block is set to the original bit.

Basically, I want to go from a compressed bitmask to one that is usable as a mask by other AVX2 instructions (such as maskstore, maskload, mask_gather).

I couldn't quickly find an instruction that does it, so I am asking here. If there isn't one instruction with that functionality, is there a clever hack you can think of that achieves this in very few instructions?

My current method is to use a 256 element lookup table. I want to use this operation within a loop where not much else is happening, to speed it up. Note, I'm not too interested in long multi-instruction sequences or little loops that implement this operation.

Recommended Answer

There is no single instruction in AVX2 or earlier. (AVX512 can use masks in bitmap form directly, and has an instruction to expand masks to vectors).

  • 4 bits -> 4 qwords in a YMM register: this answer: a LUT is good, ALU also good
  • 8 bits -> 8 dwords in a YMM register: this answer (or this without AVX2). ALU.
  • 16 bits -> 16 words: this answer with vpbroadcastw / vpand / vpcmpeqw
  • 32 bits -> 32 bytes:
    How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
    Also Fastest way to unpack 32 bits to a 32 byte SIMD vector.
  • 8 bits -> 8 bytes or words without AVX2: How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD is pretty cheap, although an 8-bit or 16-bit broadcast of the mask without SSSE3 can cost multiple shuffles.

Note the trick of using _mm_min_epu8(v, _mm_set1_epi8(1)) instead of _mm_cmpeq_epi8 to get 0/1 instead of 0/FF.
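
A minimal illustration of that trick (unsigned min: min(0, 1) = 0, min(0xFF, 1) = 1):

#include <immintrin.h>

// Turn a 0 / 0xFF byte mask (e.g. a compare result) into 0 / 1 bytes.
__m128i mask_to_01(__m128i v) {
    return _mm_min_epu8(v, _mm_set1_epi8(1));
}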

16 bits -> 16 bytes with SSE2 or SSSE3, or AVX-512: Convert 16 bits mask to 16 bytes mask.
(Also BMI2 for unsigned __int128, pure C++ multiply bithack, and AVX-512 example for getting 0/1 instead of 0/-1)

8 bits -> 8 bytes: scalar multiply tricks are probably better if you only want 8 bits at a time: How to create a byte out of 8 bool values (and vice versa)?.

For your case, if you're loading the bitmap from memory, loading it straight into vector registers for an ALU strategy should work well even for 4-bit masks.

If you have the bitmap as a computation result, then it will be in an integer register where you can use it as a LUT index easily, so that's a good choice if you're aiming for 64-bit elements. Otherwise probably still go ALU for 32-bit elements or smaller, instead of a giant LUT or doing multiple chunks.

We'll have to wait for AVX-512's mask registers before cheap conversion from integer bitmasks to vector masks is possible. (With kmovw k1, r/m16, which compilers generate implicitly for int => __mmask16). There's an AVX512 insn to set a vector from a mask (VPMOVM2D zmm1, k1; _mm512_movm_epi8/16/32/64, with other versions for different element sizes), but you generally don't need it, since everything that used to use mask vectors now uses mask registers. Maybe if you want to count elements that meet some comparison condition? (where you'd use pcmpeqd / psubd to generate and accumulate a vector of 0 or -1 elements). But scalar popcnt on the mask results would be a better bet.
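
For example, once AVX-512VL and AVX-512DQ are available, the whole expansion is a single intrinsic. A minimal sketch (the wrapper name is mine):

#include <immintrin.h>

// Compile with AVX-512VL + AVX-512DQ enabled (e.g. -mavx512vl -mavx512dq).
__m256i mask8_to_dwords_avx512(__mmask8 m) {
    return _mm256_movm_epi32(m);   // vpmovm2d: each dword = 0 or -1 from its mask bit
}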

But note that vpmovm2d requires the mask to be in an AVX512 k0..7 mask register. Getting it there will take extra instructions unless it came from a vector compare result, and instructions that move into mask registers need a uop for port 5 on Intel Skylake-X and similar CPUs so this can be a bottleneck (especially if you do any shuffles). Especially if it starts in memory (loading a bitmap) and you only need the high bit of each element, you're probably still better off with a broadcast load + variable shift even if 256-bit and 512-bit AVX512 instructions are available.

Also possible (for a 0/1 result instead of 0/-1) is a zero-masking load from a constant like _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1)). https://godbolt.org/z/1sM8hY8Tj

For 64-bit elements, the mask only has 4 bits, so a lookup table is reasonable. You can compress the LUT by loading it with VPMOVSXBQ ymm1, xmm2/m32 (_mm256_cvtepi8_epi64). This gives a LUT size of (1<<4) = 16 entries * 4 bytes = 64B = one cache line. Unfortunately, pmovsx is inconvenient to use as a narrow load with intrinsics.
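
A hedged sketch of that compressed LUT with intrinsics (the table layout and names are mine; _mm_loadu_si32 needs a fairly recent compiler, otherwise memcpy 4 bytes into an int and use _mm_cvtsi32_si128):

#include <immintrin.h>
#include <stdint.h>

// 16 entries x 4 bytes: byte j of entry m is -1 if bit j of m is set.
static const int8_t qword_lut[16][4] = {
    { 0, 0, 0, 0}, {-1, 0, 0, 0}, { 0,-1, 0, 0}, {-1,-1, 0, 0},
    { 0, 0,-1, 0}, {-1, 0,-1, 0}, { 0,-1,-1, 0}, {-1,-1,-1, 0},
    { 0, 0, 0,-1}, {-1, 0, 0,-1}, { 0,-1, 0,-1}, {-1,-1, 0,-1},
    { 0, 0,-1,-1}, {-1, 0,-1,-1}, { 0,-1,-1,-1}, {-1,-1,-1,-1},
};

__m256i mask4_to_qwords_lut(unsigned m) {
    __m128i four_bytes = _mm_loadu_si32(qword_lut[m & 0xf]);  // 4-byte load the compiler can fold
    return _mm256_cvtepi8_epi64(four_bytes);                  // vpmovsxbq: sign-extend to 4 qwords
}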

Especially if you already have your bitmap in an integer register (instead of memory), a vpmovsxbq LUT should be excellent inside an inner loop for 64-bit elements. Or if instruction throughput or shuffle throughput is a bottleneck, use an uncompressed LUT. This can let you (or the compiler) use the mask vector as a memory operand for something else, instead of needing a separate instruction to load it.
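
For example, with an uncompressed 32-byte-per-entry table, the lookup can fold straight into vpand's memory operand (a sketch; the table and names are mine):

#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

// Qword j of entry n is -1 if bit j of n is set.
#define QROW(n) { (n)&1?-1LL:0, (n)&2?-1LL:0, (n)&4?-1LL:0, (n)&8?-1LL:0 }
static alignas(32) const int64_t qlut[16][4] = {
    QROW(0), QROW(1), QROW(2),  QROW(3),  QROW(4),  QROW(5),  QROW(6),  QROW(7),
    QROW(8), QROW(9), QROW(10), QROW(11), QROW(12), QROW(13), QROW(14), QROW(15),
};

// The aligned load typically folds into the AND: vpand ymm, ymm, [qlut + idx*32]
__m256i apply_qword_mask(__m256i v, unsigned m) {
    return _mm256_and_si256(v, _mm256_load_si256((const __m256i*)qlut[m & 0xf]));
}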

LUT for 32-bit elements: probably not optimal, but here's how you could do it

With 32-bit elements, an 8-bit mask gives you 256 possible vectors, each 8 elements long. 256 * 8B = 2048 bytes, which is a pretty big cache footprint even for the compressed version (load with vpmovsxbd ymm, m64).

To work around this, you can split the LUT into 4-bit chunks. It takes about 3 integer instructions to split up an 8-bit integer into two 4-bit integers (mov/and/shr). Then with an uncompressed LUT of 128b vectors (for 32-bit element size), vmovdqa the low half and vinserti128 the high half. You could still compress the LUT, but I wouldn't recommend it because you'll need vmovd / vpinsrd / vpmovsxbd, which is 2 shuffles (so you probably bottleneck on uop throughput).

Or 2x vpmovsxbd xmm, [lut + rsi*4] + vinserti128 is probably even worse on Intel.
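
Here's a hedged intrinsics sketch of that split-nibble LUT (uncompressed 128-bit entries, as recommended above; the table and names are mine):

#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

// Dword j of entry n is -1 if bit j of n is set.
#define ROW(n) { (n)&1?-1:0, (n)&2?-1:0, (n)&4?-1:0, (n)&8?-1:0 }
static alignas(16) const int32_t nibble_lut[16][4] = {
    ROW(0), ROW(1), ROW(2),  ROW(3),  ROW(4),  ROW(5),  ROW(6),  ROW(7),
    ROW(8), ROW(9), ROW(10), ROW(11), ROW(12), ROW(13), ROW(14), ROW(15),
};

__m256i mask8_to_dwords_lut(unsigned m) {
    __m128i lo = _mm_load_si128((const __m128i*)nibble_lut[m & 0xf]);        // vmovdqa low half
    __m128i hi = _mm_load_si128((const __m128i*)nibble_lut[(m >> 4) & 0xf]); // and/shr in scalar
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);       // vinserti128 high half
}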

When the whole bitmap fits in each element: broadcast it, AND with a selector mask, and VPCMPEQ against the same constant (which can stay in a register across multiple uses of this in a loop).

vpbroadcastd  ymm0,  dword [mask]            ; _mm256_set1_epi32
vpand         ymm0, ymm0,  setr_epi32(1<<0, 1<<1, 1<<2, 1<<3, ..., 1<<7)
vpcmpeqd      ymm0, ymm0,  [same constant]   ; _mm256_cmpeq_epi32
      ; ymm0 =  (mask & bit) == bit
      ; where bit = 1<<element_number

The mask could come from an integer register with vmovd + vpbroadcastd, but a broadcast-load is cheap if it's already in memory, e.g. from a mask array to apply to an array of elements. We actually only care about the low 8 bits of that dword because 8x 32-bit elements = 32 bytes (e.g. a mask you got from vmovmaskps). With a 16-bit mask for 16x 16-bit elements, you need vpbroadcastw. To get such a mask in the first place from 16-bit integer vectors, you might vpacksswb two vectors together (which preserves the sign bit of each element), vpermq to put the elements into sequential order after the in-lane pack, then vpmovmskb.
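
In intrinsics, the broadcast / AND / compare sequence above might look like this (the wrapper name is mine):

#include <immintrin.h>

__m256i mask8_to_dwords_alu(unsigned m) {
    const __m256i bits = _mm256_setr_epi32(1<<0, 1<<1, 1<<2, 1<<3,
                                           1<<4, 1<<5, 1<<6, 1<<7);
    __m256i bcast  = _mm256_set1_epi32(m);            // vpbroadcastd
    __m256i masked = _mm256_and_si256(bcast, bits);   // vpand
    return _mm256_cmpeq_epi32(masked, bits);          // vpcmpeqd: (mask & bit) == bit
}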

For 8-bit elements, you will need to vpshufb the vpbroadcastd result to get the relevant bit into each byte. See How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?. But for 16-bit and wider elements, the number of elements is <= the element width, so a broadcast-load does this for free. (16-bit broadcast loads do cost a micro-fused ALU shuffle uop, unlike 32 and 64-bit broadcast loads which are handled entirely in the load ports.)
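
A sketch along the lines of the linked answer for byte elements (my rendering, not necessarily that answer's exact code):

#include <immintrin.h>
#include <stdint.h>

__m256i mask32_to_bytes(uint32_t m) {
    __m256i bcast = _mm256_set1_epi32(m);
    // Copy mask byte j into output bytes 8j..8j+7 (vpshufb indexes within each 128-bit lane).
    const __m256i shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
                                          2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    __m256i spread = _mm256_shuffle_epi8(bcast, shuf);
    // Within each group of 8 bytes, select bit 0..7 respectively.
    const __m256i bits = _mm256_set1_epi64x((long long)0x8040201008040201ULL);
    return _mm256_cmpeq_epi8(_mm256_and_si256(spread, bits), bits);
}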

vpbroadcastd/q doesn't even cost any ALU uops, it's done right in the load port. (The b and w versions are load+shuffle.) Even if your masks are packed together (one per byte for 32 or 64-bit elements), it might still be more efficient to vpbroadcastd instead of vpbroadcastb. The x & mask == mask check doesn't care about garbage in the high bytes of each element after the broadcast. The only worry is cache-line / page splits.

Variable blends and masked loads/stores only care about the sign bit of the mask elements.

This is only 1 uop (on Skylake) once you have the 8-bit mask broadcast to dword elements.

vpbroadcastd  ymm0, dword [mask]
vpsllvd       ymm0, ymm0, [vec of 24, 25, 26, 27, 28, 29, 30, 31]  ; high bit of each element = corresponding bit of the mask

;vpsrad       ymm0, ymm0, 31   ; optional: broadcast the sign bit of each element to the whole element
; vpsllvd + vpsrad has no advantage over vpand / vpcmpeqb, so don't use this if you need all the bits set.

vpbroadcastd is as cheap as a load from memory (no ALU uop at all on Intel CPUs and Ryzen). (Narrower broadcasts, like vpbroadcastb y,mem take an ALU shuffle uop on Intel, but maybe not on Ryzen.)

The variable-shift is slightly expensive on Haswell/Broadwell (3 uops, limited execution ports), but as cheap as immediate-count shifts on Skylake! (1 uop on port 0 or 1.) On Ryzen they're also only 2 uops (the minimum for any 256b operation), but have 3c latency and one per 4c throughput.

See the x86 tag wiki for perf info, especially Agner Fog's insn tables.

For 64-bit elements, note that arithmetic right shifts are only available in 16 and 32-bit element size. Use a different strategy if you want the whole element set to all-zero / all-one for 4 bits -> 64-bit elements.
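
One such strategy is the broadcast + AND + compare pattern at qword granularity, since vpcmpeqq does exist in AVX2. A minimal sketch (the wrapper name is mine):

#include <immintrin.h>

__m256i mask4_to_qwords_alu(unsigned m) {
    const __m256i bits = _mm256_setr_epi64x(1, 2, 4, 8);   // bit 0..3, one per qword element
    __m256i bcast = _mm256_set1_epi64x(m);                 // broadcast the 4-bit mask
    return _mm256_cmpeq_epi64(_mm256_and_si256(bcast, bits), bits);  // full 0 / -1 qwords
}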

With intrinsics:

#include <immintrin.h>

__m256i bitmap2vecmask(int m) {
    const __m256i vshift_count = _mm256_set_epi32(24, 25, 26, 27, 28, 29, 30, 31);
    __m256i bcast = _mm256_set1_epi32(m);
    __m256i shifted = _mm256_sllv_epi32(bcast, vshift_count);  // high bit of each element = corresponding bit of the mask
    return shifted;

    // use _mm256_and_si256 and _mm256_cmpeq_epi32 if you need all bits set
    //return _mm256_srai_epi32(shifted, 31);             // broadcast the sign bit to the whole element
}
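
As a usage sketch (my example, not from the answer): the result can feed _mm256_maskstore_epi32 directly, since masked stores only look at each element's sign bit:

void store_selected(int *dst, __m256i v, int m) {
    _mm256_maskstore_epi32(dst, bitmap2vecmask(m), v);  // store only the dwords whose mask bit is set
}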

Inside a loop, a LUT might be worth the cache footprint, depending on the instruction mix in the loop. Especially for 64-bit element size where it's not much cache footprint, but possibly even for 32-bit.

Another option, instead of a LUT or a variable shift, is BMI2 pdep: deposit each mask bit into the high bit of a separate byte, then widen with vpmovsx:

; 8bit mask bitmap in eax, constant in rdi

pdep      rax, rax, rdi   ; rdi = 0b1000000010000000... repeating
vmovq     xmm0, rax
vpmovsxbd ymm0, xmm0      ; each element = 0xffffff80 or 0

; optional
;vpsrad    ymm0, ymm0, 8   ; arithmetic shift to get -1 or 0

If you already have masks in an integer register (where you'd have to vmovq / vpbroadcastd separately anyway), then this way is probably better even on Skylake where variable-count shifts are cheap.

If your masks start in memory, the other ALU method (vpbroadcastd directly into a vector) is probably better, because broadcast-loads are so cheap.

Note that pdep is 6 dependent uops on Ryzen (18c latency, 18c throughput), so this method is horrible on Ryzen even if your masks do start in integer regs.

(Future readers, feel free to edit in an intrinsics version of this. It's easier to write asm because it's a lot less typing, and the asm mnemonics are easier to read (no stupid _mm256_ clutter all over the place).)
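
Taking up that invitation, here is one possible intrinsics rendering of the pdep version above (a sketch under the same assumptions as the asm; it needs BMI2 and a 64-bit build, and the wrapper name is mine):

#include <immintrin.h>
#include <stdint.h>

__m256i mask8_to_dwords_pdep(unsigned m) {
    uint64_t spread = _pdep_u64(m, 0x8080808080808080ULL);  // mask bit i -> bit 7 of byte i
    __m128i bytes = _mm_cvtsi64_si128((long long)spread);   // vmovq
    __m256i v = _mm256_cvtepi8_epi32(bytes);                // vpmovsxbd: each element = 0xFFFFFF80 or 0
    return v;  // or _mm256_srai_epi32(v, 8) if you need full 0 / -1 elements
}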
