Is there an inverse instruction to the movemask instruction in Intel AVX2?


Problem description


The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8 or all 32 bits depending on the input vector element type) is the most significant bit of the corresponding vector element.

I would like to do the inverse: take an int32 (where only the 4, 8 or 32 least significant bits are meaningful), and get a __m256i where the most significant bit of each int8, int32 or int64 sized block is set to the original bit.

Basically, I want to go from a compressed bitmask to one that is usable as a mask by other AVX2 instructions (such as maskstore, maskload, mask_gather).

I couldn't quickly find an instruction that does it, so I am asking here. If there isn't one instruction with that functionality, is there a clever hack you can think of that achieves this in very few instructions?

My current method is to use a 256 element lookup table. I want to use this operation within a loop where not much else is happening, to speed it up. Note, I'm not too interested in long multi-instruction sequences or little loops that implement this operation.

Solution

There is no single instruction in AVX2 or earlier. (AVX512 can use masks in bitmap form directly, and has an instruction to expand masks to vectors).


If you're loading the bitmap from memory, loading it straight into vector registers for an ALU strategy should work well.

If you have the bitmap as a computation result, then it will be in an integer register where you can use it as a LUT index easily, so that's a good choice if you're aiming for 64-bit elements. Otherwise probably still go ALU for 32-bit elements or smaller, instead of a giant LUT or doing multiple chunks.


We'll have to wait for AVX-512's mask registers before cheap conversion from integer bitmasks to vector masks is possible. (With kmovw k1, r/m16, which compilers generate implicitly for int => __mmask16). There's an AVX512 insn to set a vector from a mask (VPMOVM2D zmm1, k1, _mm512_movm_epi8/16/32/64, with other versions for different element sizes), but you generally don't need it since everything that used to use mask vectors now uses mask registers. Maybe if you want to count elements that meet some comparison condition? (where you'd use pcmpeqd / psubd to generate and accumulate the vector of 0 or -1 elements). But scalar popcnt on the mask results would be a better bet.

But note that vpmovm2d requires the mask to be in an AVX512 k0..7 mask register. Getting it there will take extra instructions unless it came from a vector compare result, and instructions that move into mask registers need a uop for port 5 on Intel Skylake-X and similar CPUs so this can be a bottleneck (especially if you do any shuffles). Especially if it starts in memory (loading a bitmap) and you only need the high bit of each element, you're probably still better off with a broadcast load + variable shift even if 256-bit and 512-bit AVX512 instructions are available.
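
For reference, here is a minimal AVX-512 sketch (assuming AVX512F + AVX512DQ; the function name is made up, not from the original answer). The int => __mmask16 conversion is just a kmovw, and vpmovm2d expands the mask register into a vector if you really need the vector form:

#include <immintrin.h>

__m512i bitmap2vecmask_avx512(int m) {
    __mmask16 k = (__mmask16)m;      // int => __mmask16: the compiler emits kmovw to get it into a k register
    return _mm512_movm_epi32(k);     // vpmovm2d: each 32-bit element = -1 if its mask bit is set, else 0
}

// Usually you don't need the vector at all: masked operations take the mask directly,
// e.g. _mm512_mask_storeu_epi32(dst, k, v);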


For 64-bit elements, the mask only has 4 bits, so a lookup table is reasonable. You can compress the LUT by loading it with VPMOVSXBQ ymm1, xmm2/m32. (_mm256_cvtepi8_epi64). This gives you a LUT size of (1<<4) = 16 entries * 4 bytes = 64B = 1 cache line. Unfortunately, pmovsx is inconvenient to use as a narrow load with intrinsics.

Especially if you already have your bitmap in an integer register (instead of memory), a vpmovsxbq LUT should be excellent inside an inner loop for 64-bit elements. Or if instruction throughput or shuffle throughput is a bottleneck, use an uncompressed LUT. This can let you (or the compiler) use the mask vector as a memory operand for something else, instead of needing a separate instruction to load it.
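
As a hedged sketch of the compressed-LUT idea with intrinsics (the table and function names are mine; the memcpy works around pmovsx not being directly usable as a narrow load from intrinsics):

#include <immintrin.h>
#include <string.h>

// 16 entries of 4 bytes: byte i of entry m is -1 if bit i of m is set, else 0
static const signed char lut4_to_64[16][4] = {
    { 0, 0, 0, 0}, {-1, 0, 0, 0}, { 0,-1, 0, 0}, {-1,-1, 0, 0},
    { 0, 0,-1, 0}, {-1, 0,-1, 0}, { 0,-1,-1, 0}, {-1,-1,-1, 0},
    { 0, 0, 0,-1}, {-1, 0, 0,-1}, { 0,-1, 0,-1}, {-1,-1, 0,-1},
    { 0, 0,-1,-1}, {-1, 0,-1,-1}, { 0,-1,-1,-1}, {-1,-1,-1,-1},
};

__m256i bitmask_to_vecmask64_lut(unsigned m) {
    int chunk;
    memcpy(&chunk, lut4_to_64[m & 0xF], sizeof(chunk));      // 4-byte load of one LUT entry
    return _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(chunk));   // vpmovsxbq: sign-extend each byte to 64 bits
}

Whether the compiler folds this into a single memory-source vpmovsxbq is not guaranteed, which is exactly the inconvenience mentioned above.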


LUT for 32-bit elements: probably not optimal but here's how you could do it

With 32-bit elements, an 8-bit mask gives you 256 possible vectors, each 8 elements long. 256 * 8B = 2048 bytes, which is a pretty big cache footprint even for the compressed version (load with vpmovsxbd ymm, m64).

To work around this, you can split the LUT into 4-bit chunks. It takes about 3 integer instructions to split up an 8-bit integer into two 4-bit integers (mov/and/shr). Then with an uncompressed LUT of 128b vectors (for 32-bit element size), vmovdqa the low half and vinserti128 the high half. You could still compress the LUT, but I wouldn't recommend it because you'll need vmovd / vpinsrd / vpmovsxbd, which is 2 shuffles (so you probably bottleneck on uop throughput).

Or 2x vpmovsxbd xmm, [lut + rsi*4] + vinserti128 is probably even worse on Intel.
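
A sketch of the split-LUT approach with an uncompressed table of 128-bit halves (names and initialization are mine, not from the original):

#include <immintrin.h>
#include <stdint.h>

// 16 entries of 4x int32: element i of entry m is all-ones if bit i of m is set
static int32_t lut4_to_32[16][4];          // fill once at startup

static void init_lut4_to_32(void) {
    for (int m = 0; m < 16; m++)
        for (int i = 0; i < 4; i++)
            lut4_to_32[m][i] = ((m >> i) & 1) ? -1 : 0;
}

__m256i bitmask_to_vecmask32_lut(unsigned m) {
    __m128i lo = _mm_loadu_si128((const __m128i *)lut4_to_32[m & 0xF]);         // low 4 elements
    __m128i hi = _mm_loadu_si128((const __m128i *)lut4_to_32[(m >> 4) & 0xF]);  // high 4 elements
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);          // vinserti128 high half
}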


ALU alternative: good for 16/32/64-bit elements

When the whole bitmap fits in each element: broadcast it, AND with a selector mask, and VPCMPEQ against the same constant (which can stay in a register across multiple uses of this in a loop).

vpbroadcastd  ymm0,  dword [mask]
vpand         ymm0, ymm0,  setr_epi32(1<<0, 1<<1, 1<<2, 1<<3, ..., 1<<7)
vpcmpeqd      ymm0, ymm0,  [same constant]
      ; ymm0 =  (mask & bit) == bit
      ; where bit = 1<<element_number
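
The same thing as a hedged intrinsics sketch (the function name is mine; keep the constant in a variable so it stays in a register when used in a loop):

#include <immintrin.h>

__m256i bitmap2vecmask_alu(int m) {
    const __m256i bits = _mm256_setr_epi32(1<<0, 1<<1, 1<<2, 1<<3,
                                           1<<4, 1<<5, 1<<6, 1<<7);
    __m256i bcast = _mm256_set1_epi32(m);              // vpbroadcastd (vmovd + broadcast if m is in a GP register)
    __m256i sel   = _mm256_and_si256(bcast, bits);     // vpand with the selector constant
    return _mm256_cmpeq_epi32(sel, bits);              // vpcmpeqd: (mask & bit) == bit -> all-ones element
}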

The mask could come from an integer register with vmovd + vpbroadcastd, but a broadcast-load is cheap if it's already in memory, e.g. from a mask array to apply to an array of elements. We actually only care about the low 8 bits of that dword because 8x 32-bit elements = 32 bytes. (e.g. that you got from vmovmskps). With a 16-bit mask for 16x 16-bit elements, you need vpbroadcastw. To get such a mask in the first place from 16-bit integer vectors, you might vpacksswb two vectors together (which preserves the sign bit of each element), vpermq to put the elements into sequential order after in-lane pack, then vpmovmskb.
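
That mask-generation step might look like this with intrinsics (a sketch; the helper name is mine, and v0/v1 are assumed to hold 0 / -1 compare results per 16-bit element):

#include <immintrin.h>

// bit i of the low 16 bits = sign of element i of v0; high 16 bits = same for v1
unsigned movemask_epi16_pair(__m256i v0, __m256i v1) {
    __m256i packed = _mm256_packs_epi16(v0, v1);       // vpacksswb: saturating pack preserves the sign
    packed = _mm256_permute4x64_epi64(packed, 0xD8);   // vpermq 0b11011000: fix the in-lane interleave
    return (unsigned)_mm256_movemask_epi8(packed);     // vpmovmskb
}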

For 8-bit elements, you will need to vpshufb the vpbroadcastd result to get the relevant bit into each byte. See How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?. But for 16-bit and wider elements, the number of elements is <= the element width, so a broadcast-load does this for free. (16-bit broadcast loads do cost a micro-fused ALU shuffle uop, unlike 32 and 64-bit broadcast loads which are handled entirely in the load ports.)
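
A sketch of that byte-element approach (along the lines of the linked answer; the constants and name here are mine): broadcast the 32-bit mask, vpshufb to replicate mask byte j into the 8 bytes it covers, then test one bit per byte:

#include <immintrin.h>
#include <stdint.h>

__m256i bitmap2vecmask_epi8(uint32_t m) {
    __m256i bcast = _mm256_set1_epi32((int)m);                  // vpbroadcastd
    const __m256i shuf = _mm256_setr_epi64x(0x0000000000000000LL, 0x0101010101010101LL,
                                            0x0202020202020202LL, 0x0303030303030303LL);
    __m256i bytes = _mm256_shuffle_epi8(bcast, shuf);           // vpshufb: mask byte j -> result bytes 8j..8j+7
    const __m256i bits = _mm256_set1_epi64x((long long)0x8040201008040201ULL);  // one distinct bit per byte in each group of 8
    return _mm256_cmpeq_epi8(_mm256_and_si256(bytes, bits), bits);  // all-ones byte where its mask bit was set
}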

vpbroadcastd/q doesn't even cost any ALU uops, it's done right in the load port. (b and w are load+shuffle). Even if your masks are packed together (one per byte for 32 or 64-bit elements), it might still be more efficient to vpbroadcastd instead of vpbroadcastb. The x & mask == mask check doesn't care about garbage in the high bytes of each element after the broadcast. The only worry is cache-line / page splits.


Variable shift (cheaper on Skylake) if you need just the sign bit

Variable blends and masked loads/stores only care about the sign bit of the mask elements.

This is only 1 uop (on Skylake) once you have the 8-bit mask broadcast to dword elements.

vpbroadcastd  ymm0, dword [mask]

vpsllvd       ymm0, ymm0, [vec of 24, 25, 26, 27, 28, 29, 30, 31]  ; high bit of each element = corresponding bit of the mask

;vpsrad        ymm0, ymm0, 31                          ; broadcast the sign bit of each element to the whole element
;vpsllvd + vpsrad has no advantage over vpand / vpcmpeqb, so don't use this if you need all the bits set.

vpbroadcastd is as cheap as a load from memory (no ALU uop at all on Intel CPUs and Ryzen). (Narrower broadcasts, like vpbroadcastb y,mem take an ALU shuffle uop on Intel, but maybe not on Ryzen.)

The variable-shift is slightly expensive on Haswell/Broadwell (3 uops, limited execution ports), but as cheap as immediate-count shifts on Skylake! (1 uop on port 0 or 1.) On Ryzen they're also only 2 uops (the minimum for any 256b operation), but have 3c latency and one per 4c throughput.

See the tag wiki for perf info, especially Agner Fog's insn tables.

For 64-bit elements, note that arithmetic right shifts are only available in 16 and 32-bit element size. Use a different strategy if you want the whole element set to all-zero / all-one for 4 bits -> 64-bit elements.
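
For example, the AND/compare strategy from the previous section does work for 4 bits -> 64-bit elements (a sketch; the name is mine):

#include <immintrin.h>

__m256i bitmap2vecmask64(int m) {
    const __m256i bits = _mm256_setr_epi64x(1, 2, 4, 8);    // one selector bit per 64-bit element
    __m256i bcast = _mm256_set1_epi64x(m);
    return _mm256_cmpeq_epi64(_mm256_and_si256(bcast, bits), bits);   // all-ones where the bit was set
}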

The variable-shift version with intrinsics:

__m256i bitmap2vecmask(int m) {
    const __m256i vshift_count = _mm256_set_epi32(24, 25, 26, 27, 28, 29, 30, 31);
    __m256i bcast = _mm256_set1_epi32(m);
    __m256i shifted = _mm256_sllv_epi32(bcast, vshift_count);  // high bit of each element = corresponding bit of the mask
    return shifted;

    // use _mm256_and_si256 and _mm256_cmpeq_epi32 if you need all bits set.
    //return _mm256_srai_epi32(shifted, 31);             // broadcast the sign bit to the whole element
}
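
Since variable blends and masked loads/stores only look at the sign bit, the result can be fed to something like _mm256_maskstore_epi32 without the optional srai. A usage sketch (the buffer and values are made up):

void demo(void) {
    int out[8] = {0};
    __m256i vals = _mm256_set1_epi32(42);
    _mm256_maskstore_epi32(out, bitmap2vecmask(0xB1), vals);   // writes 42 only to elements whose mask bit is set
}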

Inside a loop, a LUT might be worth the cache footprint, depending on the instruction mix in the loop. Especially for 64-bit element size where it's not much cache footprint, but possibly even for 32-bit.


Another option, instead of variable shift, is to use BMI2 to unpack each bit to a byte with that mask element in the high bit, then vpmovsx:

; 8bit mask bitmap in eax, constant in rdi

pdep      rax, rax, rdi   ; rdi = 0b1000000010000000... repeating
vmovq     xmm0, rax
vpmovsxbd ymm0, xmm0      ; each element = 0xffffff80 or 0

; optional
;vpsrad    ymm0, ymm0, 8   ; arithmetic shift to get -1 or 0

If you already have masks in an integer register (where you'd have to vmovq / vpbroadcastd separately anyway), then this way is probably better even on Skylake where variable-count shifts are cheap.

If your masks start in memory, the other ALU method (vpbroadcastd directly into a vector) is probably better, because broadcast-loads are so cheap.

Note that pdep is 6 dependent uops on Ryzen (18c latency, 18c throughput), so this method is horrible on Ryzen even if your masks do start in integer regs.

(Future readers, feel free to edit in an intrinsics version of this. It's easier to write asm because it's a lot less typing, and the asm mnemonics are easier to read (no stupid _mm256_ clutter all over the place).)
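
A possible intrinsics version of the pdep approach (the function name is mine; requires BMI2 and a 64-bit target):

#include <immintrin.h>
#include <stdint.h>

__m256i bitmap2vecmask_pdep(unsigned m) {
    uint64_t spread = _pdep_u64(m, 0x8080808080808080ULL);   // bit i of the mask -> bit 7 of byte i
    __m128i bytes = _mm_cvtsi64_si128((long long)spread);    // vmovq
    __m256i v = _mm256_cvtepi8_epi32(bytes);                 // vpmovsxbd: each element = 0xffffff80 or 0
    return v;
    // return _mm256_srai_epi32(v, 8);    // optional: arithmetic shift to get -1 or 0
}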
