Where is VPERMB in AVX2?


Question

AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to totally arbitrarily broadcast/shuffle/permute from one 256-bit long vector of 32-bit values into another, with the permutation selectable at runtime [1]. Functionally, that obsoletes a whole slew of existing old unpack, broadcast, permute, shuffle and shift instructions [3].
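To make "totally arbitrary" concrete, here is a minimal scalar model of what VPERMD computes, written in Python rather than as the real instruction. The function name `vpermd` and the list-of-ints representation are illustrative assumptions; on real hardware the same operation is exposed through the `_mm256_permutevar8x32_epi32` intrinsic.

```python
def vpermd(src, idx):
    """Scalar model of VPERMD: out[i] = src[idx[i] mod 8] for eight 32-bit elements."""
    assert len(src) == 8 and len(idx) == 8
    # Only the low 3 bits of each index element are used as the selector.
    return [src[i & 7] for i in idx]


v = [10, 11, 12, 13, 14, 15, 16, 17]

# One instruction, three different "legacy" operations, chosen by the index register:
broadcast_3 = vpermd(v, [3] * 8)                      # broadcast element 3
reversed_v = vpermd(v, [7, 6, 5, 4, 3, 2, 1, 0])      # full cross-lane reverse
rotated = vpermd(v, [(i + 1) & 7 for i in range(8)])  # rotate by one element
```

Because the index vector is ordinary data, any source element can land in any destination slot, including across the 128-bit lane boundary.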

Cool beans.

So where is VPERMB? I.e., the same instruction, but working on byte-sized elements. Or, for that matter, where is VPERMW, for 16-bit elements? Having dabbled in x86 assembly for some time, it is pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time. It can do any possible permutation, broadcast or byte-wise shuffle. Furthermore, it can also be used to do 16 parallel 4-bit -> 8-bit table lookups [2].
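The table-lookup use of PSHUFB can be sketched with a scalar Python model. The names `pshufb`, `POPCOUNT4` and `popcount_bytes` are hypothetical, and the per-byte popcount is just one well-known application chosen for illustration; the trick is that the "data" operand holds a 16-entry table and the control operand holds the 4-bit indices.

```python
def pshufb(table, ctrl):
    """Scalar model of SSSE3 PSHUFB on 16-byte vectors."""
    assert len(table) == 16 and len(ctrl) == 16
    # Bit 7 of a control byte zeroes the output byte; otherwise the low 4 bits select.
    return [0 if c & 0x80 else table[c & 0x0F] for c in ctrl]


# 16-entry table: number of set bits in each 4-bit value.
POPCOUNT4 = [bin(n).count("1") for n in range(16)]


def popcount_bytes(data):
    """Per-byte popcount of 16 bytes using two 16-way nibble lookups."""
    lo = pshufb(POPCOUNT4, [b & 0x0F for b in data])  # look up the low nibbles
    hi = pshufb(POPCOUNT4, [b >> 4 for b in data])    # look up the high nibbles
    return [l + h for l, h in zip(lo, hi)]
```

Each call to `pshufb` here stands for one instruction performing 16 independent lookups.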

Unfortunately, PSHUFB wasn't extended to be cross-lane in AVX2, so it is restricted to within-lane behavior. The VPERM instructions are able to do cross-lane shuffles (in fact, "perm" and "shuf" seem to be synonyms in the instruction mnemonics?) - but the 8 and 16-bit versions were omitted?
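The in-lane restriction, and what working around it costs, can be shown with a scalar model. This is a sketch under stated assumptions: all function names are hypothetical, and the workaround shown (two in-lane shuffles, a lane swap and an OR, plus preparing two control vectors) is one commonly used pattern rather than anything the post itself describes.

```python
def pshufb(table, ctrl):
    """Scalar model of 16-byte PSHUFB: bit 7 zeroes, low 4 bits select."""
    return [0 if c & 0x80 else table[c & 0x0F] for c in ctrl]


def vpshufb256(src, ctrl):
    """AVX2 VPSHUFB model: two independent 16-byte shuffles, one per 128-bit lane."""
    assert len(src) == 32 and len(ctrl) == 32
    return pshufb(src[:16], ctrl[:16]) + pshufb(src[16:], ctrl[16:])


def swap_lanes(src):
    """Model of exchanging the two 128-bit lanes (e.g. VPERM2I128 with imm8 = 0x01)."""
    return src[16:] + src[:16]


def permute_bytes_emulated(src, idx):
    """Full 32-byte permute, out[i] = src[idx[i] & 31], built from in-lane shuffles."""
    same, other = [], []
    for i, j in enumerate(idx):
        j &= 31
        own_lane = (j >> 4) == (i >> 4)
        # Each output byte is taken from exactly one of the two shuffles;
        # the other shuffle produces zero there via control bit 7.
        same.append((j & 15) if own_lane else 0x80)
        other.append(0x80 if own_lane else (j & 15))
    a = vpshufb256(src, same)
    b = vpshufb256(swap_lanes(src), other)
    return [x | y for x, y in zip(a, b)]


src = list(range(100, 132))
reverse_idx = [31 - i for i in range(32)]
in_lane_only = vpshufb256(src, reverse_idx)          # reverses each lane separately
full_reverse = permute_bytes_emulated(src, reverse_idx)
```

The single in-lane shuffle cannot move a byte from one half to the other, while the emulated version can, at the price of several instructions and extra mask setup.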

There doesn't even seem to be a good way to emulate this instruction, whereas you can easily emulate the larger-width shuffles with smaller-width ones (often, it's even free: you just need a different mask).
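The "just a different mask" remark can be illustrated for the 16-byte case: a shuffle of four 32-bit elements is the same as a byte shuffle whose control vector is expanded from the element indices. This is a scalar sketch; `pshufb` and `dword_to_byte_ctrl` are hypothetical helper names.

```python
def pshufb(table, ctrl):
    """Scalar model of 16-byte PSHUFB: bit 7 zeroes, low 4 bits select."""
    return [0 if c & 0x80 else table[c & 0x0F] for c in ctrl]


def dword_to_byte_ctrl(dword_idx):
    """Expand four 32-bit element indices into the equivalent 16-byte PSHUFB control."""
    assert len(dword_idx) == 4
    # Element d occupies bytes 4*d .. 4*d+3, so each index becomes four byte indices.
    return [4 * (d & 3) + k for d in dword_idx for k in range(4)]


v = list(range(16))                       # bytes 0..15, i.e. four 4-byte elements
ctrl = dword_to_byte_ctrl([3, 2, 1, 0])   # reverse the order of the four elements
shuffled = pshufb(v, ctrl)
```

The expansion happens once, when the mask constant is built, which is why the wider shuffle is often free to emulate. The reverse direction has no such trick: a 32-bit-granularity shuffle cannot separate the bytes inside an element.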

I have no doubt that Intel is aware of the wide and heavy use of PSHUFB, so the question naturally arises as to why the byte variant was omitted in AVX2. Is the operation intrinsically harder to implement in hardware? Are there encoding restrictions forcing its omission?


[1] By selectable at runtime, I mean that the mask that defines the shuffling behavior comes from a register. This makes the instruction an order of magnitude more flexible than the earlier variants that take an immediate shuffle mask, in the same way that add is more flexible than inc or a variable shift is more flexible than an immediate shift.

[2] Or 32 such lookups in AVX2.

[3] The older instructions are occasionally useful if they have a shorter encoding, or avoid loading a mask from memory, but functionally they are superseded.

Answer

I'm 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn't exist is that the implementation cost must outweigh the significant benefit.

Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot: the field that represents combinations of prefixes isn't a bit-field, it's an integer with most of the values unused.
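As a rough illustration of that spare room, here is a sketch that pulls the fields out of a 3-byte VEX prefix. The function and dictionary names are hypothetical. The field layout and the example bytes `C4 E2 75` (the prefix of `vpermd ymm0, ymm1, ymm2`, whose full encoding is `C4 E2 75 36 C2`) follow the published VEX format; the assumption that only map values 1 to 3 were assigned reflects the AVX2 era.

```python
# 5-bit opcode-map selector: an integer, of which AVX2-era encodings use three values.
OPCODE_MAPS = {1: "0F", 2: "0F38", 3: "0F3A"}
# 2-bit field standing in for the legacy SIMD prefixes.
SIMD_PREFIX = {0: "none", 1: "66", 2: "F3", 3: "F2"}


def decode_vex3(b0, b1, b2):
    """Split a 3-byte VEX prefix (C4 xx xx) into its fields."""
    assert b0 == 0xC4, "not a 3-byte VEX prefix"
    return {
        # R, X, B and vvvv are stored inverted in the encoding.
        "R": ((b1 >> 7) & 1) ^ 1,
        "X": ((b1 >> 6) & 1) ^ 1,
        "B": ((b1 >> 5) & 1) ^ 1,
        "map_select": b1 & 0x1F,
        "W": (b2 >> 7) & 1,
        "vvvv": (~(b2 >> 3)) & 0x0F,
        "L": (b2 >> 2) & 1,       # 1 = 256-bit vector length
        "pp": b2 & 0x03,
    }


fields = decode_vex3(0xC4, 0xE2, 0x75)
unused_maps = 32 - len(OPCODE_MAPS)   # selector values with no map assigned here
```

Under that assumption, 29 of the 32 map-selector values are free, each of which could hold a full page of 256 opcodes.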

They decided to implement it for AVX512VBMI, though, with larger element sizes available in AVX512BW and AVX512F. Maybe they realized how much it sucked to not have this, and decided to do it anyway. AVX512F takes a lot of die area / transistors to implement, so much that Intel decided not to even implement it in retail desktop CPUs for a couple generations.

(Part of that is that I think these days a lot of code that can take advantage of brand new instruction sets is written to run on known servers, instead of runtime dispatching for use on client machines).

According to Wikipedia, AVX512VBMI isn't coming until Cannonlake, but then we will have vpermi2b, which does 64 parallel table lookups from a 128B table (2 zmm vectors). Skylake Xeon will only bring vpermi2w and larger element sizes (AVX512F + AVX512BW).
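The 128-byte table lookup can be modelled in a few lines. This is a scalar sketch that ignores masking and the fact that the real instruction overwrites its index register; the function name and argument order are illustrative assumptions.

```python
def vpermi2b(idx, a, b):
    """Scalar model of vpermi2b: 64 byte lookups into the 128-byte table a ++ b."""
    assert len(idx) == len(a) == len(b) == 64
    table = a + b
    # Bits 0..5 of each index pick the byte, bit 6 picks the source vector;
    # together that is a 7-bit index into the concatenated table.
    return [table[i & 0x7F] for i in idx]


a = list(range(64))            # first zmm: table entries 0..63
b = list(range(100, 164))      # second zmm: table entries 64..127
idx = [0, 64, 127, 133] + [0] * 60
out = vpermi2b(idx, a, b)
```

Compared with the 16-entry PSHUFB table, a 7-bit index is enough to cover, for example, the whole ASCII range in one lookup.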


I'm pretty sure that thirty-two 32:1 muxers are a lot more expensive than eight 8:1 muxers, even if the 8:1 muxers are 4x wider. They could implement it with multiple stages of shuffling (rather than a single 32:1 stage), since lane-crossing shuffles get a 3-cycle time budget to get their work done. But still a lot of transistors.
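A back-of-the-envelope version of that comparison, under a deliberately crude assumption: each N:1 mux is built as a binary tree of 2:1 one-bit mux cells, so it needs (N - 1) cells per bit of width. Real designs differ, and wiring, which this ignores, may well dominate. The function names are hypothetical.

```python
def mux_tree_cells(inputs, width_bits):
    """2:1 one-bit mux cells in a binary-tree N:1 mux of the given width."""
    return (inputs - 1) * width_bits


def shuffle_cells(elements, element_bits):
    """A full crossbar shuffle needs one N:1 mux per output element."""
    return elements * mux_tree_cells(elements, element_bits)


vpermd_cells = shuffle_cells(8, 32)    # eight 8:1 muxes, 32 bits wide: 8 * 7 * 32 = 1792
vpermb_cells = shuffle_cells(32, 8)    # thirty-two 32:1 muxes, 8 bits wide: 32 * 31 * 8 = 7936
ratio = vpermb_cells / vpermd_cells    # roughly 4.4x by this measure
```

Even this simple count puts the byte-granularity crossbar at more than four times the cell count of the 32-bit one over the same 256 bits of data.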

I'd love to see a less hand-wavy answer from someone with hardware design experience. I built a digital timer from TTL counter chips on a breadboard once (and IIRC, read out the counter from BASIC on a TI-99/4A which was very obsolete even ~20 years ago whe), but that's about it.


"It's pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time."

Yup. It was the first variable-shuffle, with a control mask from a register instead of an immediate. Looking up a shuffle mask from a LUT of shuffle masks based on a pcmpeqb / pmovmskb result can do some crazy powerful things. @stgatilov's IPv4 dotted-quad -> int converter is one of my favourite examples of awesome SIMD tricks.
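The compare / movemask / LUT / shuffle pattern can be sketched with scalar models. This is not @stgatilov's converter; it is a much simpler illustration (dropping one unwanted byte value from the low 8 bytes of a vector), and every name below is hypothetical. Only 8 bytes are compacted so that the LUT stays at 256 entries.

```python
def pshufb(table, ctrl):
    """Scalar model of 16-byte PSHUFB: bit 7 zeroes, low 4 bits select."""
    return [0 if c & 0x80 else table[c & 0x0F] for c in ctrl]


def pcmpeqb(a, b):
    """Model of PCMPEQB: 0xFF where bytes are equal, 0x00 elsewhere."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pmovmskb(v):
    """Model of PMOVMSKB: gather the top bit of each byte into an integer."""
    return sum(((x >> 7) & 1) << i for i, x in enumerate(v))


def build_lut():
    """LUT[m] = shuffle control that packs the bytes whose mask bit is 0 to the front."""
    lut = []
    for m in range(256):
        keep = [i for i in range(8) if not (m >> i) & 1]
        lut.append(keep + [0x80] * (16 - len(keep)))   # pad with "zero this byte"
    return lut


LUT = build_lut()


def drop_byte_low8(v, unwanted):
    """Remove every occurrence of `unwanted` from the low 8 bytes of a 16-byte vector."""
    mask = pmovmskb(pcmpeqb(v, [unwanted] * 16)) & 0xFF
    kept = 8 - bin(mask).count("1")
    return pshufb(v, LUT[mask])[:kept]


packed = bytes(drop_byte_low8(list(b"a b  cd ........"), ord(" ")))
```

The data-dependent part of the algorithm is reduced to a table index, so the shuffle itself stays branch-free.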
