Where is VPERMB in AVX2?


Question

AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to totally arbitrarily broadcast/shuffle/permute from one 256-bit long vector of 32-bit values into another, with the permutation selectable at runtime[1]. Functionally, that obsoletes a whole slew of existing old unpack, broadcast, permute, shuffle and shift instructions[3].
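To make the VPERMD behavior concrete, here is a minimal sketch using the `_mm256_permutevar8x32_epi32` intrinsic. The wrapper name `permute_dwords` and the array-based interface are made up for illustration, and the `target` attribute assumes GCC or Clang on x86-64.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Cross-lane dword permute: dst[i] = src[idx[i] & 7].
 * The target attribute (a GCC/Clang extension, assumed here) lets this
 * one function use AVX2 without building the whole file with -mavx2. */
__attribute__((target("avx2")))
void permute_dwords(const uint32_t src[8], const uint32_t idx[8], uint32_t dst[8])
{
    assert(src && idx && dst);
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    /* The control vector is ordinary data, so it can be computed at runtime. */
    __m256i control = _mm256_loadu_si256((const __m256i *)idx);
    __m256i r = _mm256_permutevar8x32_epi32(v, control);   /* VPERMD */
    _mm256_storeu_si256((__m256i *)dst, r);
}
```

An index vector of {7,6,5,4,3,2,1,0} reverses the eight dwords across both lanes, and {5,5,5,5,5,5,5,5} broadcasts element 5 everywhere, with the same instruction in both cases.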

Cool beans.

So where is VPERMB? I.e., the same instruction, but working on byte-sized elements. Or, for that matter, where is VPERMW, for 16-bit elements? Having dabbled in x86 assembly for some time, it is pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time. It can do any possible permutation, broadcast or byte-wise shuffle. Furthermore, it can also be used to do 16 parallel 4-bit -> 8-bit table lookups[2].
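The table-lookup use of PSHUFB can be sketched like this: the 16-byte "data" operand is the table and the "control" operand holds the 4-bit indices. The function name `lookup16` is hypothetical, and the `target` attribute again assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* 16 parallel 4-bit -> 8-bit lookups: out[i] = table[nibbles[i] & 0x0F].
 * PSHUFB writes zero when bit 7 of an index byte is set, so the indices
 * are masked down to 0..15 first. */
__attribute__((target("ssse3")))
void lookup16(const uint8_t table[16], const uint8_t nibbles[16], uint8_t out[16])
{
    assert(table && nibbles && out);
    __m128i lut = _mm_loadu_si128((const __m128i *)table);
    __m128i idx = _mm_loadu_si128((const __m128i *)nibbles);
    idx = _mm_and_si128(idx, _mm_set1_epi8(0x0F));
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(lut, idx));   /* PSHUFB */
}
```

With the table set to the bytes of "0123456789abcdef", this turns 16 nibble values into their hex digits in one shuffle.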

Unfortunately, PSHUFB wasn't extended to be cross-lane in AVX2, so it is restricted to within-lane behavior. The VPERM instructions are able to do cross-lane shuffles (in fact, "perm" and "shuf" seem to be synonyms in the instruction mnemonics?) - but the 8 and 16-bit versions were omitted?
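The in-lane restriction is easy to see with a small sketch around `_mm256_shuffle_epi8` (the AVX2 form of PSHUFB). The wrapper name `shuffle_bytes_in_lane` is made up, and the `target` attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* AVX2 VPSHUFB: each 128-bit lane is shuffled independently.
 * For output byte i: zero if bit 7 of idx[i] is set, otherwise
 * src[(i & 16) + (idx[i] & 15)]. Only the low 4 index bits are used,
 * so an index can never reach into the other lane. */
__attribute__((target("avx2")))
void shuffle_bytes_in_lane(const uint8_t src[32], const uint8_t idx[32], uint8_t dst[32])
{
    assert(src && idx && dst);
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i control = _mm256_loadu_si256((const __m256i *)idx);
    _mm256_storeu_si256((__m256i *)dst, _mm256_shuffle_epi8(v, control));
}
```

Trying to broadcast byte 0 with an all-zero index vector gives byte 0 in the low 16 bytes but byte 16 in the high 16 bytes, and an attempted full reversal with indices 31, 30, ..., 0 only reverses each lane in place.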

There doesn't even seem to be a good way to emulate this instruction, whereas you can easily emulate the larger-width shuffles with smaller-width ones (often, it's even free: you just need a different mask).
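Both directions can be sketched side by side. The first function shows the "free" direction: a dword permutation rewritten as a byte-shuffle mask. The second is one possible AVX2 emulation of a full 32-byte permute, and it costs five instructions (two lane broadcasts, two in-lane shuffles, one blend) plus a shift to build the blend mask. Both function names are hypothetical, the emulation is just one way to do it rather than an established recipe, and the `target` attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Free direction: dword index d becomes byte indices 4d, 4d+1, 4d+2, 4d+3,
 * giving a mask usable with a 128-bit PSHUFB. */
void widen_dword_indices(const uint8_t dword_idx[4], uint8_t byte_idx[16])
{
    assert(dword_idx && byte_idx);
    for (int i = 0; i < 4; i++)
        for (int b = 0; b < 4; b++)
            byte_idx[4 * i + b] = (uint8_t)(4 * (dword_idx[i] & 3) + b);
}

/* Costly direction: dst[i] = src[idx[i]], assuming every idx[i] is 0..31. */
__attribute__((target("avx2")))
void permute_bytes_avx2(const uint8_t src[32], const uint8_t idx[32], uint8_t dst[32])
{
    assert(src && idx && dst);
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i control = _mm256_loadu_si256((const __m256i *)idx);
    /* Copy each 128-bit half of the source into both lanes. */
    __m256i lo = _mm256_permute2x128_si256(v, v, 0x00);
    __m256i hi = _mm256_permute2x128_si256(v, v, 0x11);
    /* In-lane shuffles use only the low 4 index bits. */
    __m256i from_lo = _mm256_shuffle_epi8(lo, control);
    __m256i from_hi = _mm256_shuffle_epi8(hi, control);
    /* Move index bit 4 into bit 7 of each byte, where the blend looks. */
    __m256i pick_hi = _mm256_slli_epi16(control, 3);
    __m256i r = _mm256_blendv_epi8(from_lo, from_hi, pick_hi);
    _mm256_storeu_si256((__m256i *)dst, r);
}
```

The widening step can be done once when the mask is built, which is why that direction is essentially free, while the byte-granular emulation pays its extra instructions on every call.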

I have no doubt that Intel is aware of the wide and heavy use of PSHUFB, so the question naturally arises as to why the byte variant was omitted in AVX2. Is the operation intrinsically harder to implement in hardware? Are there encoding restrictions forcing its omission?

[1] By selectable at runtime, I mean that the mask that defines the shuffling behavior comes from a register. This makes the instruction an order of magnitude more flexible than the earlier variants that take an immediate shuffle mask, in the same way that add is more flexible than inc or a variable shift is more flexible than an immediate shift.

[2] Or 32 such lookups in AVX2.

[3] The older instructions are occasionally useful if they have a shorter encoding, or avoid loading a mask from memory, but functionally they are superseded.

Recommended answer

I'm 99% sure the main factor is transistor cost of implementation. It would clearly be very useful, and the only reason it doesn't exist is that the implementation cost must outweigh the significant benefit.

Encoding-space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field that represents combinations of prefixes isn't a bit-field, it's an integer with most of the values unused.

They decided to implement it for AVX512VBMI, though, with larger element sizes available in AVX512BW and AVX512F. Maybe they realized how much it sucked to not have this, and decided to do it anyway. AVX512F takes a lot of die area / transistors to implement, so much that Intel decided not to even implement it in retail desktop CPUs for a couple of generations.

(Part of that is that I think these days a lot of code that can take advantage of brand new instruction sets is written to run on known servers, instead of using runtime dispatching for use on client machines.)

According to Wikipedia, AVX512VBMI isn't coming until Cannonlake, but then we will have vpermi2b, which does 64 parallel table lookups from a 128B table (2 zmm vectors). Skylake Xeon will only bring vpermi2w and larger element sizes (AVX512F + AVX512BW).
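Since AVX512VBMI hardware may not be at hand, here is a plain scalar model of what a vpermi2b-style lookup computes, as I understand the documented semantics. The function name `permute2_bytes_model` is made up; this is a reference model for checking results, not a fast implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 64-way byte lookup from a 128-byte table held as two
 * 64-byte halves a and b. Bits 0..5 of each index pick the byte within a
 * half, bit 6 picks the half, and bit 7 is ignored. */
void permute2_bytes_model(const uint8_t a[64], const uint8_t b[64],
                          const uint8_t idx[64], uint8_t dst[64])
{
    assert(a && b && idx && dst);
    for (int i = 0; i < 64; i++) {
        const uint8_t *half = (idx[i] & 0x40) ? b : a;
        dst[i] = half[idx[i] & 0x3F];
    }
}
```

Each of the 64 output bytes is chosen independently from any of the 128 table bytes, which is the byte-granular, cross-lane behavior the question asks for.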

I'm pretty sure that thirty-two 32:1 muxers are a lot more expensive than eight 8:1 muxers, even if the 8:1 muxers are 4x wider. They could implement it with multiple stages of shuffling (rather than a single 32:1 stage), since lane-crossing shuffles get a 3-cycle time budget to get their work done. But that's still a lot of transistors.
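One rough way to put numbers on that comparison, assuming each N:1 selector is built as a binary tree of 2:1 one-bit muxes (N - 1 cells per bit) and ignoring wiring, staging and layout, which may well dominate in real silicon. The cost model and the name `mux2_cells` are illustrative assumptions, not anything from Intel.

```c
#include <assert.h>

/* Number of 2:1 one-bit mux cells for `outputs` independent selectors,
 * each choosing one of `inputs` elements that are `width` bits wide,
 * with every selector built as a binary tree of 2:1 muxes. */
unsigned long mux2_cells(unsigned outputs, unsigned inputs, unsigned width)
{
    assert(inputs >= 1);
    return (unsigned long)outputs * width * (inputs - 1);
}
```

Under that model a byte-granular 256-bit permute needs mux2_cells(32, 32, 8) = 7936 cells, against mux2_cells(8, 8, 32) = 1792 for the dword version, so roughly 4.4 times as many.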

I'd love to see a less hand-wavy answer from someone with hardware design experience. I built a digital timer from TTL counter chips on a breadboard once (and IIRC, read out the counter from BASIC on a TI-99/4A, which was very obsolete even ~20 years ago), but that's about it.

"It's pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time."

Yup. It was the first variable-shuffle, with a control mask from a register instead of an immediate. Looking up a shuffle mask from a LUT of shuffle masks based on a pcmpeqb / pmovmskb result can do some crazy powerful things. @stgatilov's IPv4 dotted-quad -> int converter is one of my favourite examples of awesome SIMD tricks.
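A small sketch of the compare / movemask / LUT-of-shuffle-masks pattern, deliberately much simpler than the IPv4 converter mentioned above: it strips a delimiter byte out of an 8-byte chunk. Everything here (the names `pack_lut`, `build_pack_lut`, `strip_delim8`, and the 8-byte chunk size chosen to keep the table at 256 entries) is an illustrative assumption, and the `target` attribute and `__builtin_popcount` assume GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* pack_lut[m] is the shuffle mask that keeps the bytes whose bit in m is
 * clear, packed to the front; unused slots get 0x80 so PSHUFB zeroes them. */
static uint8_t pack_lut[256][16];
static int pack_lut_ready = 0;

static void build_pack_lut(void)
{
    for (int m = 0; m < 256; m++) {
        int n = 0;
        for (int i = 0; i < 8; i++)
            if (!(m & (1 << i)))
                pack_lut[m][n++] = (uint8_t)i;
        while (n < 16)
            pack_lut[m][n++] = 0x80;
    }
    pack_lut_ready = 1;
}

/* Removes every byte equal to delim from an 8-byte chunk, packs the
 * survivors to the front of out and zero-fills the rest.
 * Returns the number of bytes kept. Lazy table init is not thread-safe. */
__attribute__((target("ssse3")))
int strip_delim8(const uint8_t in[8], uint8_t delim, uint8_t out[8])
{
    assert(in && out);
    if (!pack_lut_ready)
        build_pack_lut();
    __m128i v = _mm_loadl_epi64((const __m128i *)in);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)delim));   /* PCMPEQB */
    int m = _mm_movemask_epi8(eq) & 0xFF;                          /* PMOVMSKB */
    __m128i shuf = _mm_loadu_si128((const __m128i *)pack_lut[m]);
    _mm_storel_epi64((__m128i *)out, _mm_shuffle_epi8(v, shuf));   /* PSHUFB */
    return 8 - __builtin_popcount((unsigned)m);
}
```

The comparison result never feeds a branch: it becomes an integer, the integer indexes a table, and the table entry is the shuffle that does the work.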
