Why both? vperm2f128 (avx) vs vperm2i128 (avx2)


Question

AVX introduced the instruction vperm2f128 (exposed via _mm256_permute2f128_si256), while AVX2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256).

They both seem to do exactly the same thing, and their respective latencies and throughputs also seem to be identical.

So why do both instructions exist? There must be some reasoning behind that. Is there maybe something I have overlooked? Given that AVX2 operates on data structures introduced with AVX, I cannot imagine that a processor will ever exist that supports AVX2 but not AVX.

Answer

There's a bit of a disconnect between the intrinsics and the actual instructions underneath.

AVX:

All three of these intrinsics generate exactly the same instruction, vperm2f128:

  • _mm256_permute2f128_pd()
  • _mm256_permute2f128_ps()
  • _mm256_permute2f128_si256()

The only difference is the types - which don't exist at the instruction level.
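To make this concrete, here is a minimal sketch (the wrapper function names are mine, purely for illustration): all three wrappers should compile to the same single vperm2f128, with imm8 = 0x20 selecting the low lane of each source.

#include <immintrin.h>

/* Each wrapper selects the low 128-bit lane of 'a' and the low 128-bit lane
   of 'b' (imm8 = 0x20).  A compiler emits the same vperm2f128 for all three;
   only the C-level types differ. */
__m256d combine_low_pd(__m256d a, __m256d b) {
    return _mm256_permute2f128_pd(a, b, 0x20);
}

__m256 combine_low_ps(__m256 a, __m256 b) {
    return _mm256_permute2f128_ps(a, b, 0x20);
}

__m256i combine_low_si256(__m256i a, __m256i b) {
    return _mm256_permute2f128_si256(a, b, 0x20);
}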

vperm2f128 is a 256-bit floating-point instruction. In AVX, there are no "real" 256-bit integer SIMD instructions. So even though _mm256_permute2f128_si256() is an "integer" intrinsic, it's really just syntactic sugar for this:

_mm256_castpd_si256(                  /* reinterpret the __m256d result back as __m256i */
    _mm256_permute2f128_pd(
        _mm256_castsi256_pd(x),       /* reinterpret the integer inputs as __m256d */
        _mm256_castsi256_pd(y),
        imm
    )
);

This does a round trip from the integer domain to the FP domain - thus incurring bypass delays. As ugly as this looks, it is the only way to do it in AVX-only land.
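As an illustration (the function names here are hypothetical), the "integer" intrinsic and the hand-written cast round trip are interchangeable - both should end up as a single vperm2f128:

#include <immintrin.h>

/* imm8 = 0x21: result low half = high lane of x, result high half = low lane of y. */
__m256i pick_lanes_sugar(__m256i x, __m256i y) {
    return _mm256_permute2f128_si256(x, y, 0x21);
}

/* The same operation written out with explicit casts through the FP type. */
__m256i pick_lanes_explicit(__m256i x, __m256i y) {
    return _mm256_castpd_si256(
        _mm256_permute2f128_pd(
            _mm256_castsi256_pd(x),
            _mm256_castsi256_pd(y),
            0x21));
}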

vperm2f128 isn't the only instruction to get this treatment; there are at least three of them:

  • vperm2f128/_mm256_permute2f128_si256()
  • vextractf128/_mm256_extractf128_si256()
  • vinsertf128/_mm256_insertf128_si256()

Taken together, it seems the use case for these intrinsics is to load data as 256-bit integer vectors and shuffle them into multiple 128-bit integer vectors for integer computation - and likewise the reverse, where you store as 256-bit vectors.

Without these "hack" intrinsics, you would need to use a lot of cast intrinsics.
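A rough sketch of that use case, assuming AVX only (the routine and its name are mine): load eight 32-bit integers as one 256-bit vector, split it into two 128-bit halves for SSE integer arithmetic (AVX has no 256-bit integer add), and reassemble the result. Without the extract/insert intrinsics, every crossing of the lane boundary would need explicit casts through the FP types.

#include <immintrin.h>
#include <stdint.h>

void add_one_avx_only(const int32_t *src, int32_t *dst) {
    __m256i v    = _mm256_loadu_si256((const __m256i *)src);   /* 256-bit integer load */
    __m128i lo   = _mm256_castsi256_si128(v);                  /* low 128 bits  */
    __m128i hi   = _mm256_extractf128_si256(v, 1);             /* high 128 bits */
    __m128i ones = _mm_set1_epi32(1);

    lo = _mm_add_epi32(lo, ones);                              /* 128-bit integer math */
    hi = _mm_add_epi32(hi, ones);

    __m256i out = _mm256_castsi128_si256(lo);
    out = _mm256_insertf128_si256(out, hi, 1);                 /* put the high half back */
    _mm256_storeu_si256((__m256i *)dst, out);                  /* 256-bit integer store */
}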

Either way, a competent compiler will try to optimize the types as well. Thus it will generate floating-point loads/stores and shuffles even if you are using 256-bit integer loads. This reduces the number of bypass delays to only one layer (when you go from the FP shuffle to 128-bit integer computation).

AVX2:

AVX2 cleans up this madness by adding proper 256-bit integer SIMD support for everything - including the shuffles.

The vperm2i128 instruction is new, along with a new intrinsic for it, _mm256_permute2x128_si256().

This, along with _mm256_extracti128_si256() and _mm256_inserti128_si256(), lets you do 256-bit integer SIMD and actually stay completely in the integer domain.
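For example (a minimal sketch, with a made-up function name): swapping the two 128-bit halves of an integer vector and adding them to the original now stays entirely in the integer domain - no casts through __m256d are needed.

#include <immintrin.h>

__m256i add_swapped_halves(__m256i v) {
    /* imm8 = 0x01: result low half = high lane of v, result high half = low lane of v */
    __m256i swapped = _mm256_permute2x128_si256(v, v, 0x01);
    return _mm256_add_epi32(v, swapped);   /* 256-bit integer add, new in AVX2 */
}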

The distinction between the integer and FP versions of the same instruction has to do with bypass delays. In older processors, there were delays when moving data between the int and FP domains. While the SIMD registers themselves are type-agnostic, the hardware implementation isn't. And there is extra latency to get data output by an FP instruction to the input of an integer instruction (and vice versa).

Thus it was important (from a performance standpoint) to use the instruction type that matches the actual datatype being operated on.

On the newest processors (Skylake and later?), there no longer seem to be any int/FP bypass delays with regard to the shuffle instructions. While the instruction set still has this distinction, shuffle instructions that do the same thing with different "types" probably map to the same uop now.
