Difference between the AVX instructions vxorpd and vpxor

Question

According to the Intel Intrinsics Guide:

  • vxorpd ymm, ymm, ymm: Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
  • vpxor ymm, ymm, ymm: Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.

What is the difference between the two? It appears to me that both instructions do a bitwise XOR on all 256 bits of the ymm registers. Is there any performance penalty if I use vxorpd for integer data (or vpxor for floating-point data)?

Recommended answer

Combining some comments into an answer:

Other than performance, they have identical behaviour (I think even with a memory operand: all AVX instructions share the same lack of alignment requirements).
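The claim of identical behaviour can be checked with a small sketch in C. It assumes GCC or Clang on x86-64, because it relies on the `target` function attribute and `__builtin_cpu_supports`; the helper names `xor_results_match_avx2` and `xor_results_match` are hypothetical. Note that the compiler is free to pick either encoding for either intrinsic, so this only compares results and says nothing about which instruction is emitted.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR four 64-bit lanes through the integer and the
   FP intrinsic and report whether the two results are bit-identical. */
__attribute__((target("avx2")))
static int xor_results_match_avx2(const uint64_t a[4], const uint64_t b[4])
{
    /* Unaligned loads: no alignment requirement on the inputs. */
    __m256i ia = _mm256_loadu_si256((const __m256i *)a);
    __m256i ib = _mm256_loadu_si256((const __m256i *)b);

    /* Integer-domain XOR (the intrinsic that maps to vpxor). */
    __m256i ri = _mm256_xor_si256(ia, ib);

    /* FP-domain XOR on the same bits (the intrinsic that maps to vxorpd).
       The casts only reinterpret the register; they generate no code. */
    __m256d rd = _mm256_xor_pd(_mm256_castsi256_pd(ia),
                               _mm256_castsi256_pd(ib));

    uint64_t out_i[4], out_d[4];
    _mm256_storeu_si256((__m256i *)out_i, ri);
    _mm256_storeu_si256((__m256i *)out_d, _mm256_castpd_si256(rd));
    return memcmp(out_i, out_d, sizeof out_i) == 0;
}

/* Returns 1 if both results match, 0 if they differ,
   and -1 if the CPU running the code lacks AVX2. */
int xor_results_match(const uint64_t a[4], const uint64_t b[4])
{
    assert(a != NULL && b != NULL);
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    return xor_results_match_avx2(a, b);
}
```

Calling `xor_results_match` on any two 4-element arrays should return 1 on an AVX2-capable machine, including for bit patterns that would be NaNs when viewed as doubles, since neither instruction interprets the data.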

On Nehalem through Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.

Some CPUs have a "bypass delay" between integer and FP "domains". Agner Fog's microarch docs say that on SnB / IvB, the bypass delay is sometimes zero, e.g. when using the "wrong" type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.

On Skylake, FP booleans can run on any port, but the bypass delay depends on which port they happen to run on (see Intel's optimization manual for a table). Port 5 has no bypass delay between FP math ops, but port 0 and port 1 do. Since the FMA units are on ports 0 and 1, the uop issue stage will usually assign booleans to port 5 in FP-heavy code, because it can see that lots of uops are queued up for p0/p1 while p5 is less busy. (See "How are x86 uops scheduled, exactly?")

I'd recommend not worrying about this. Tune for Haswell, and Skylake will do fine. Or just always use VPXOR on integer data and VXORPS on FP data, and Skylake will do fine (but Haswell might not).

On AMD Bulldozer / Piledriver / Steamroller there is no "FP" version of the boolean ops (see p. 182 of Agner Fog's microarch manual). There's a delay for forwarding data between execution units: 1 cycle for ivec->fp or fp->ivec, 10 cycles for int->ivec (eax -> xmm0), and 8 cycles for ivec->int (8 and 10 on Bulldozer; 4 and 5 on Steamroller for movd/pinsrw/pextrw). So you can't avoid the bypass delay on AMD by using the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD in the non-VEX versions; the VEX versions all take 4 bytes.
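The encoding-size remark can be made concrete by writing the bytes out. The sketch below is an illustration, not something from the answer: it lists the register-to-register forms for xmm0 and xmm1, a case where the two-byte VEX prefix is sufficient (an assumption; operands that need the three-byte VEX prefix encode one byte longer). The `encodings` table and the `encoding_length` function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One machine-code encoding: mnemonic, raw bytes, and byte count. */
struct encoding {
    const char *mnemonic;
    unsigned char bytes[4];
    size_t length;
};

/* Legacy SSE forms are "op xmm0, xmm1"; VEX forms are "op xmm0, xmm0, xmm1". */
static const struct encoding encodings[] = {
    { "xorps",  { 0x0F, 0x57, 0xC1 },       3 }, /* no mandatory prefix  */
    { "xorpd",  { 0x66, 0x0F, 0x57, 0xC1 }, 4 }, /* 66 prefix            */
    { "pxor",   { 0x66, 0x0F, 0xEF, 0xC1 }, 4 }, /* 66 prefix            */
    { "vxorps", { 0xC5, 0xF8, 0x57, 0xC1 }, 4 }, /* 2-byte VEX, pp = 00  */
    { "vxorpd", { 0xC5, 0xF9, 0x57, 0xC1 }, 4 }, /* 2-byte VEX, pp = 01  */
    { "vpxor",  { 0xC5, 0xF9, 0xEF, 0xC1 }, 4 }, /* 2-byte VEX, pp = 01  */
};

/* Returns the encoded length in bytes, or 0 for an unknown mnemonic. */
size_t encoding_length(const char *mnemonic)
{
    assert(mnemonic != NULL);
    for (size_t i = 0; i < sizeof encodings / sizeof encodings[0]; i++)
        if (strcmp(encodings[i].mnemonic, mnemonic) == 0)
            return encodings[i].length;
    return 0;
}
```

In the legacy forms the 66 prefix is what costs XORPD and PXOR their extra byte; under VEX that information is folded into the prefix's pp field, so all three come out the same size.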

In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren't part of the longest dependency chain in your inner loop, or if you can interleave two iterations in parallel (so that out-of-order execution has multiple dependency chains in flight at once), then PXOR may be the way to go.
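Interleaving two iterations, as described above, might look like the following scalar sketch. It only shows the loop structure: two accumulators that do not depend on each other, so extra latency on one chain can overlap with work on the other. The function names are hypothetical, and a real kernel would use vector registers and a longer-latency operation than a plain XOR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single dependency chain: each XOR must wait for the previous one. */
uint64_t xor_reduce_one_chain(const uint64_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= v[i];
    return acc;
}

/* Two independent chains: acc0 and acc1 never feed each other inside
   the loop, so the CPU can work on both at once. */
uint64_t xor_reduce_two_chains(const uint64_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint64_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 ^= v[i];
        acc1 ^= v[i + 1];
    }
    if (i < n)              /* odd element left over */
        acc0 ^= v[i];
    return acc0 ^ acc1;     /* combine the chains once, at the end */
}
```

Both functions return the same value for any input because XOR is associative and commutative; only the shape of the dependency graph differs.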

On Intel CPUs before Skylake, packed-integer instructions can always run on more ports than their floating-point counterparts, so prefer integer ops.
