What's the difference between vextracti128 and vextractf128?

Question

vextracti128 and vextractf128 have the same functionality, parameters, and return values. The only apparent difference is that vextractf128 belongs to the AVX instruction set while vextracti128 belongs to AVX2. What is the actual difference?

Recommended answer

vextracti128 and vextractf128 share more than their functionality, parameters, and return values: they also have the same instruction length and the same throughput (according to Agner Fog's optimization manuals).
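
As a minimal sketch of that equivalence, the two instructions can be reached from C through the intrinsics `_mm256_extractf128_ps` and `_mm256_extracti128_si256`. The function names `extract_high_both` and `extract_results_match` are hypothetical, and the code assumes GCC or Clang on x86-64 (it relies on the `target` attribute and `__builtin_cpu_supports`). The compiler is free to emit either instruction for either intrinsic, so this only demonstrates that the results are bit-identical, not which encoding is chosen.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Extracts the upper 128 bits of the same 256-bit value twice:
   once through the FP-typed intrinsic, once through the integer-typed one. */
__attribute__((target("avx2")))
static void extract_high_both(const uint32_t src[8],
                              uint32_t out_f[4], uint32_t out_i[4])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);

    /* AVX form (nominally vextractf128): operate on the value viewed as floats. */
    __m128 hi_f = _mm256_extractf128_ps(_mm256_castsi256_ps(v), 1);

    /* AVX2 form (nominally vextracti128): operate on the value as integers. */
    __m128i hi_i = _mm256_extracti128_si256(v, 1);

    _mm_storeu_si128((__m128i *)out_f, _mm_castps_si128(hi_f));
    _mm_storeu_si128((__m128i *)out_i, hi_i);
}

/* Returns 1 if both extractions agree and equal src[4..7],
   0 if they differ, and -1 if the CPU does not report AVX2. */
int extract_results_match(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;

    const uint32_t src[8] = { 1, 2, 3, 4, 0xdeadbeefu, 6, 7, 0x7fc00000u };
    uint32_t out_f[4], out_i[4];

    extract_high_both(src, out_f, out_i);

    return memcmp(out_f, out_i, sizeof out_f) == 0 &&
           memcmp(out_f, src + 4, sizeof out_f) == 0;
}
```

Both paths copy bits without interpreting them, which is why even the NaN pattern in the last element survives unchanged.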

What is less clear is their latency (performance in tight loops with dependency chains). The latency of the instructions themselves is 3 cycles. After reading section 2.1.3 ("Execution Engine") of the Intel Optimization Manual, we might suspect that vextracti128 incurs an additional 1-cycle delay when working with floating-point data, and that vextractf128 incurs an additional 1-cycle delay when working with integer data. Measurements show that this is not the case: the latency always remains exactly 3 cycles (at least on Haswell processors). As far as I know, this is not documented anywhere in the Optimization Manual.

Still, an instruction set is only an interface to the processor. Haswell is the only implementation of this interface that contains both instructions (for now). We can ignore the fact that their implementations are (most likely) identical and use the instructions as intended: vextracti128 for integer data and vextractf128 for FP data. (If we only need to reorder data without performing any int/FP operations, the obvious choice is vextractf128, since it is also supported by several older processors.) Experience also shows that Intel sometimes decreases the performance of certain instructions in later CPU generations, so it is wise to respect these instructions' domain affinity to avoid possible slowdowns in the future.

Since the Intel Optimization Manual does not describe the relationship between the int and FP domains for SIMD instructions in much detail, I made some more measurements (on Haswell) and got some interesting results:

There is no additional delay for any transition between SSE integer and shuffle instructions, and no additional delay for any transition between SSE FP and shuffle instructions (though I did not test every instruction). For example, you can insert an "obviously integer" instruction such as pshufb between two FP instructions with no extra delay. Inserting shufpd in the middle of integer code also adds no extra delay.

Since vextracti128 and vextractf128 are executed by the shuffle unit, they also have this "no delay" property.

This may be useful for optimizing mixed int+FP code. If you need to reinterpret FP data as integers and shuffle the register at the same time, just make sure that all FP instructions come before the shuffle and all integer instructions come after it.
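
The ordering described above (FP work, then the shuffle, then integer work) can be sketched with C intrinsics as follows. The helper names `high_exponents` and `high_exponents_demo` are hypothetical, and the code assumes GCC or Clang on x86-64. The compiler may reorder or substitute instructions, so this illustrates the intended structure rather than guaranteeing a particular instruction sequence.

```c
#include <immintrin.h>
#include <stdint.h>

/* Adds two float vectors, extracts the upper 128 bits, then reads the
   biased exponent field of each result using integer instructions. */
__attribute__((target("avx2")))
static void high_exponents(const float a[8], const float b[8], int32_t out[4])
{
    /* 1. All FP-domain work happens before the shuffle. */
    __m256 sum = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));

    /* 2. The lane extraction acts as the boundary between the domains.
          The cast only changes the C type; it generates no instruction. */
    __m128i bits = _mm256_extracti128_si256(_mm256_castps_si256(sum), 1);

    /* 3. All integer-domain work happens after the shuffle. */
    __m128i exponent = _mm_and_si128(_mm_srli_epi32(bits, 23),
                                     _mm_set1_epi32(0xFF));

    _mm_storeu_si128((__m128i *)out, exponent);
}

/* Returns 0 on success, -1 if the CPU does not report AVX2. */
int high_exponents_demo(int32_t out[4])
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;

    /* Upper-half sums are 1.0, 2.0, 4.0 and 8.0. */
    const float a[8] = { 0, 0, 0, 0, 0.5f, 1.0f, 3.0f, 4.0f };
    const float b[8] = { 0, 0, 0, 0, 0.5f, 1.0f, 1.0f, 4.0f };

    high_exponents(a, b, out);
    return 0;
}
```

With these inputs the stored values are the biased exponents of 1.0, 2.0, 4.0 and 8.0, that is 127, 128, 129 and 130.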

andps and the other FP logical instructions also have the property of ignoring the FP/int domains.

If you add an integer logical instruction (such as pand) to FP code, you get an additional 2-cycle delay (one cycle to get to the int domain and another to get back to FP). So the obvious choice for SIMD FP code is andps. The same andps can be used in the middle of integer code without any delay. Even better is to use such instructions right at the boundary between int and FP instructions. Interestingly, FP logical instructions use the same port (port 5) as all shuffle instructions.
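
A common case where this choice comes up is masking the sign bit in FP code. The sketch below uses `_mm_and_ps` (the intrinsic for andps) rather than an integer AND; the function name `abs4` is hypothetical, and the code assumes an x86-64 target where SSE2 is available by default. As with any intrinsic, the compiler may still pick a different encoding.

```c
#include <immintrin.h>

/* Computes the absolute value of four floats by clearing the sign bit
   with an FP logical instruction, keeping the data in the FP domain. */
void abs4(const float in[4], float out[4])
{
    /* Mask with every bit set except the sign bit of each 32-bit lane.
       The cast only changes the C type of the constant. */
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));

    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_and_ps(v, mask));
}
```

Writing the same operation with `_mm_and_si128` would need casts on both sides and, according to the measurements above, would cost two extra cycles when surrounded by FP instructions.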

The Intel Optimization Manual describes bypass delays between producer and consumer micro-ops, but it says nothing about how micro-ops interact with registers.

This piece of code needs only 3 clocks per iteration (exactly the latency of vaddps):

    vxorps ymm7, ymm7, ymm7
_benchloop:
    vaddps ymm0, ymm0, ymm7
    jmp _benchloop

But this one needs 2 clocks per iteration (1 more than vpaddd itself requires):

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpaddd ymm0, ymm0, ymm7
    jmp _benchloop

The only difference here is that the calculation is done in the integer domain instead of the FP domain. To get 1 clock per iteration, we need to add an instruction:

    vpxor ymm7, ymm7, ymm7
_benchloop:
    vpand ymm6, ymm7, ymm7
    vpaddd ymm0, ymm0, ymm6
    jmp _benchloop

This hints that (1) all values stored in SIMD registers belong to the FP domain, and (2) reading from a SIMD register increases an integer operation's latency by one cycle. (The difference between {ymm0, ymm6} and ymm7 here is that ymm7 is stored in some scratch memory and works as a real "register", while ymm0 and ymm6 are temporary and are represented by the state of the CPU's internal interconnections rather than by permanent storage, so ymm0 and ymm6 are not "read" but simply passed between micro-ops.)
