Do I get a performance penalty when mixing SSE integer/float SIMD instructions?


Problem description

I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but in theory should perform equally for both. For example, both float and double vectors have instructions to load the upper 64 bits of a 128-bit vector from an address (movhps, movhpd), but there's no such instruction for integer vectors.

My question:

Are there any reasons to expect a performance hit when using floating point instructions on integer vectors, e.g. using movhps to load data into an integer vector?
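For concreteness, the pattern being asked about might look roughly like the sketch below. The helper name `load_high64_via_movhps` is made up for illustration; the intrinsics themselves are the standard SSE/SSE2 ones, and the casts only reinterpret the register without generating instructions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE1 header) */
#include <stdint.h>

/* Hypothetical helper: replace the upper 64 bits of an integer vector
   with 64 bits read from memory, using the float-domain movhps load.
   The lower 64 bits of v are kept unchanged. */
static inline __m128i load_high64_via_movhps(__m128i v, const void *p)
{
    __m128 f = _mm_castsi128_ps(v);          /* reinterpret only */
    f = _mm_loadh_pi(f, (const __m64 *)p);   /* normally compiles to movhps */
    return _mm_castps_si128(f);              /* reinterpret back */
}
```

Whether the result then feeds an integer instruction such as paddd is what determines if a float-to-integer domain crossing happens at all.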

I wrote several tests to check that, but I suppose their results are not credible. It's really hard to write a correct test that explores all corner cases for such things, especially when the instruction scheduling is most probably involved here.

Related question:

Other trivially similar things also have several instructions that do basically the same thing. For example, I can do a bitwise OR with por, orps or orpd. Can anyone explain what the purpose of these additional instructions is? I guess this might be related to different scheduling algorithms applied to each instruction.
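To illustrate that the three OR variants give the same bits, here is a small sketch. The wrapper names `or_via_por`, `or_via_orps` and `or_via_orpd` are invented for this example; only the underlying intrinsics are real. Note that a compiler may emit a different one of the three encodings than the intrinsic suggests, so checking the generated assembly is advisable before measuring anything.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Integer-domain OR (por). */
static inline __m128i or_via_por(__m128i a, __m128i b)
{
    return _mm_or_si128(a, b);
}

/* Single-precision-domain OR (orps) applied to integer data. */
static inline __m128i or_via_orps(__m128i a, __m128i b)
{
    __m128 r = _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b));
    return _mm_castps_si128(r);
}

/* Double-precision-domain OR (orpd) applied to integer data. */
static inline __m128i or_via_orpd(__m128i a, __m128i b)
{
    __m128d r = _mm_or_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b));
    return _mm_castpd_si128(r);
}
```

All three return identical 128-bit results; any difference is in which execution domain the instruction is issued to, not in the value computed.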

Solution

From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:

There is a penalty for using the wrong type of instructions on some processors. This is because the processor may have different data buses or different execution units for integer and floating point data. Moving data between the integer and floating point units can take one or more clock cycles depending on the processor, as listed in table 13.2.

Processor                       Bypass delay, clock cycles 
  Intel Core 2 and earlier        1 
  Intel Nehalem                   2 
  Intel Sandy Bridge and later    0-1 
  Intel Atom                      0 
  AMD                             2 
  VIA Nano                        2-3 
Table 13.2. Data bypass delays between integer and floating point execution units 
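If the bypass delay matters on a given processor, one option is to stay in the integer domain. The sketch below is not from the quoted manual; it assumes the same "load the upper 64 bits" task from the question and uses a hypothetical helper name, `load_high64_int_domain`. It costs one extra instruction compared with movhps, so whether it is actually faster depends on the processor and the surrounding code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: same effect as a movhps load into an integer
   vector, but built only from integer-domain instructions. */
static inline __m128i load_high64_int_domain(__m128i v, const void *p)
{
    /* 64-bit load into the low half, upper half zeroed (movq). */
    __m128i hi = _mm_loadl_epi64((const __m128i *)p);
    /* Result: low 64 bits from v, high 64 bits from hi (punpcklqdq). */
    return _mm_unpacklo_epi64(v, hi);
}
```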
