Bypass delays when switching execution unit domains


Problem description

I'm trying to understand the bypass delays that can occur when switching between execution unit domains.

For example, the following two lines of code give exactly the same result.

_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));

Which line of code is better to use?

The assembly output for the first line gives:

vpslldq xmm1, xmm0, 8
vaddps  xmm0, xmm1, xmm0

The assembly output for the second line gives:

vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64   ; 00000040H
vaddps  xmm2, xmm1, XMMWORD PTR [rcx]

Now, if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 of using an integer shuffle (pshufd) on float values versus using a float shuffle (shufps) on float values. Switching domains adds 4 extra clock cycles, so the solution using shufps is better.

The first line of code I listed, which uses _mm_slli_si128, has to switch domains between integer and float vectors. The second, which uses _mm_shuffle_ps, stays in the same domain. Doesn't this imply that the second line of code is the better solution?

Solution

Section 2.1.4 in the Intel optimization guide indicates that you (and Agner) are quite right on this matter:

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operation.

So in general it seems you'd be better off keeping within the same stack/domain as much as possible.
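
Before swapping the integer-domain form for the float-domain one, it may be worth confirming that the two really are interchangeable. The following is a minimal sketch in C, not part of the original question: the function names add_shifted_int, add_shifted_fp, variants_agree and self_check are hypothetical and chosen only for illustration. It wraps the two expressions from the question and compares their results bit for bit.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE float intrinsics */
#include <string.h>

/* Variant 1 from the question: integer-domain byte shift feeding a float add.
   _mm_slli_si128 shifts the whole register left by 8 bytes and zero-fills
   the low 8 bytes, so the shifted vector is {0, 0, x[0], x[1]}. */
static __m128 add_shifted_int(__m128 x) {
    __m128i shifted = _mm_slli_si128(_mm_castps_si128(x), 8);
    return _mm_add_ps(x, _mm_castsi128_ps(shifted));
}

/* Variant 2 from the question: float-domain shuffle feeding a float add.
   With immediate 0x40 the shuffle yields {zero[0], zero[0], x[0], x[1]}. */
static __m128 add_shifted_fp(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Hypothetical helper: returns 1 if both variants produce identical bits. */
static int variants_agree(float a, float b, float c, float d) {
    float r1[4], r2[4];
    __m128 x = _mm_setr_ps(a, b, c, d);
    _mm_storeu_ps(r1, add_shifted_int(x));
    _mm_storeu_ps(r2, add_shifted_fp(x));
    return memcmp(r1, r2, sizeof r1) == 0;
}

/* Hypothetical self-check on one sample input. */
static void self_check(void) {
    assert(variants_agree(1.0f, 2.0f, 3.0f, 4.0f));
}
```

For an input of {1, 2, 3, 4} both variants add {0, 0, 1, 2}, so each returns {1, 2, 4, 6}. This only checks the results. It says nothing about which variant is faster.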

Of course benchmarking is always preferred, and all this is worth handling only in case this is indeed a bottleneck in your code.
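
One way to benchmark this is sketched below in C, under stated assumptions. The function names time_int_chain and time_fp_chain are hypothetical, clock() is only a coarse timer, and the iteration count is left to the caller. Each loop feeds its result back into the next iteration, so the running time is dominated by latency, including any bypass delay, rather than by throughput. Compilers are free to emit a different instruction than the intrinsic suggests, so it is worth checking the generated assembly before trusting the numbers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE float intrinsics */
#include <time.h>

/* Times a loop-carried chain of "integer-domain shift, then float add".
   Returns elapsed CPU seconds and stores the final vector in out[0..3],
   which keeps the loop from being optimized away. */
static double time_int_chain(long iters, float *out) {
    assert(iters >= 0);
    __m128 x = _mm_set1_ps(1.0f);
    clock_t start = clock();
    for (long i = 0; i < iters; ++i) {
        x = _mm_add_ps(x, _mm_castsi128_ps(
                _mm_slli_si128(_mm_castps_si128(x), 8)));
    }
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    _mm_storeu_ps(out, x);
    return seconds;
}

/* Same chain, but the shuffle stays in the floating-point domain. */
static double time_fp_chain(long iters, float *out) {
    assert(iters >= 0);
    __m128 x = _mm_set1_ps(1.0f);
    clock_t start = clock();
    for (long i = 0; i < iters; ++i) {
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
    }
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    _mm_storeu_ps(out, x);
    return seconds;
}
```

Calling both functions with the same large iteration count on the target CPU and comparing the returned times gives a rough idea of whether the domain switch matters there. The result depends on the microarchitecture and the compiler, so it should be read as a measurement for one setup, not a general answer.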
