Bypass delays when switching execution unit domains
Problem description
I'm trying to understand possible bypass delays when switching domains of execution units.
For example, the following two lines of code give exactly the same result.
_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
Which line of code is better to use?
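The claimed equivalence can be checked directly. The following is a minimal sketch, assuming an x86 target with SSE2; the function names `add_shifted_int`, `add_shifted_float`, and `variants_match` are hypothetical helpers introduced only for this illustration. With lanes numbered from the low end, both variants add [0, 0, x0, x1] to x.

```c
#include <assert.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_slli_si128 and the cast intrinsics */

/* Variant 1: 8-byte left shift in the integer domain, then a float add.
   The shift moves lanes 0 and 1 up to lanes 2 and 3 and zero-fills below. */
__m128 add_shifted_int(__m128 x) {
    __m128i shifted = _mm_slli_si128(_mm_castps_si128(x), 8);
    return _mm_add_ps(x, _mm_castsi128_ps(shifted));
}

/* Variant 2: float shuffle against zero, then a float add.
   Immediate 0x40 selects zero[0], zero[0], x[0], x[1] for lanes 0..3. */
__m128 add_shifted_float(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Returns 1 if both variants produce bit-identical lanes for `in`. */
int variants_match(const float in[4]) {
    float a[4], b[4];
    assert(in != NULL);
    __m128 x = _mm_loadu_ps(in);
    _mm_storeu_ps(a, add_shifted_int(x));
    _mm_storeu_ps(b, add_shifted_float(x));
    return memcmp(a, b, sizeof a) == 0;
}
```

For the input {1, 2, 3, 4}, both variants yield {1, 2, 4, 6}, so any difference between them is one of performance, not of result.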
The assembly output for the first line gives:
vpslldq xmm1, xmm0, 8
vaddps xmm0, xmm1, xmm0
The assembly output for the second line gives:
vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64 ; 00000040H
vaddps xmm2, xmm1, XMMWORD PTR [rcx]
Now if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 of using an integer shuffle (pshufd) on float values versus using a float shuffle (shufps) on float values. Switching domains adds 4 extra clock cycles, so the solution using shufps is better.
The first line of code I listed, using _mm_slli_si128, has to switch domains between integer and float vectors. The second, using _mm_shuffle_ps, stays in the same domain. Doesn't this imply that the second line of code is the better solution?
Solution
Section 2.1.4 in the Intel optimization guide indicates that you (and Agner) are quite right on this matter:
"When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operation."
So in general it seems you'd be better off keeping within the same stack/domain as much as possible.
Of course benchmarking is always preferred, and all this is worth handling only in case this is indeed a bottleneck in your code.
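One way to benchmark this is to time a long dependent chain, so that latency (including any bypass delay) rather than throughput dominates. The sketch below is only an illustration under stated assumptions: the names `step_int_domain`, `step_float_domain`, and `run_chain` are hypothetical, `clock()` is a coarse timer, and the function-pointer call adds overhead of its own. A compiler may also replace either intrinsic with a different instruction, so the generated assembly should be inspected before trusting any numbers.

```c
#include <assert.h>
#include <time.h>
#include <emmintrin.h>

/* One chain step using the integer-domain byte shift (pslldq). */
__m128 step_int_domain(__m128 x) {
    __m128i shifted = _mm_slli_si128(_mm_castps_si128(x), 8);
    return _mm_add_ps(x, _mm_castsi128_ps(shifted));
}

/* One chain step using the float-domain shuffle (shufps). */
__m128 step_float_domain(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Runs `iters` dependent steps starting from lanes {1, 2, 3, 4}.
   Stores elapsed CPU time in *secs and returns lane 2 of the result,
   so the work cannot be discarded as dead code. */
float run_chain(__m128 (*step)(__m128), long iters, double *secs) {
    assert(iters >= 0 && secs != NULL);
    __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* lane 0 = 1.0f */
    clock_t t0 = clock();
    for (long i = 0; i < iters; ++i)
        x = step(x);  /* each step depends on the previous result */
    *secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    float out[4];
    _mm_storeu_ps(out, x);
    return out[2];  /* lane 2 gains lane 0 (1.0f) on every step */
}
```

Calling `run_chain` once with each step function and a large iteration count, then comparing the two `secs` values, gives a rough indication of whether the domain crossing matters on a given CPU. The difference is expected to vary by microarchitecture.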