Finding min/max. Optimization
Question
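.loop:
    movups  xmm1, [esi+ecx*4]   ; load 4 ints (unaligned load)
    movaps  xmm2, xmm0          ; copy the running max
    movaps  xmm5, xmm1          ; copy the new values
    pcmpgtd xmm2, xmm1          ; mask = running max > new
    andps   xmm0, xmm2          ; keep the max where it is larger
    andnps  xmm2, xmm1          ; take the new values elsewhere
    orps    xmm0, xmm2          ; xmm0 = max(xmm0, xmm1)
    pcmpgtd xmm5, xmm3          ; mask = new > running min
    andps   xmm3, xmm5          ; keep the min where it is smaller
    andnps  xmm5, xmm1          ; take the new values elsewhere
    orps    xmm3, xmm5          ; xmm3 = min(xmm3, xmm1)
.cond:
    add     ecx, 4              ; ecx is a negative index counting up to 0
    js      .loop               ; loop while ecx is still negative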
This is the basic loop for finding the max/min among ints. My processor is an AMD K8. I am not able to count cycles, but I can tell that my code is slower than my friend's, who didn't use SIMD. I cannot understand why. Is this loop not optimal? Do you see a cause?
Recommended answer
K8 only has 64-bit execution units, so every 128-bit instruction is decoded into 2 m-ops. Also, movups is more m-ops than movaps, even when the address is aligned (although according to Agner Fog's tables, it still has the same throughput as movaps: one per 2 cycles).
If you used branches in the scalar version, and the running min and max don't change often, then branch prediction can make it run quite fast.
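As a rough illustration of the kind of scalar loop meant here, a branchy version in C might look like the sketch below. The function name minmax_scalar and its signature are made up for this example (the friend's actual code isn't shown in the question), and a compiler is free to turn these branches into conditional moves, so it is worth checking the generated assembly before drawing conclusions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar baseline; the name and signature are illustrative only. */
void minmax_scalar(const int *a, size_t n, int *min_out, int *max_out)
{
    assert(n > 0);                 /* min/max of an empty array is undefined */
    int lo = a[0];
    int hi = a[0];
    for (size_t i = 1; i < n; ++i) {
        int v = a[i];
        if (v < lo) lo = v;        /* rarely taken once lo has settled */
        if (v > hi) hi = v;        /* rarely taken once hi has settled */
    }
    *min_out = lo;
    *max_out = hi;
}
```

On typical data both conditions are false for most elements, so a well-predicted iteration costs little more than a load and two compares.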
This is one of those cases where SIMD has to do so much more work that it's actually slower than scalar. That said, this SSE2 version might well beat scalar on CPUs with full-width vector units, such as K10 or Merom (or newer).
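To see where the extra work comes from: SSE2 has no packed signed 32-bit max/min instruction, so the loop in the question builds each one out of a compare and three bitwise operations. A scalar C model of a single 32-bit lane could look like the sketch below; the helper names lane_max_select and lane_min_select are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* One 32-bit lane of the question's max update, modelled in scalar C.
   Converting the result back to int32_t assumes the usual two's-complement
   behaviour, which holds on mainstream compilers. */
int32_t lane_max_select(int32_t acc, int32_t x)
{
    uint32_t mask = (acc > x) ? 0xFFFFFFFFu : 0u;   /* pcmpgtd: all ones if acc > x */
    uint32_t keep = (uint32_t)acc & mask;           /* andps:  acc where it wins */
    uint32_t take = ~mask & (uint32_t)x;            /* andnps: x elsewhere */
    return (int32_t)(keep | take);                  /* orps:   merge the halves */
}

/* Same pattern for the min update: keep acc wherever x is greater. */
int32_t lane_min_select(int32_t acc, int32_t x)
{
    uint32_t mask = (x > acc) ? 0xFFFFFFFFu : 0u;   /* pcmpgtd: all ones if x > acc */
    uint32_t keep = (uint32_t)acc & mask;           /* andps */
    uint32_t take = ~mask & (uint32_t)x;            /* andnps */
    return (int32_t)(keep | take);                  /* orps */
}
```

Each helper stands for four vector instructions plus a register copy in the original loop, and on K8 every one of those 128-bit instructions costs 2 m-ops.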
Of course, you'd get far better results with SSE4.1 pmaxsd/pminsd.
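For comparison, here is a hedged sketch of the same reduction written with the SSE4.1 intrinsics that correspond to pmaxsd/pminsd. The function name minmax_sse41 is illustrative. The vector path is only compiled when the compiler defines __SSE4_1__ (for example with -msse4.1 on GCC or Clang); otherwise the scalar loop at the end handles every element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1: _mm_max_epi32 / _mm_min_epi32 */
#endif

/* Illustrative name; assumes n > 0. */
void minmax_sse41(const int32_t *a, size_t n, int32_t *min_out, int32_t *max_out)
{
    assert(n > 0);
    int32_t lo = a[0];
    int32_t hi = a[0];
    size_t i = 0;
#if defined(__SSE4_1__)
    if (n >= 4) {
        /* Seed both accumulators with the first four elements. */
        __m128i vmax = _mm_loadu_si128((const __m128i *)a);
        __m128i vmin = vmax;
        for (i = 4; i + 4 <= n; i += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
            vmax = _mm_max_epi32(vmax, v);   /* pmaxsd */
            vmin = _mm_min_epi32(vmin, v);   /* pminsd */
        }
        /* Reduce the four lanes of each accumulator to one value. */
        int32_t tmax[4], tmin[4];
        _mm_storeu_si128((__m128i *)tmax, vmax);
        _mm_storeu_si128((__m128i *)tmin, vmin);
        for (int k = 0; k < 4; ++k) {
            if (tmax[k] > hi) hi = tmax[k];
            if (tmin[k] < lo) lo = tmin[k];
        }
    }
#endif
    /* Scalar tail (or the whole array when the vector path is unavailable). */
    for (; i < n; ++i) {
        if (a[i] < lo) lo = a[i];
        if (a[i] > hi) hi = a[i];
    }
    *min_out = lo;
    *max_out = hi;
}
```

In the vector path, each accumulator update is a single instruction instead of the copy, compare, and three bitwise operations used in the question's loop.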