Finding min/max. Optimization

Question

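.loop:

    movups xmm1, [esi+ecx*4]   ; load 4 ints (unaligned load)
    movaps xmm2, xmm0          ; xmm2 = copy of the running max
    movaps xmm5, xmm1          ; xmm5 = copy of the new values

    pcmpgtd xmm2, xmm1         ; mask: max > new

    andps  xmm0, xmm2          ; keep the old max where max > new
    andnps xmm2, xmm1          ; take the new values elsewhere
    orps   xmm0, xmm2          ; xmm0 = updated max

    pcmpgtd xmm5, xmm3         ; mask: new > min

    andps  xmm3, xmm5          ; keep the old min where new > min
    andnps xmm5, xmm1          ; take the new values elsewhere
    orps   xmm3, xmm5          ; xmm3 = updated min

.cond:
    add ecx, 4                 ; advance by 4 ints (ecx is a negative index counting up to 0)
    js .loop                   ; loop while ecx is still negative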

This is the basic loop for finding the max/min among ints. My processor is an AMD K8. I am not able to count cycles, but I can tell that my code is slower than that of my friend, who didn't use SIMD. I cannot understand why. Is this loop not optimal? Do you see a cause?

Recommended answer

K8 only has 64-bit execution units, so every 128-bit instruction is decoded into 2 m-ops. Also, movups decodes to more m-ops than movaps, even when the address is aligned. (Although according to Agner Fog's tables, it still has the same one-per-2-cycles throughput as movaps.)

If you used branches in the scalar version, and the min and max don't change often, then branch prediction can make it run quite fast.
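For comparison, here is a minimal sketch of what such a branchy scalar version might look like in C. The function name `minmax_scalar` and its signature are hypothetical, and the friend's actual code is not shown in the question. Once the running min and max are close to their final values, both branches are almost never taken and so are easy to predict. Note that a compiler may turn these branches into conditional moves, in which case branch prediction no longer matters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scalar min/max using plain branches.
   Requires n > 0. */
void minmax_scalar(const int *a, size_t n, int *out_min, int *out_max)
{
    assert(n > 0);
    int lo = a[0];
    int hi = a[0];
    for (size_t i = 1; i < n; i++) {
        int v = a[i];
        if (v > hi) {          /* rarely taken once hi is near the true max */
            hi = v;
        } else if (v < lo) {   /* rarely taken once lo is near the true min */
            lo = v;
        }
    }
    *out_min = lo;
    *out_max = hi;
}
```

The `else if` is safe because a value greater than the current max cannot also be less than the current min.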

This is one of those cases where SIMD has to do so much more work that it's actually slower than scalar. That said, this SSE2 version might actually be faster than scalar on CPUs with full-width vector units, like K10 or Merom (or newer).
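To make the extra work visible, the question's compare-and-blend loop can be sketched with SSE2 intrinsics in C. This is an illustrative rendering, not the asker's code: the name `minmax_sse2`, the requirement `n >= 4`, and the scalar tail handling are assumptions added here. For every 4 ints, each of min and max costs one compare plus three logical operations.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper mirroring the question's loop. Requires n >= 4. */
void minmax_sse2(const int *a, size_t n, int *out_min, int *out_max)
{
    assert(n >= 4);
    __m128i vmax = _mm_loadu_si128((const __m128i *)a);
    __m128i vmin = vmax;
    size_t i = 4;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        /* max: keep vmax where vmax > v, otherwise take v */
        __m128i keep_max = _mm_cmpgt_epi32(vmax, v);
        vmax = _mm_or_si128(_mm_and_si128(keep_max, vmax),
                            _mm_andnot_si128(keep_max, v));
        /* min: keep vmin where v > vmin, otherwise take v */
        __m128i keep_min = _mm_cmpgt_epi32(v, vmin);
        vmin = _mm_or_si128(_mm_and_si128(keep_min, vmin),
                            _mm_andnot_si128(keep_min, v));
    }
    /* Reduce the 4 lanes to single values. */
    int lanes_max[4], lanes_min[4];
    _mm_storeu_si128((__m128i *)lanes_max, vmax);
    _mm_storeu_si128((__m128i *)lanes_min, vmin);
    int hi = lanes_max[0], lo = lanes_min[0];
    for (int k = 1; k < 4; k++) {
        if (lanes_max[k] > hi) hi = lanes_max[k];
        if (lanes_min[k] < lo) lo = lanes_min[k];
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        if (a[i] > hi) hi = a[i];
        if (a[i] < lo) lo = a[i];
    }
    *out_min = lo;
    *out_max = hi;
}
```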

Of course, you'd get far better results with SSE4.1 pmaxsd/pminsd.
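As a rough sketch of that variant, the intrinsics `_mm_max_epi32` and `_mm_min_epi32` correspond to pmaxsd and pminsd, so each compare-and-blend sequence collapses into a single instruction. The function name `minmax_sse41` and the tail handling are hypothetical, and the `target("sse4.1")` attribute assumes GCC or Clang (the alternative is compiling with `-msse4.1`). K8 itself does not implement SSE4.1, so this only helps on newer CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Hypothetical helper. Requires n >= 4 and a CPU with SSE4.1.
   The target attribute is a GCC/Clang extension (an assumption here). */
__attribute__((target("sse4.1")))
void minmax_sse41(const int *a, size_t n, int *out_min, int *out_max)
{
    assert(n >= 4);
    __m128i vmax = _mm_loadu_si128((const __m128i *)a);
    __m128i vmin = vmax;
    size_t i = 4;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        vmax = _mm_max_epi32(vmax, v);  /* pmaxsd */
        vmin = _mm_min_epi32(vmin, v);  /* pminsd */
    }
    /* Reduce the 4 lanes to single values. */
    int lanes_max[4], lanes_min[4];
    _mm_storeu_si128((__m128i *)lanes_max, vmax);
    _mm_storeu_si128((__m128i *)lanes_min, vmin);
    int hi = lanes_max[0], lo = lanes_min[0];
    for (int k = 1; k < 4; k++) {
        if (lanes_max[k] > hi) hi = lanes_max[k];
        if (lanes_min[k] < lo) lo = lanes_min[k];
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        if (a[i] > hi) hi = a[i];
        if (a[i] < lo) lo = a[i];
    }
    *out_min = lo;
    *out_max = hi;
}
```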
