Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?


Question

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!

I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}
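The `cyclecounter` object in the loop above is not defined in the question. A minimal sketch of one possible implementation, using the x86 time-stamp counter (the type and function names here are hypothetical, as is the cycles-to-milliseconds conversion, which assumes you know the CPU frequency):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc */

/* Hypothetical stand-in for the question's undefined `cyclecounter`:
   samples the time-stamp counter around the region being measured. */
typedef struct { uint64_t start, stop; } CycleCounter;

static void cc_start(CycleCounter *c) { c->start = __rdtsc(); }
static void cc_stop (CycleCounter *c) { c->stop  = __rdtsc(); }

/* Convert elapsed cycles to milliseconds for a given CPU frequency. */
static double cc_milliseconds(const CycleCounter *c, double cpu_hz)
{
    return (double)(c->stop - c->start) / cpu_hz * 1000.0;
}
```

Note that on modern CPUs the TSC ticks at a fixed reference rate rather than the current core clock, so cycle counts from it are wall-clock based, not true core cycles.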

I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }

The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}

This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
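The Newton-Raphson trick mentioned above is the well-known Quake III fast inverse square root. A minimal sketch (using the classic published magic constant and refinement step, not necessarily the exact code the question benchmarked):

```c
#include <stdint.h>

/* Classic fast inverse square root: a bit-level initial guess,
   refined by one Newton-Raphson step for f(r) = 1/r^2 - x. */
static float fast_rsqrt(float x)
{
    union { float f; uint32_t i; } u;
    u.f = x;
    u.i = 0x5f3759df - (u.i >> 1);        /* magic initial guess      */
    u.f *= 1.5f - 0.5f * x * u.f * u.f;   /* one Newton-Raphson step  */
    return u.f;
}
```

To get `sqrt(x)` from this, multiply: `x * fast_rsqrt(x)`. The single refinement step leaves a maximum relative error on the order of 1 in 2^10, matching the accuracy the question reports.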

The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}

My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?

I'm sure that this is really the cost of the op itself, because I've verified:


  • all the data fits in cache, and accesses are sequential

  • the functions are inlined

  • unrolling the loop makes no difference

  • compiler flags are set to full optimization (and the assembly is good; I checked)

(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

Answer

sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal square root, accurate to about 11 bits.

sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
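The sequence alluded to above can be sketched as follows: one Newton-Raphson step applied to the `rsqrtss` estimate, then a multiply by x to turn 1/√x into √x. This is a sketch, not Intel's exact recommended code, and note it mishandles x = 0 (rsqrt returns +inf, and inf * 0 produces NaN):

```c
#include <xmmintrin.h>

/* sqrt(x) via rsqrtss refined by one Newton-Raphson step:
   r' = r * (1.5 - 0.5 * x * r * r), then sqrt(x) = x * r'. */
static float sqrt_rsqrt_nr(float x)
{
    __m128 v    = _mm_set_ss(x);
    __m128 r    = _mm_rsqrt_ss(v);          /* ~12-bit estimate of 1/sqrt(x) */
    __m128 half = _mm_set_ss(0.5f);
    __m128 th   = _mm_set_ss(1.5f);
    r = _mm_mul_ss(r, _mm_sub_ss(th,
            _mm_mul_ss(_mm_mul_ss(half, v), _mm_mul_ss(r, r))));
    return _mm_cvtss_f32(_mm_mul_ss(v, r)); /* x * (1/sqrt(x)) = sqrt(x) */
}
```

The refinement step roughly doubles the number of correct bits, taking the ~12-bit estimate to the ~22-23 bits the answer mentions.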

edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
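A packed sketch of that suggestion, processing four floats per instruction (the ~12-bit `rsqrtps` estimate is used directly here, without a refinement step, so this trades accuracy for speed just like the scalar version in the question):

```c
#include <xmmintrin.h>

/* Approximate sqrt of four floats at once: out[i] = in[i] * rsqrt(in[i]). */
static void sqrt4_approx(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_mul_ps(v, _mm_rsqrt_ps(v)));
}
```

With 16-byte-aligned buffers, `_mm_load_ps`/`_mm_store_ps` can replace the unaligned variants.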
