Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?


Question

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!

I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}
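The `cyclecounter` object in the loop above is not defined in the question. A minimal sketch of one possible implementation, using the x86 time-stamp counter (the type and function names here are hypothetical, as is the cycles-to-milliseconds conversion, which assumes you know the CPU frequency):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc */

/* Hypothetical stand-in for the question's undefined `cyclecounter`:
   samples the time-stamp counter around the region being measured. */
typedef struct { uint64_t start, stop; } CycleCounter;

static void cc_start(CycleCounter *c) { c->start = __rdtsc(); }
static void cc_stop (CycleCounter *c) { c->stop  = __rdtsc(); }

/* Convert elapsed cycles to milliseconds for a given CPU frequency. */
static double cc_milliseconds(const CycleCounter *c, double cpu_hz)
{
    return (double)(c->stop - c->start) / cpu_hz * 1000.0;
}
```

Note that on modern CPUs the TSC ticks at a fixed reference rate rather than the current core clock, so cycle counts from it are wall-clock based, not true core cycles.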

I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }

The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}

This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
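The Newton-Raphson trick mentioned above is the well-known Quake III fast inverse square root. A minimal sketch (using the classic published magic constant and refinement step, not necessarily the exact code the question benchmarked):

```c
#include <stdint.h>

/* Classic fast inverse square root: a bit-level initial guess,
   refined by one Newton-Raphson step for f(r) = 1/r^2 - x. */
static float fast_rsqrt(float x)
{
    union { float f; uint32_t i; } u;
    u.f = x;
    u.i = 0x5f3759df - (u.i >> 1);        /* magic initial guess      */
    u.f *= 1.5f - 0.5f * x * u.f * u.f;   /* one Newton-Raphson step  */
    return u.f;
}
```

To get `sqrt(x)` from this, multiply: `x * fast_rsqrt(x)`. The single refinement step leaves a maximum relative error on the order of 1 in 2^10, matching the accuracy the question reports.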

The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}

My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?

I'm sure that this is really the cost of the op itself, because I've verified:


  • all the data fits in cache, and accesses are sequential

  • the functions are inlined

  • unrolling the loop makes no difference

  • compiler flags are set to full optimization (and the assembly is good; I checked)

(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

Answer

sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal square root, accurate to about 11 bits.

sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
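The sequence alluded to above can be sketched as follows: one Newton-Raphson step applied to the `rsqrtss` estimate, then a multiply by x to turn 1/√x into √x. This is a sketch, not Intel's exact recommended code, and note it mishandles x = 0 (rsqrt returns +inf, and inf * 0 produces NaN):

```c
#include <xmmintrin.h>

/* sqrt(x) via rsqrtss refined by one Newton-Raphson step:
   r' = r * (1.5 - 0.5 * x * r * r), then sqrt(x) = x * r'. */
static float sqrt_rsqrt_nr(float x)
{
    __m128 v    = _mm_set_ss(x);
    __m128 r    = _mm_rsqrt_ss(v);          /* ~12-bit estimate of 1/sqrt(x) */
    __m128 half = _mm_set_ss(0.5f);
    __m128 th   = _mm_set_ss(1.5f);
    r = _mm_mul_ss(r, _mm_sub_ss(th,
            _mm_mul_ss(_mm_mul_ss(half, v), _mm_mul_ss(r, r))));
    return _mm_cvtss_f32(_mm_mul_ss(v, r)); /* x * (1/sqrt(x)) = sqrt(x) */
}
```

The refinement step roughly doubles the number of correct bits, taking the ~12-bit estimate to the ~22-23 bits the answer mentions.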

edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
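A packed sketch of that suggestion, processing four floats per instruction (the ~12-bit `rsqrtps` estimate is used directly here, without a refinement step, so this trades accuracy for speed just like the scalar version in the question):

```c
#include <xmmintrin.h>

/* Approximate sqrt of four floats at once: out[i] = in[i] * rsqrt(in[i]). */
static void sqrt4_approx(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_mul_ps(v, _mm_rsqrt_ps(v)));
}
```

With 16-byte-aligned buffers, `_mm_load_ps`/`_mm_store_ps` can replace the unaligned variants.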
