Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?


Question


I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!


I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}


I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }


The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}


This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).


The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}


My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?


I'm sure that this is really the cost of the op itself, because I've verified:

  • All of the data fits in cache, and accesses are sequential
  • The functions are inlined
  • Unrolling the loop makes no difference
  • Compiler flags are set to full optimization (and the assembly is good, I checked)


(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

Answer


sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal, accurate to about 11 bits.


sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
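A sketch of that sequence, assuming the standard Newton-Raphson refinement y' = y(1.5 − 0.5·x·y²) — the function name is mine, not from the Intel docs:

```cpp
#include <xmmintrin.h>

// rsqrtss gives ~12 bits; one Newton-Raphson step on y ≈ 1/sqrt(x)
// brings it to roughly 22-23 bits, then multiply by x to get sqrt(x).
inline float SSESqrt_NR(float x)
{
    __m128 vx = _mm_set_ss(x);
    __m128 y  = _mm_rsqrt_ss(vx);                   // ~12-bit estimate of 1/sqrt(x)
    __m128 halfx = _mm_mul_ss(_mm_set_ss(0.5f), vx);
    __m128 yy = _mm_mul_ss(y, y);
    // y = y * (1.5 - 0.5 * x * y * y)
    y = _mm_mul_ss(y, _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(halfx, yy)));
    float out;
    _mm_store_ss(&out, _mm_mul_ss(vx, y));          // sqrt(x) = x * 1/sqrt(x)
    return out;
}
```

One caveat: unlike sqrtss, this produces NaN for x = 0 (rsqrt gives infinity, and 0 × ∞ is NaN), so the caller must special-case zero if it can occur.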


edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
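For the array case, that could look something like the following (a hypothetical helper, assuming 16-byte-aligned buffers and a count that is a multiple of four):

```cpp
#include <xmmintrin.h>

// Four square roots per iteration: packed rsqrt estimate plus one
// Newton-Raphson step, then multiply by x to recover sqrt(x).
void SqrtArray(float* __restrict pOut, const float* __restrict pIn, int count)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    for (int i = 0; i < count; i += 4)
    {
        __m128 x = _mm_load_ps(pIn + i);            // requires 16-byte alignment
        __m128 y = _mm_rsqrt_ps(x);                 // ~12-bit 1/sqrt(x), four lanes
        // y = 0.5 * y * (3 - x*y*y): one Newton-Raphson refinement
        y = _mm_mul_ps(_mm_mul_ps(half, y),
                       _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));
        _mm_store_ps(pOut + i, _mm_mul_ps(x, y));   // sqrt = x * 1/sqrt(x)
    }
}
```

With the Newton-Raphson step folded in, this keeps nearly full single precision while still amortizing the work over four elements per instruction.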
