Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?


Question


I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!


I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}


I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }


The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}

This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).

The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}


My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?


I'm sure that this is really the cost of the op itself, because I've verified:

  • All the data fits in cache, and accesses are sequential
  • The functions are inlined
  • Unrolling the loop makes no difference
  • Compiler flags are set to full optimization (and the assembly is good, I checked)


(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

Answer


sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal, accurate to about 11 bits.


sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.


edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
