Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?


Question


I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!


I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}


I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }


The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}

This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).

The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}


My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?


I'm sure that this is really the cost of the op itself, because I've verified:

  • All the data fits in cache, and accesses are sequential
  • The functions are inlined
  • Unrolling the loop makes no difference
  • Compiler flags are set to full optimization (and the assembly is good, I checked)


(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

Answer


sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal, accurate to about 11 bits.


sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.


edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
