Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?


Question


I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!


I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}


I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }


The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}


This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).


The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}


My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?


I'm sure that this is really the cost of the op itself, because I've verified:

  • All of the data fits in cache, and accesses are sequential
  • The functions are inlined
  • Unrolling the loop makes no difference
  • Compiler flags are set to full optimization (and the assembly is good, I checked)


(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

Answer


sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal, accurate to about 11 bits.


sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
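A sketch of that sequence, assuming the standard Newton-Raphson refinement y' = y(1.5 − 0.5·x·y²) — the function name is mine, not from the Intel docs:

```cpp
#include <xmmintrin.h>

// rsqrtss gives ~12 bits; one Newton-Raphson step on y ≈ 1/sqrt(x)
// brings it to roughly 22-23 bits, then multiply by x to get sqrt(x).
inline float SSESqrt_NR(float x)
{
    __m128 vx = _mm_set_ss(x);
    __m128 y  = _mm_rsqrt_ss(vx);                   // ~12-bit estimate of 1/sqrt(x)
    __m128 halfx = _mm_mul_ss(_mm_set_ss(0.5f), vx);
    __m128 yy = _mm_mul_ss(y, y);
    // y = y * (1.5 - 0.5 * x * y * y)
    y = _mm_mul_ss(y, _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(halfx, yy)));
    float out;
    _mm_store_ss(&out, _mm_mul_ss(vx, y));          // sqrt(x) = x * 1/sqrt(x)
    return out;
}
```

One caveat: unlike sqrtss, this produces NaN for x = 0 (rsqrt gives infinity, and 0 × ∞ is NaN), so the caller must special-case zero if it can occur.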


edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
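For the array case, that could look something like the following (a hypothetical helper, assuming 16-byte-aligned buffers and a count that is a multiple of four):

```cpp
#include <xmmintrin.h>

// Four square roots per iteration: packed rsqrt estimate plus one
// Newton-Raphson step, then multiply by x to recover sqrt(x).
void SqrtArray(float* __restrict pOut, const float* __restrict pIn, int count)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    for (int i = 0; i < count; i += 4)
    {
        __m128 x = _mm_load_ps(pIn + i);            // requires 16-byte alignment
        __m128 y = _mm_rsqrt_ps(x);                 // ~12-bit 1/sqrt(x), four lanes
        // y = 0.5 * y * (3 - x*y*y): one Newton-Raphson refinement
        y = _mm_mul_ps(_mm_mul_ps(half, y),
                       _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));
        _mm_store_ps(pOut + i, _mm_mul_ps(x, y));   // sqrt = x * 1/sqrt(x)
    }
}
```

With the Newton-Raphson step folded in, this keeps nearly full single precision while still amortizing the work over four elements per instruction.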
