Fast inverse square of double in C/C++


Question

Recently I was profiling a program, and the hotspot is definitely this:

double d = somevalue();
double d2=d*d;
double c = 1.0/d2;   // HOT SPOT

The value d2 is not used afterwards, because I only need the value c. Some time ago I read about Carmack's method for the fast inverse square root. That is obviously not the same problem, but I'm wondering whether a similar algorithm could help me compute 1/x^2.

I need quite accurate results: I've checked that my program doesn't give correct results with the gcc -ffast-math option (g++-4.5).

Recommended answer

The tricks for doing fast square roots and the like get their performance by sacrificing precision. (Well, most of them.)

  1. Are you sure you need double precision? You can sacrifice precision easily enough:

double d = somevalue();
float c = 1.0f / ((float) d * (float) d);

The 1.0f is absolutely mandatory in this case; if you use 1.0 instead, you will get double precision.
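To get a feel for how much accuracy the float version gives up, here is a minimal sketch that compares it against the double computation. The function names are made up for illustration and are not part of the original code.

```c
#include <assert.h>

/* Double-precision reference: 1/d^2 computed entirely in double. */
static double inv_sq_double(double d)
{
    assert(d != 0.0);
    return 1.0 / (d * d);
}

/* Single-precision variant from the answer: the 1.0f literal keeps the
   division in float; writing 1.0 would promote it back to double. */
static float inv_sq_float(double d)
{
    assert(d != 0.0);
    return 1.0f / ((float) d * (float) d);
}

/* Relative difference between the two results (hypothetical helper). */
static double inv_sq_rel_diff(double d)
{
    double ref = inv_sq_double(d);
    double diff = (double) inv_sq_float(d) - ref;
    if (diff < 0.0)
        diff = -diff;
    return diff / ref;
}
```

A float carries a 24-bit significand, so the relative difference should land somewhere around 1e-7 for typical inputs. Whether that is acceptable depends entirely on what the rest of the program does with c.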

  2. Have you tried enabling "sloppy" math on your compiler? On GCC you can use -ffast-math, and there are similar options for other compilers. The sloppy math may be more than good enough for your application. (Edit: I did not see any difference in the resulting assembly.)

  3. If you are using GCC, have you considered using -mrecip? There is a "reciprocal estimate" instruction which only has about 12 bits of precision, but it is much faster than a division. You can use the Newton-Raphson method to increase the precision of the result. The -mrecip option will cause the compiler to automatically generate the reciprocal estimate and Newton-Raphson steps for you, although you can always write the assembly yourself if you want to fine-tune the performance-precision trade-off. (Newton-Raphson converges very quickly.) (Edit: I was unable to get GCC to generate RCPSS. See below.)

I found a blog post discussing the exact problem you are going through, and the author's conclusion is that techniques like the Carmack method are not competitive with the RCPSS instruction (which the -mrecip flag on GCC uses).

Division can be so slow because processors generally have only one division unit, and it is often not pipelined. So you can have a few multiplications in the pipe all executing simultaneously, but no division can be issued until the previous division finishes.

  1. Carmack's method: It is obsolete on modern processors, which have reciprocal estimation opcodes. For reciprocals, the best version I've seen only gives one bit of precision, which is nothing compared to the 12 bits of RCPSS. I think it is a coincidence that the trick works so well for reciprocal square roots, and that coincidence is unlikely to be repeated.

  2. Relabeling variables. As far as the compiler is concerned, there is very little difference between 1.0/(x*x) and double x2 = x*x; 1.0/x2. I would be surprised if you found a compiler that generates different code for the two versions with optimizations turned on, even at the lowest level.
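The two spellings being compared can be written side by side as follows. This is only an illustrative sketch with made-up function names; comparing the assembly your own compiler emits for both (for example with gcc -O1 -S) is the way to confirm the claim on your setup.

```c
#include <assert.h>

/* Inverse square written as a single expression. */
static double inv_sq_inline(double x)
{
    assert(x != 0.0);
    return 1.0 / (x * x);
}

/* Same computation with the square stored in a named temporary. */
static double inv_sq_named(double x)
{
    double x2;
    assert(x != 0.0);
    x2 = x * x;
    return 1.0 / x2;
}
```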

  3. Using pow. The pow library function is a total monster. With GCC's -ffast-math turned off, the library call is fairly expensive. With GCC's -ffast-math turned on, you get exactly the same assembly code for pow(x, -2) as you do for 1.0/(x*x), so there is no benefit.

Update

Here is an example of a Newton-Raphson approximation for the inverse square of a double-precision floating-point value.

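#ifndef RECIP_ITER
#define RECIP_ITER 1   /* number of Newton-Raphson refinement steps */
#endif

/* Approximates 1/(x*x): RCPSS estimate of 1/x, refined, then squared. */
static double invsq(double x)
{
    double y;
    int i;
    /* y = single-precision reciprocal estimate of x (about 12 bits). */
    __asm__ (
        "cvtpd2ps %1, %0\n\t"
        "rcpss %0, %0\n\t"
        "cvtps2pd %0, %0"
        : "=x"(y)
        : "x"(x));
    /* Each step refines y toward 1/x. */
    for (i = 0; i < RECIP_ITER; ++i)
        y *= 2 - x * y;
    return y * y;
}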

Unfortunately, with RECIP_ITER=1, benchmarks on my computer put it slightly slower (~5%) than the simple version 1.0/(x*x). It is faster (about 2x as fast) with zero iterations, but then you only get 12 bits of precision. I don't know if 12 bits is enough for you.
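To see how the iteration count trades off against precision without relying on inline assembly, here is a portable sketch. It assumes the hardware estimate can be modelled as the exact reciprocal with a relative error of 2^-12 injected; rough_recip is a hypothetical stand-in for RCPSS, not the real instruction, and the helper names are made up.

```c
#include <assert.h>

/* Hypothetical stand-in for RCPSS: the exact reciprocal with a relative
   error of 2^-12 injected, i.e. roughly 12 correct bits. */
static double rough_recip(double x)
{
    assert(x != 0.0);
    return (1.0 / x) * (1.0 + 1.0 / 4096.0);
}

/* Inverse square via `iters` Newton-Raphson steps on the rough estimate. */
static double invsq_newton(double x, int iters)
{
    double y = rough_recip(x);
    int i;
    for (i = 0; i < iters; ++i)
        y *= 2.0 - x * y;   /* relative error e becomes about e*e */
    return y * y;
}

/* Relative error of invsq_newton against the plain 1.0/(x*x). */
static double invsq_rel_error(double x, int iters)
{
    double ref = 1.0 / (x * x);
    double err = invsq_newton(x, iters) - ref;
    if (err < 0.0)
        err = -err;
    return err / ref;
}
```

Under this model each step roughly doubles the number of correct bits, so a 12-bit start needs about three steps before the result is limited by double rounding. That is consistent with the benchmark above: if one step already costs as much as the division, a full double-precision result obtained this way is unlikely to be a win.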

I think one of the problems here is that this is too small of a micro-optimization; at this scale the compiler writers are on nearly equal footing with the assembly hackers. Maybe if we had the bigger picture we could see a way to make it faster.

For example, you said that -ffast-math caused an undesirable loss of precision; this may indicate a numerical stability problem in the algorithm you are using. With the right choice of algorithm, many problems can be solved with float instead of double. (Of course, you may just need more than 24 bits. I don't know.)

I suspect the RCPSS method shines if you want to compute several of these in parallel.
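As a rough illustration of the parallel case, the sketch below uses the packed form of the instruction through the SSE intrinsic _mm_rcp_ps to process four floats per iteration, with one Newton-Raphson step each. It works in single precision only, so it does not address the double-precision requirement in the question. The function name invsq4_float is made up, and the scalar fallback is there for the tail elements and for builds without SSE.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: out[i] ~= 1/(in[i]*in[i]) in single precision. */
static void invsq4_float(const float *in, float *out, size_t n)
{
    size_t i = 0;
    assert(in != NULL && out != NULL);
#if defined(__SSE__)
    {
        const __m128 two = _mm_set1_ps(2.0f);
        for (; i + 4 <= n; i += 4) {
            __m128 x = _mm_loadu_ps(in + i);
            __m128 y = _mm_rcp_ps(x);   /* ~12-bit estimate of 1/x, 4 lanes */
            /* One Newton-Raphson step: y = y * (2 - x * y). */
            y = _mm_mul_ps(y, _mm_sub_ps(two, _mm_mul_ps(x, y)));
            _mm_storeu_ps(out + i, _mm_mul_ps(y, y));
        }
    }
#endif
    /* Scalar path for the remaining elements (or all, without SSE). */
    for (; i < n; ++i)
        out[i] = 1.0f / (in[i] * in[i]);
}
```

Whether this actually beats a plain loop of divisions depends on the processor and on how the surrounding code is laid out, so it would need to be benchmarked in place.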
