Fast method to multiply integer by proper fraction without floats or overflow

Question

My program frequently requires the following calculation to be performed:

Given:

  • N is a 32-bit integer
  • D is a 32-bit integer
  • abs(N) <= abs(D)
  • D != 0
  • X is a 32-bit integer of any value

Find:

  • X * N / D as a rounded integer, that is, X scaled by N/D (e.g. 10 * 2 / 3 = 7)

Obviously I could just use r=x*n/d directly but I will often get overflow from the x*n. If I instead do r=x*(n/d) then I only get 0 or x due to integer division dropping the fractional component. And then there's r=x*(float(n)/d) but I can't use floats in this case.
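To make the two failure modes concrete, here is a minimal sketch in C. The function names are hypothetical, and it uses unsigned 32-bit values so that the wraparound is well defined (with signed int, as in the question, the overflow would be undefined behavior instead).

```c
#include <assert.h>
#include <stdint.h>

/* Naive form: the 32-bit product x * n wraps around before the divide. */
static uint32_t scale_naive(uint32_t x, uint32_t n, uint32_t d)
{
    assert(d != 0);
    return x * n / d;
}

/* Divide first: n / d truncates to 0 (n < d) or 1 (n == d),
 * so the result can only ever be 0 or x. */
static uint32_t scale_divide_first(uint32_t x, uint32_t n, uint32_t d)
{
    assert(d != 0);
    return x * (n / d);
}
```

For small inputs such as 10 * 2 / 3 the naive form only loses the rounding, but with x = 3000000, n = 2000, d = 3000 the product exceeds 32 bits and the result is nowhere near the exact 2000000.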

Accuracy would be great but isn't as critical as speed and being a deterministic function (always returning the same value given the same inputs).

N and D are currently signed but I could work around them being always unsigned if it helps.

A generic function that works with any value of X (and N and D, as long as N <= D) is ideal since this operation is used in various different ways but I also have a specific case where the value of X is a known constant power of 2 (2048, to be precise), and just getting that specific call sped up would be a big help.

Currently I am accomplishing this using 64-bit multiply and divide to avoid overflow (essentially int multByProperFraction(int x, int n, int d) { return (__int64)x * n / d; } but with some asserts and extra bit fiddling for rounding instead of truncating).
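The actual asserts and rounding code are not shown, so the following is only a sketch of what such a 64-bit baseline might look like. The function name and the round-half-away-from-zero policy are assumptions, and it assumes the result fits in 32 bits, which holds when abs(n) <= abs(d) except for the corner case where x is INT32_MIN and n/d is -1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: scale x by n/d using a 64-bit intermediate,
 * rounding halves away from zero. */
static int32_t mult_by_proper_fraction_ref(int32_t x, int32_t n, int32_t d)
{
    assert(d != 0);
    int64_t num = (int64_t)x * n;   /* cannot overflow in 64 bits */
    int64_t den = d;
    if (den < 0) {                  /* make the divisor positive */
        num = -num;
        den = -den;
    }
    int64_t half = den / 2;
    /* C division truncates toward zero, so bias toward the sign of num. */
    int64_t q = (num >= 0) ? (num + half) / den : (num - half) / den;
    return (int32_t)q;
}
```

The single 64-bit divide in this function is the operation the profiler flags on a 32-bit target.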

Unfortunately, my profiler is reporting the 64-bit divide function as taking up way too much CPU (this is a 32-bit application). I've tried to reduce how often I need to do this calculation but am running out of ways around it, so I'm trying to figure out a faster method, if it is even possible. In the specific case where X is a constant 2048, I use a bit shift instead of multiply but that doesn't help much.

Recommended answer

I've now benchmarked several possible solutions, including weird/clever ones from other sources like combining 32-bit div & mod & add or using peasant math, and here are my conclusions:

First, if you are only targeting Windows and using VSC++, just use MulDiv(). It is quite fast (faster than directly using 64-bit variables in my tests) while still being just as accurate and rounding the result for you. I could not find any superior method to do this kind of thing on Windows with VSC++, even taking into account restrictions like unsigned-only and N <= D.

However, in my case having a function with deterministic results even across platforms is even more important than speed. On another platform I was using as a test, the 64-bit divide is much, much slower than the 32-bit one when using the 32-bit libraries, and there is no MulDiv() to use. The 64-bit divide on this platform takes ~26x as long as a 32-bit divide (yet the 64-bit multiply is just as fast as the 32-bit version...).

So if you have a case like me, I will share the best results I got, which turned out to be just optimizations of chux's answer.

Both of the methods I will share below make use of the following function (though the compiler-specific intrinsics only actually helped in speed with MSVC in Windows):

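#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>   // _BitScanReverse
#endif

// Assumed definitions: u32/u64 are fixed-width unsigned types.
typedef uint32_t u32;
typedef uint64_t u64;

// Number of bits needed to represent val (returns 1 for val == 0).
inline u32 bitsRequired(u32 val)
{
    #ifdef _MSC_VER
        unsigned long r = 0;
        _BitScanReverse(&r, val | 1);   // index of the highest set bit
        return r+1;
    #elif defined(__GNUC__) || defined(__clang__)
        return 32 - __builtin_clz(val | 1);   // | 1 avoids clz(0), which is undefined
    #else
        int r = 1;
        while (val >>= 1) ++r;
        return r;
    #endif
}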

Now, if x is a constant that's 16-bit in size or smaller and you can pre-compute the bits required, I found the best results in speed and accuracy from this function:

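u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
    // nMaxBits is the headroom left for n once x is accounted for,
    // e.g. x = 2048 needs 12 bits, so nMaxBits = 20.
    //assert(nMaxBits == 32 - bitsRequired(x));
    //assert(n <= d);
    const int bitShift = (int)bitsRequired(n) - (int)nMaxBits;
    if( bitShift > 0 )
    {
        // Drop the low bits of n and d equally so x * n fits in 32 bits.
        n >>= bitShift;
        d >>= bitShift;
    }

    // Remove the + d/2 part if you don't need rounding
    return (x * n + d/2) / d;
}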

On the platform with the slow 64-bit divide, the above function ran ~16.75x as fast as return ((u64)x * n + d/2) / d; with an average accuracy of 99.999981% and a worst-case accuracy of 99.951172%. Accuracy here is the difference between the returned and expected values relative to the range of x; for example, returning +/-1 from the expected value when x is 2048 is 100 - (1/2048 * 100) = 99.95% accurate. I tested with a million or so randomized inputs, roughly half of which would normally have overflowed.
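As an illustration of the constant case, here is a self-contained sketch for x = 2048. The wrapper scale2048 and the comparison function scale2048Ref are hypothetical names, the u32/u64 typedefs are assumed, and bitsRequired uses only the portable fallback path.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;   /* assumed typedefs */
typedef uint64_t u64;

/* Portable fallback: bits needed to represent val (1 for val == 0). */
static u32 bitsRequired(u32 val)
{
    u32 r = 1;
    while (val >>= 1) ++r;
    return r;
}

/* Same logic as multConstByPropFrac above, with the asserts enabled. */
static u32 multConstByPropFrac(u32 x, u32 nMaxBits, u32 n, u32 d)
{
    assert(nMaxBits == 32 - bitsRequired(x));
    assert(n <= d && d != 0);
    const int bitShift = (int)bitsRequired(n) - (int)nMaxBits;
    if (bitShift > 0) {
        n >>= bitShift;
        d >>= bitShift;
    }
    return (x * n + d / 2) / d;
}

/* Hypothetical wrapper: 2048 needs 12 bits, leaving 32 - 12 = 20 for n. */
static u32 scale2048(u32 n, u32 d)
{
    return multConstByPropFrac(2048u, 20u, n, d);
}

/* 64-bit reference, used only to check the fast version. */
static u32 scale2048Ref(u32 n, u32 d)
{
    return (u32)(((u64)2048u * n + d / 2) / d);
}
```

With n = 3000000000 and d = 4000000000, both operands are shifted right by 12 bits before the multiply, and the fast version still returns 1536, matching the 64-bit reference.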

For the general use case, I found the best results from the following (and without needing to restrict N <= D to boot!):

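u32 scaleToFraction(u32 x, u32 n, u32 d)
{
    // Reduce x to at most 16 bits, remembering how far it was shifted.
    u32 bits = bitsRequired(x);
    int bitShift = (int)bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    int sh = bitShift;
    x >>= bitShift;

    // Reduce n the same way; its shift also has to be restored later.
    bits = bitsRequired(n);
    bitShift = (int)bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh += bitShift;
    n >>= bitShift;

    // Reduce d; a shifted-down divisor inflates the quotient,
    // so its shift is subtracted instead.
    bits = bitsRequired(d);
    bitShift = (int)bits - 16;
    if( bitShift < 0 ) bitShift = 0;
    sh -= bitShift;
    d >>= bitShift;

    // x and n now fit in 16 bits each, so x * n + d/2 fits in 32 bits.
    // Remove the + d/2 part if you don't need rounding
    u32 r = (x * n + d/2) / d;
    // Undo the net shift to restore the magnitude of the result.
    if( sh < 0 )
        r >>= (-sh);
    else //if( sh > 0 )
        r <<= sh;

    return r;
}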

On the platform with the slow 64-bit divide, the above function ran ~18.5x as fast as using 64-bit variables and with 99.999426% average and 99.947479% worst-case accuracy.
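To give a feel for where the precision goes, here is a compact restatement of the general function together with a 64-bit reference and an error helper. This is a sketch, not the benchmarked code: reduceTo16, scaleRef, and absError are hypothetical helpers, the typedefs are assumed, and it assumes d != 0 and a result that fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;   /* assumed typedefs */
typedef uint64_t u64;

/* Portable fallback: bits needed to represent val (1 for val == 0). */
static u32 bitsRequired(u32 val)
{
    u32 r = 1;
    while (val >>= 1) ++r;
    return r;
}

/* Shift *v right until it fits in 16 bits; return the shift applied. */
static int reduceTo16(u32 *v)
{
    int shift = (int)bitsRequired(*v) - 16;
    if (shift < 0) shift = 0;
    *v >>= shift;
    return shift;
}

/* Same steps as scaleToFraction above, with the repeated block factored out. */
static u32 scaleToFraction(u32 x, u32 n, u32 d)
{
    assert(d != 0);
    int sh = reduceTo16(&x);
    sh += reduceTo16(&n);
    sh -= reduceTo16(&d);
    u32 r = (x * n + d / 2) / d;   /* operands fit, so no 32-bit overflow */
    return (sh < 0) ? (r >> -sh) : (r << sh);
}

/* 64-bit reference, used only to measure the error. */
static u32 scaleRef(u32 x, u32 n, u32 d)
{
    assert(d != 0);
    return (u32)(((u64)x * n + d / 2) / d);
}

/* Absolute difference between a result and the expected value. */
static u32 absError(u32 got, u32 expected)
{
    return got > expected ? got - expected : expected - got;
}
```

For x = 1000000, n = 2000000, d = 3000000 the operands are shifted by 4, 5, and 6 bits respectively, and the fast result is 666664 against a reference of 666667. An error of 3 relative to an x of one million is consistent with the accuracy figures quoted above.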

I was able to get more speed or more accuracy by messing with the shifting, such as trying to not shift all the way down to 16-bit if it wasn't strictly necessary, but any increase in speed came at a high cost in accuracy and vice versa.

None of the other methods I tested came even close to the same speed or accuracy, most being slower than just using the 64-bit method or having huge loss in precision, so not worth going into.

Obviously, no guarantee that anyone else will get similar results on other platforms!

Edit: Replaced some bit-twiddling hacks with plain code that actually ran faster anyway by letting the compiler do its job.
