Efficient way to round double precision numbers to a lower precision given in number of bits


Problem description


In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Thus large numbers will change in absolute terms much more than small numbers, but both will change by roughly the same proportion. So if I want to round to 10 binary digits, I find the ten most significant bits and zero out all the lower bits, possibly adding a small number first so that the value rounds up.

I prefer "half-way" numbers be rounded up.

If it were an integer type, here would be a possible algorithm:

  1. Find: zero-based index of the most significant binary digit set, H.
  2. Compute: B = H - P,
       where P is the number of significant digits of precision to round to
       and B is the binary digit at which rounding starts, where B = 0 is the ones place,
       B = 1 is the twos place, etc. If B < 0, x already fits in P digits and is left unchanged.
  3. Add: x = x + 2^B
       This will force a carry if necessary (we round halfway values up).
  4. Zero out: x = x - (x mod 2^(B+1)).
       This clears the B place and all lower digits.
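
The integer steps above can be sketched in C (the language of the answer's code) roughly as follows. The names `highest_bit` and `round_sig_bits` are illustrative and not from the original post, and the bit scan is a plain loop rather than an optimized bit hack.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-based index of the highest set bit; x must be nonzero. */
static int highest_bit(uint64_t x)
{
    int h = 0;
    while (x >>= 1)
        h++;
    return h;
}

/* Round x to p significant binary digits, halfway cases rounded up.
   If the carry runs past bit 63 the result wraps; not handled in this sketch. */
uint64_t round_sig_bits(uint64_t x, int p)
{
    assert(p >= 1 && p <= 64);
    if (x == 0)
        return 0;
    int h = highest_bit(x);
    if (h < p)                                /* x already fits in p bits */
        return x;
    int b = h - p;                            /* highest bit position to discard */
    x += (uint64_t)1 << b;                    /* add half of the rounding unit 2^(b+1) */
    x &= ~(((uint64_t)1 << (b + 1)) - 1);     /* clear bit b and everything below */
    return x;
}
```

Called as `round_sig_bits(9, 3)`, this should give 10, matching the 3-digit table in the doc comment further down.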

The problem is finding an efficient way to find the highest bit set. If I were using integers, there are cool bit hacks to find the MSB. I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.
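
For a double, the position of the highest set bit is already stored in the exponent field, so no logarithm is needed. A minimal sketch in C, assuming a normal, finite, nonzero IEEE 754 binary64 value; `binary_exponent` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unbiased binary exponent e of x, i.e. 2^e <= |x| < 2^(e+1).
   Assumes x is normal and finite (not zero, subnormal, infinity or NaN). */
int binary_exponent(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            /* reinterpret the 64-bit pattern */
    int biased = (int)((bits >> 52) & 0x7FF);  /* 11-bit exponent field */
    assert(biased != 0 && biased != 0x7FF);    /* reject the unsupported cases */
    return biased - 1023;                      /* remove the IEEE 754 bias */
}
```

In C#, `BitConverter.DoubleToInt64Bits` returns the same 64-bit pattern without unsafe code, so the same shift and mask should carry over.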

Note: I have read this SO question:

What is a good way to round double-precision values to a (somewhat) lower precision?

It works for C++. I am using C#.

UPDATE:

This is the code (modified from what the answerer supplied) as I am using it:

/// <summary>
/// Round numbers to a specified number of significant binary digits.
/// 
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
/// 
///      8    1000 => 1000   8
///      9    1001 => 1010  10
///     10    1010 => 1010  10
///     11    1011 => 1100  12
///     12    1100 => 1100  12
///     13    1101 => 1110  14
///     14    1110 => 1110  14
///     15    1111 =>10000  16
///     16   10000 =>10000  16
///     
/// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}
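
The masking above truncates. To round halfway cases up in magnitude instead, one option is to add half a unit of the last kept bit to the raw bit pattern before masking; a carry out of the fraction then increments the exponent field, which yields the next power of two. This is only a sketch in C, assuming a normal, finite input, and `adjust_precision_half_up` is an illustrative name:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Keep `digits` significant bits of d, rounding halfway cases away from zero.
   Assumes d is a normal, finite double; subnormals and NaN are not handled. */
double adjust_precision_half_up(double d, int digits)
{
    assert(digits >= 1 && digits <= 53);
    int shift = 53 - digits;                   /* stored fraction bits to drop */
    if (shift == 0)
        return d;                              /* full precision requested */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits += (uint64_t)1 << (shift - 1);        /* add half a unit of the last kept bit */
    bits &= ~(((uint64_t)1 << shift) - 1);     /* clear the dropped bits */
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Because the sign bit is untouched, negative inputs are rounded by magnitude, so -9 with 3 digits becomes -10.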

UPDATE 2: Dekker's Algorithm

I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the closest value, instead of truncating as the above code does, and it uses only safe code:

private static double[] PowersOfTwoPlusOne;

static NumericalAlgorithms()
{
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case.
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}

public static double AdjustPrecisionSafely(this double d, int digits)
{
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}

UPDATE 3: TIMINGS

I ran a test and found that Dekker's algorithm (the safe version) is more than twice as fast as the unsafe version!

Number of calls in test: 100,000,000
Unsafe Time = 1.922 (sec)
Safe Time = 0.799 (sec)

Solution

Dekker’s algorithm will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s-b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g., the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    *x0 = d - (d - x);
    *x1 = x - *x0;
}
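
When b is only known at run time, the scale can be built exactly from an integer shift. The sketch below is a variant of the function above with b passed as a parameter (an assumption; the original uses a global Scale). The `volatile` intermediates are a precaution so that each step is rounded to double and the compiler cannot fuse the multiply and subtract or keep extra precision.

```c
#include <assert.h>
#include <stdint.h>

/* Split x into a high part with (53 - b) significant bits and a low remainder.
   Assumes IEEE 754 doubles, round-to-nearest, and no overflow in x * (2^b + 1). */
void split(double *x0, double *x1, double x, int b)
{
    assert(b >= 0 && b <= 52);
    double scale = (double)((uint64_t)1 << b); /* exactly 2^b */
    volatile double d = x * (scale + 1.0);     /* product rounded to double */
    volatile double t = d - x;                 /* rounded to double as well */
    *x0 = d - t;                               /* high part */
    *x1 = x - *x0;                             /* low part, discard if unwanted */
}
```

With b = 50 (3 significant bits kept), 9.0 splits into 8.0 and 1.0 because the tie goes to even, whereas the table in the question rounds 9 to 10.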
