Efficient way to round double precision numbers to a lower precision given in number of bits


Problem description

In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Large numbers will therefore change much more in absolute terms than small numbers, but by roughly the same proportion. So if I want to round to 10 binary digits, I find the ten most significant bits and zero out all the lower bits, possibly adding a small number first so that the value rounds up.

I would prefer that "half-way" numbers be rounded up.

If it were an integer type, here would be a possible algorithm:

  1. Find: H, the zero-based index of the most significant binary digit that is set.
  2. Compute: B = H - P, 
       where P is the number of significant digits of precision to round to
       and B is the binary digit at which rounding starts, where B = 0 is the ones place, 
       B = 1 is the twos place, etc. 
  3. Add: x = x + 2^B 
       This will force a carry if necessary (we round halfway values up).
  4. Zero out: x = x - (x mod 2^(B+1)). 
       This clears the B place and all lower digits.
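
The integer steps above can be sketched in C (the language of the answer's snippet further down). This is only an illustration: the names `highest_bit` and `round_to_bits` are made up here, step 4 is read as "clear bit B and everything below it", and the MSB search is a plain loop rather than one of the faster bit hacks.

```c
#include <stdint.h>

/* Zero-based index of the highest set bit (step 1). x must be nonzero. */
static int highest_bit(uint64_t x)
{
    int h = 0;
    while (x >>= 1)
        h++;
    return h;
}

/* Round x to p significant binary digits, halfway cases upward.
   Hypothetical helper; a carry out of bit 63 is not guarded against. */
static uint64_t round_to_bits(uint64_t x, int p)
{
    if (x == 0 || p < 1)
        return x;
    int b = highest_bit(x) - p;             /* step 2: first discarded bit */
    if (b < 0)
        return x;                           /* x already fits in p bits */
    x += (uint64_t)1 << b;                  /* step 3: add half a unit */
    x &= ~(((uint64_t)1 << (b + 1)) - 1);   /* step 4: clear bits b..0 */
    return x;
}
```

With p = 3 this turns 9 into 10, 11 into 12 and 15 into 16, and leaves 7 and 8 unchanged.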

The problem is finding an efficient way to find the highest bit set. If I were using integers, there are cool bit hacks to find the MSB. I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.
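
For a double, the position of the highest set bit is already stored in the exponent field, so no logarithm is needed to find it. A minimal C sketch, assuming IEEE 754 binary64 and a normal (nonzero, finite, not subnormal) input; the function name `leading_bit_index` is invented for this illustration:

```c
#include <stdint.h>
#include <string.h>

/* Power-of-two index of the leading significand bit of a normal double.
   Zero, subnormals, infinities and NaN are not handled in this sketch. */
static int leading_bit_index(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* reinterpret the 64 bits */
    return (int)((bits >> 52) & 0x7FF) - 1023;   /* drop sign, remove bias */
}
```

For 9.0 (binary 1001) this gives 3, the same H that step 1 of the integer algorithm would find.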

Note: I have read this SO question:

What is a good way to round double-precision values to a (somewhat) lower precision?

It works for C++. I am using C#.

UPDATE:

This is the code (modified from what the answerer supplied) as I am using it:

/// <summary>
/// Round numbers to a specified number of significant binary digits.
/// 
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
/// 
///      8    1000 => 1000   8
///      9    1001 => 1010  10
///     10    1010 => 1010  10
///     11    1011 => 1100  12
///     12    1100 => 1100  12
///     13    1101 => 1110  14
///     14    1110 => 1110  14
///     15    1111 =>10000  16
///     16   10000 =>10000  16
///     
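/// This differs from ordinary rounding in that we specify the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// Note: the implementation below truncates the discarded bits instead of rounding them; the table shows the intended rounding.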
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}
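
The masking approach can be made to round instead of truncate by adding half of the last kept unit to the raw bit pattern before masking; a carry out of the stored significand then bumps the exponent, which is the correct result. The following C sketch is not the code that was timed below. It assumes IEEE 754 binary64, and `round_bits_half_up` is a made-up name. Ties go away from zero (the sign bit is untouched), and NaN and subnormal inputs are not treated specially.

```c
#include <stdint.h>
#include <string.h>

/* Keep `digits` significant bits of d, rounding halfway cases away from zero.
   Sketch only: values with digits outside 1..52 are returned unchanged. */
static double round_bits_half_up(double d, int digits)
{
    if (digits < 1 || digits > 52)
        return d;
    int shift = 53 - digits;                 /* number of stored bits to drop */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* double -> raw bits */
    bits += (uint64_t)1 << (shift - 1);      /* add half of the last kept unit */
    bits &= ~(((uint64_t)1 << shift) - 1);   /* clear the dropped bits */
    memcpy(&d, &bits, sizeof d);             /* raw bits -> double */
    return d;
}
```

With digits = 3 this reproduces the table in the doc comment: 9 becomes 10, 11 becomes 12, 15 becomes 16.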

UPDATE 2: Dekker's Algorithm

I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the closest value, instead of truncating as the above code does, and it uses only safe code:

private static double[] PowersOfTwoPlusOne;

static NumericalAlgorithms()
{
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case.
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}

public static double AdjustPrecisionSafely(this double d, int digits)
{
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}

UPDATE 3: TIMINGS

I ran a test and found that Dekker's algorithm is more than twice as fast as the unsafe version:

Number of calls in test: 100,000,000
Unsafe Time = 1.922 (sec)
Safe Time = 0.799 (sec)

Solution

Dekker’s algorithm will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s-b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g., the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;
    *x1 = x - *x0;
}
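
When b is only known at run time, one way to produce 2^b is the standard `ldexp` function from `<math.h>`. The sketch below wraps the high part of Split in a function that takes the number of bits to keep; `round_to_sig_bits` is a hypothetical name, and the same assumptions apply as above (round-to-nearest mode, no overflow, each operation rounded to double precision).

```c
#include <math.h>

/* Keep `digits` significant bits of x using the high part of Dekker's split.
   Ties are rounded to even. Sketch only: digits outside 1..52 return x as is. */
static double round_to_sig_bits(double x, int digits)
{
    if (digits < 1 || digits > 52)
        return x;
    double scale = ldexp(1.0, 53 - digits);  /* 2^b, b = number of bits dropped */
    double d = x * (scale + 1.0);
    double t = d - x;
    return d - t;                            /* same value as *x0 in Split */
}
```

Because ties go to even, 9 rounded to 3 bits gives 8 and 11 gives 12, whereas a non-tie such as 9.5 goes to the nearer value, 10.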
