Why do certain floating point calculations turn the way they do? (e.g. 123456789f + 1 = 123456792)


Question


I'm trying to get a better understanding of floating point arithmetic, the attending errors that occur and accrue, as well as why exactly the results turn out the way they do. Here are 3 Examples in particular I'm currently working on:

1.) 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 - 1.0 = -1.1102230246251565E-16, i.e. adding 0.1 ten times gives me a number slightly less than 1.0. However, 0.1 is represented (as a double) as slightly larger than 0.1. Also, 0.1*3 is slightly larger than 0.3, but 0.1*8 is slightly smaller than 0.8.

2.) 123456789f + 1 = 123456792 and 123456789f + 4 = 123456800.

What's up with those results? It's all still a bit mysterious to me.

Solution

Typical modern processors and programming languages use IEEE-754 arithmetic (more or less) with 32-bit binary floating-point for float and 64-bit binary floating-point for double. In double, a 53-bit significand is used. This means that, when a decimal numeral is converted to double, it is converted to some number s•f•2^e, where s is a sign (+1 or −1), f is an unsigned integer that can be represented in 53 bits, and e is an integer between −1074 and 971, inclusive. (Or, if the number being converted is too large, the result can be +infinity or −infinity.) (Those who know the floating-point format may complain that the exponent is properly between −1022 and 1023, but I have shifted the significand to make it an integer. I am describing the mathematical value, not the encoding.)

Converting .1 to double yields 3602879701896397 / 36028797018963968, because, of all the numbers in the required form, that one is closest to .1. The denominator is 2^55, so e is −55.
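The conversion can be checked directly. The sketch below assumes Python, whose `float` is an IEEE-754 double on typical platforms; the variable names are only illustrative.

```python
from fractions import Fraction

# Exact value stored for the literal 0.1, as a reduced integer ratio.
num, den = (0.1).as_integer_ratio()
print(num, den)   # 3602879701896397 36028797018963968

print(den == 2**55)   # True: the denominator is a power of two, so e is -55
print(num < 2**53)    # True: the numerator fits in 53 bits

# The stored value is slightly larger than the decimal 1/10.
print(Fraction(0.1) > Fraction(1, 10))   # True
```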

When we add two of these, we get 7205759403792794 / 36028797018963968. That is fine; the numerator is still less than 2^53, so it fits in the format.

When we add a third 3602879701896397 / 36028797018963968, the mathematical result is 10808639105689191 / 36028797018963968. Unfortunately, the numerator is too large; it is larger than 2^53 (9007199254740992). So the floating-point hardware cannot return that number. It has to make it fit somehow.

If we divide the numerator and the denominator by two, we have 5404319552844595.5 / 18014398509481984. This has the same value, but the numerator is not an integer. To make it fit, the hardware rounds it to an integer. When the fraction is exactly 1/2, the rule is to round to make the result even, so the hardware returns 5404319552844596 / 18014398509481984.
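This rounding step can be reproduced with exact rational arithmetic. The following is a minimal sketch, assuming Python floats are IEEE-754 doubles; `exact` and `computed` are names chosen here for illustration.

```python
from fractions import Fraction

tenth = Fraction(0.1)                  # exact value of the double nearest 0.1
exact = 3 * tenth                      # exact mathematical sum, no rounding
computed = Fraction(0.1 + 0.1 + 0.1)   # what double arithmetic returns

# The exact numerator does not fit in 53 bits.
print(exact.numerator > 2**53)   # True

# Numerators when both values are written over the denominator 2^54.
print(exact * 2**54)      # 10808639105689191/2, i.e. 5404319552844595.5
print(computed * 2**54)   # 5404319552844596, the tie rounded to even
```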

Next, we take the current sum, 5404319552844596 / 18014398509481984, and add 3602879701896397 / 36028797018963968 again. This time, the sum is 7205759403792794.5 / 18014398509481984. In this case, the hardware rounds down, returning 7205759403792794 / 18014398509481984.

Then we add 7205759403792794 / 18014398509481984 and 3602879701896397 / 36028797018963968, and the sum is 9007199254740992.5 / 18014398509481984. Note that the numerator not only has a fraction but is larger than 2^53. So we have to reduce it again, which produces 4503599627370496.25 / 9007199254740992. Rounding the numerator to an integer produces 4503599627370496 / 9007199254740992.

That is exactly 1/2. At this point, the rounding errors have coincidentally canceled; adding .1 five times yields exactly .5.

When we add 4503599627370496 / 9007199254740992 and 3602879701896397 / 36028797018963968, the result is exactly 5404319552844595.25 / 9007199254740992. The hardware rounds down and returns 5404319552844595 / 9007199254740992.

Now you can see we are going to round down repeatedly. To add 3602879701896397 / 36028797018963968 to the accumulating sum, the hardware has to divide its numerator by four to make it match. That means the fraction part is always going to be .25, and it will be rounded down. So the next four sums are also rounded down. We end up with 9007199254740991 / 9007199254740992, which is just less than 1.
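The whole accumulation can be traced step by step. This sketch assumes Python floats are IEEE-754 doubles and uses an explicit loop, because the built-in `sum` may use a compensated algorithm in recent Python versions and would hide the effect. The names `partials` and `error` are illustrative.

```python
from fractions import Fraction

tenth = Fraction(0.1)   # exact value of the double nearest 0.1
total = 0.0
partials = []

for i in range(1, 11):
    exact = Fraction(total) + tenth   # exact sum before rounding
    total = total + 0.1               # rounded sum in double arithmetic
    error = Fraction(total) - exact   # positive: rounded up, negative: rounded down
    partials.append(total)
    print(i, repr(total), "rounding error:", float(error))

print(partials[4] == 0.5)   # True: after five additions the errors have canceled
print(total - 1.0)          # -1.1102230246251565e-16
```

From the sixth addition on, every printed rounding error is negative, which matches the repeated rounding down described above.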

With float instead of double, the numerator has to fit in 24 bits, so it has to be less than 2^24 (16777216). So 123456789 is too big even before any arithmetic is done. It has to be expressed as 15432099 • 2^3, which is 123456792. The exact mathematical result of adding 1 is 15432099.125 • 2^3, and rounding that significand to an integer yields 15432099 • 2^3, so there is no change. But, if you add four, the result is 15432099.5 • 2^3, and that rounds to 15432100 • 2^3.
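The float case can be emulated in Python, which has no native 32-bit float type. The helper `to_f32` below is a hypothetical name introduced here; it rounds a double to the nearest 32-bit float by packing it with the `struct` module. Because these small sums are exact in double, rounding them once to 32 bits should give the same result as a true float addition.

```python
import struct

def to_f32(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]

start = to_f32(123456789.0)
print(start)       # 123456792.0: already rounded before any arithmetic
print(start / 8)   # 15432099.0: the 24-bit significand, with exponent 3

print(to_f32(start + 1))   # 123456792.0: 15432099.125 rounds back down
print(to_f32(start + 4))   # 123456800.0: 15432099.5 is a tie, rounds to even
```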
