How to Calculate Double + Float Precision

Problem description

I have been trying to find out how to calculate the float/double precision/range numbers -3.402823e38 .. 3.402823e38 and -1.79769313486232e308 .. 1.79769313486232e308.

For int32 you would take 2^32 = 4294967296, divide by 2, and get a range of -2147483648 to 2147483647. So how do I figure out the precision numbers for float and double? I think I am searching for the wrong terms, since nothing is coming up anywhere.

Solution

Well, both types actually look like the following:

[sign] [exponent] [mantissa]

representing a number in the following form:

[sign] 1.[mantissa] × 2^[exponent]

with the size of the exponent and mantissa varying. For float the exponent is eight bits wide, while double has an eleven-bit exponent. Furthermore, the exponent is stored unsigned with a bias, which is 127 for float and 1023 for double. This results in an exponent range of −126 through 127 for float and −1022 through 1023 for double.
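To make the layout concrete, here is a minimal Python sketch that splits a value into the three fields. It assumes the usual IEEE 754 layouts (1 + 8 + 23 bits for float, 1 + 11 + 52 bits for double); the helper names `decode_float32` and `decode_float64` are made up for this illustration.

```python
import struct

def decode_float32(x):
    """Round x to single precision and return (sign, stored exponent, mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                   # 1 bit
    stored_exponent = (bits >> 23) & 0xFF   # 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF          # 23 explicitly stored bits
    return sign, stored_exponent, mantissa

def decode_float64(x):
    """Return (sign, stored exponent, mantissa bits) of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    stored_exponent = (bits >> 52) & 0x7FF  # 11 bits, still biased by 1023
    mantissa = bits & ((1 << 52) - 1)   # 52 explicitly stored bits
    return sign, stored_exponent, mantissa

# -6.5 is -1.625 x 2^2, so the unbiased exponent should come out as 2.
sign, stored, mantissa = decode_float32(-6.5)
print(sign, stored - 127, bin(mantissa))
```

Subtracting the bias (127 or 1023) from the stored field gives the actual power of two. The all-zeros and all-ones exponent patterns are reserved for special values, which is why the usable range stops at −126 and 127 for float.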

The exponent is a power of two, so calculating 2^127 gives about 1.7 × 10^38, which gets you in the approximate range of the float maximum value. Similarly for double, 2^1023 gives about 9 × 10^307.

Obviously those numbers are not exactly the ones we expect. This is where the mantissa comes into play. The mantissa represents a normalized binary number that always begins with "1." (that's the normalized part). The rest is simply the digits after the dot. Since the maximum mantissa is then roughly 1.111111111... in binary, which is almost 2, we'll get approximately 3.4 × 10^38 as float's maximum value and 1.79 × 10^308 as the maximum value for double.
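The two steps above (power of two first, then the "almost 2" mantissa) can be checked with a few lines of Python. This is only a sketch: Python's own float is a double, so the single-precision figures are computed in double precision, which happens to be exact here. The constant names are made up for the example.

```python
import sys

# Largest exponents for normalized numbers, as given in the answer.
FLOAT_MAX_EXP = 127
DOUBLE_MAX_EXP = 1023

# The power of two alone only gives the rough magnitude.
float_power = 2.0 ** FLOAT_MAX_EXP
double_power = 2.0 ** DOUBLE_MAX_EXP

# Largest mantissa: 1.111...1 in binary, i.e. 2 minus the weight of the
# last stored bit (2^-23 for float, 2^-52 for double).
float_mantissa = 2 - 2.0 ** -23
double_mantissa = 2 - 2.0 ** -52

float_max = float_mantissa * float_power
double_max = double_mantissa * double_power

print(f"float:  {float_power:.3e} -> {float_max:.3e}")
print(f"double: {double_power:.3e} -> {double_max:.3e}")
print(double_max == sys.float_info.max)
```

The last line compares the result with the maximum that Python itself reports for its double-based float type.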

[EDIT 2011-01-06] As Mark points out below (and below the question), the exact formula is the following:

2^(2^(e − 1)) × (1 − 2^(−p))

where e is the number of bits in the exponent and p is the number of bits in the mantissa, including the aforementioned implicit bit (due to normalization). The formula replicates what we have seen above, only now accurately. The first factor, 2^(2^(e − 1)), is two raised to the maximum exponent, multiplied by two (we save the two in the second factor then). The second factor is the largest number we can represent below one. I said above that the number is almost two. Since we exaggerated the exponent by a factor of two in this formula, we need to account for that and now have a number that is almost one. I hope it's not too confusing.

In any case, for float (with e = 8 and p = 24) we get the exact value 340282346638528859811704183484516925440 or roughly 3.4 × 10^38. double then yields (with e = 11 and p = 53) 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368 or roughly 1.80 × 10^308.

[/EDIT]
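The exact values can be reproduced with Python's arbitrary-precision integers. The sketch below assumes the formula has the form 2^(2^(e − 1)) × (1 − 2^(−p)), which is how the description of its two factors reads, and it uses e = 11 for double to match the eleven-bit exponent mentioned earlier. The function name `max_finite` is made up for the example.

```python
import sys

def max_finite(e, p):
    """Largest finite value for e exponent bits and p mantissa bits
    (p counts the implicit leading bit).

    2^(2^(e-1)) * (1 - 2^-p) is expanded to 2^top - 2^(top - p)
    so that everything stays in exact integer arithmetic.
    """
    top = 2 ** (e - 1)
    return 2 ** top - 2 ** (top - p)

float_max = max_finite(8, 24)     # single precision
double_max = max_finite(11, 53)   # double precision

print(float_max)
print(double_max)

# Python's float is a double, so its reported maximum should match exactly.
print(double_max == int(sys.float_info.max))
```

Working with integers avoids any rounding in the check itself; converting `sys.float_info.max` with `int()` is exact because the value is a whole number.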

Another thing: You're bringing up the term "precision" in your question but you quote the ranges of the types. Precision is quite a different thing and refers to how many significant digits the type can retain. Again, the answer here lies in the mantissa, which is 23 and 52 bits for float and double, respectively. Since the numbers are stored normalized we actually have an implicit bit added to that, which puts us at 24 and 53 bits. Now, the way digits after the decimal (or binary here) point work is the following:

 1.   1     0     1     1
 ↑    ↑     ↑     ↑     ↑
2^0  2^-1  2^-2  2^-3  2^-4
 =    =     =     =     =
 1   0.5   0.25  0.125 0.0625

So the very last digit in the double mantissa represents a value of 2^−52, or roughly 2.2 × 10^−16. If the exponent is 0 (so the scale factor 2^0 is 1), this is the smallest value we can add to the number, which places the double precision at around 16 decimal digits. Likewise for float, with roughly seven digits.
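A short Python sketch of the precision argument, assuming Python's float is an IEEE 754 double (which `sys.float_info` lets you confirm). The digit counts use the rule of thumb that n binary digits carry about n × log10(2) decimal digits.

```python
import math
import sys

# Weight of the last mantissa bit when the number lies in [1, 2).
double_ulp = 2.0 ** -52
print(double_ulp)                          # roughly 2.2e-16
print(double_ulp == sys.float_info.epsilon)

# Adding that step to 1.0 is visible; adding half of it is rounded away.
print(1.0 + double_ulp > 1.0)
print(1.0 + double_ulp / 2 == 1.0)

# Decimal digits carried by 53 and 24 significant bits.
double_digits = 53 * math.log10(2)
float_digits = 24 * math.log10(2)
print(f"double: about {double_digits:.1f} digits")
print(f"float:  about {float_digits:.1f} digits")
```

The step size scales with the exponent, so the absolute spacing between neighbouring values grows for larger numbers while the number of significant digits stays the same.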
