float limits


Question


Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?

Doing the math, the mantissa for floats is 24 bits = 2^24-1 max value
= 16,777,215.0f. Any 8-digit odd number greater than that will be
rounded off.
For doubles, the mantissa is 53 bits = 2^53-1 max value =
9,007,199,254,740,991.0l (that's an L). So 16-digit odd numbers
greater than that will be rounded off. To get the actual precision we
take log(base 10) of those numbers and get 7.22 and 15.95
respectively.

...so floats have greater than 7 digits of precision and doubles only
greater than 15 digits. So how does MS guarantee no rounding errors
for 15-digit doubles yet only 6-digit floats? (If I understand correctly,
the last digit of precision must be used to round off the number; the
numbers are not just truncated at 7 and 15 digits.)

Anything I'm missing for the doubles case? It looks like they should
be guaranteeing 14 digits.

Recommended answer

> Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?


> Doing the math, the mantissa for floats is 24 bits = 2^24-1 max value
> = 16,777,215.0f. Anything 8-digit odd # greater than that will be
> rounded off.

I don't think you get to count the "hidden 1" bit that actually
is not stored in the number. The maximum mantissa is 2**24-1.
The minimum mantissa, without changing the exponent, is 2**23.
That's 2**23 combinations.

> For doubles, the mantissa is 53 bits = 2^53-1 max value =
> 9,007,199,254,740,991.0l (that's an L). So 16 digit odd numbers
> greater than that will be rounded off. To get the actual precision we
> take log(base 10) of those numbers and get 7.22 and 15.95
> respectively.

I think you should subtract .30 (log base 10 of 2, one bit) from
each of those, giving 6.92 and 15.65, respectively.

> ...floats have greater than 7 digits precision and doubles only
> greater than 15 digits. So how does MS guarantee no rounding errors
> for 15 digit doubles yet 6 digit floats (if I understand correctly,
> the last digit of precision must be used to round off the number...the
> numbers are not just truncated at 7 & 15 digits...)

It's not just Microsoft: FreeBSD has the same values for the i386
platform. And I believe both are correct.

> Anything I'm missing for the doubles case? It looks like they should
> be guaranteeing 14 digits.







ANSI C gives formulas for the constants in <float.h>.
If b = FLT_RADIX (the base) and p = FLT_MANT_DIG (digits in that base), then

FLT_DIG = floor((p-1)*log10(b) ) + (1 if b is a power of 10, 0 otherwise).

Gordon L. Burditt


On 25 Aug 2004 17:06:31 -0700, zi****@gmail.com (ziller) wrote in
comp.lang.c:
> Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?

Because that is what the implementation documents that it provides, as
required by the C standard. FLT_DIG and DBL_DIG are required to be at
least 6 and 10 respectively.

> Doing the math, the mantissa for floats is 24 bits = 2^24-1 max value
> = 16,777,215.0f. Anything 8-digit odd # greater than that will be
> rounded off.
> For doubles, the mantissa is 53 bits = 2^53-1 max value =
> 9,007,199,254,740,991.0l (that's an L). So 16 digit odd numbers
> greater than that will be rounded off. To get the actual precision we
> take log(base 10) of those numbers and get 7.22 and 15.95
> respectively.
>
> ...floats have greater than 7 digits precision and doubles only
> greater than 15 digits. So how does MS guarantee no rounding errors
> for 15 digit doubles yet 6 digit floats (if I understand correctly,
> the last digit of precision must be used to round off the number...the
> numbers are not just truncated at 7 & 15 digits...)
>
> Anything I'm missing for the doubles case? It looks like they should
> be guaranteeing 14 digits.







What you are missing is that the C standard imposes no requirements
for "no rounding errors". In fact rounding errors are guaranteed in
almost all floating point operations.

The definition of those terms is spelled out clearly in the C standard,
and it says nothing at all about rounding errors. Basically, these
values represent the largest number of decimal digits that can be
fully represented in the floating point type.

If FLT_DIG is 6, that means that any integral value in the range of
-999,999 to +999,999 can be placed into a float and then into a large
enough integer type and the result will be exactly the same as the
original number.

If DBL_DIG is 15, that means any integral value in the range
-999,999,999,999,999 to 999,999,999,999,999 can be placed into a
double and then into a large enough integer type (if one exists) and
the result will be exactly the same as the original value.

Nowhere is there any mention of rounding at all.

If I assume that you mean Microsoft's 32-bit x86 implementations, you
have some errors in your calculations. Not the calculations
themselves, but your assumptions about the number of mantissa bits in
the Intel FPU single and double precision types, which are 23 and 52
respectively, not 24 and 53.

Which results in ranges of 8,388,608 and 4,503,599,627,370,496
respectively. There are 7-digit decimal numbers outside the range of
magnitude for the former, and 16-digit numbers for the latter.

<off-topic>

If you want to understand the actual format of Intel floating point
representations, you can download the documentation for free from
http://developer.intel.com. If you do, don't bother looking at the 80
bit extended precision format. Microsoft has decided that you aren't
qualified to use that format at the expense of "compatibility" among
Windows versions on various processors.

Here's a quote from Microsoft:

With the 16-bit Microsoft C/C++ compilers, long doubles are stored as
80-bit (10-byte) data types. Under Windows NT, in order to be
compatible with other non-Intel floating point implementations, the
80-bit long double format is aliased to the 64-bit (8-byte) double
format.

The complete web page may be found at:

http://support.microsoft.com/default...b;en-us;129209

</off-topic>

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html


ziller <zi****@gmail.com> wrote:

> Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?







Because of roundoff error. The definition of FLT_DIG requires that
*any* representable number with that many decimal digits can be rounded
into a float and back again without changing the value. Unless floats
are stored in base 10 (or a power of 10), there are roundoff errors on
both conversions that compound in the worst case. Thus, the C Standard
says the correct formula to use in the non-decimal case is:

floor((p-1)*log10(b))

where p is the precision and b is the base. For base 2 with 24 and 53
bits of precision, that yields 6 and 15 respectively.

-Larry Jones

There's a connection here, I just know it. -- Calvin

