What precision are floating-point arithmetic operations done in?

Question

Consider the two very simple multiplications below:

double result1;
long double result2;
float var1=3.1;
float var2=6.789;
double var3=87.45;
double var4=234.987;

result1=var1*var2;
result2=var3*var4;

Are multiplications done by default in a higher precision than their operands? That is, is the first multiplication done in double precision, and is the second one, on the x86 architecture, done in 80-bit extended precision? Or should we cast the operands in the expressions to the higher precision ourselves, as below?

result1=(double)var1*(double)var2;
result2=(long double)var3*(long double)var4;

What about other operations (addition, division and remainder)? For example, when adding more than two positive single-precision values, the extra significant bits of double precision can decrease round-off errors if they are used to hold the intermediate results of the expression.
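To make the summation case concrete, here is a minimal sketch. The function names sum_in_float and sum_in_double are made up for illustration, and the comments assume that float is IEEE 754 binary32 and that FLT_EVAL_METHOD is 0.

```cpp
#include <vector>

// Accumulate in float: every partial sum is rounded to single precision.
float sum_in_float(const std::vector<float>& values) {
    float acc = 0.0f;
    for (float x : values) acc += x;
    return acc;
}

// Accumulate in double: partial sums keep the extra significant bits.
double sum_in_double(const std::vector<float>& values) {
    double acc = 0.0;
    for (float x : values) acc += x;  // each float is converted to double first
    return acc;
}

// Near 1e8 adjacent float values are 8 apart, so with the input
// {1e8f, 1.0f, 1.0f, 1.0f, 1.0f} the float accumulator never moves off 1e8,
// while the double accumulator holds 100000004 exactly.
```

Whether the float version really rounds after every addition depends on the evaluation method discussed in the answer below.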

Recommended answer

Precision of floating-point computations

C++11 incorporates the definition of FLT_EVAL_METHOD from C99, in the header <cfloat>.


FLT_EVAL_METHOD     

Possible values:
-1 undetermined
 0 evaluate just to the range and precision of the type
 1 evaluate float and double as double, and long double as long double.
 2 evaluate all as long double 
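To find out which case applies to a given compiler, one can inspect the macro directly. The following is a minimal sketch; describe_eval_method is a hypothetical helper written for this illustration, and its wording paraphrases the list above.

```cpp
#include <cfloat>
#include <string>

// Map a FLT_EVAL_METHOD value to a short description of what it means.
std::string describe_eval_method(int method) {
    switch (method) {
        case -1: return "undetermined";
        case 0:  return "each operation evaluated in the type of its operands";
        case 1:  return "float and double evaluated as double";
        case 2:  return "everything evaluated as long double";
        default: return "implementation-defined";  // other negative values
    }
}

// Description of the method used by the compiler that built this code.
std::string current_eval_method() {
    return describe_eval_method(FLT_EVAL_METHOD);
}
```

Printing current_eval_method() from a test program is enough to tell which of the cases discussed below applies. The value can change with compiler options, so it is worth checking with the flags actually used for the build.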

If your compiler defines FLT_EVAL_METHOD as 2, then the computations of r1 and r2, and of s1 and s2 below are respectively equivalent:

double var3 = …;
double var4 = …;

double r1 = var3 * var4;
double r2 = (long double)var3 * (long double)var4;

long double s1 = var3 * var4;
long double s2 = (long double)var3 * (long double)var4;

If your compiler defines FLT_EVAL_METHOD as 2, then in all four computations above, the multiplication is done at the precision of the long double type.

However, if the compiler defines FLT_EVAL_METHOD as 0 or 1, r1 and r2, and respectively s1 and s2, aren't always the same. The multiplications when computing r1 and s1 are done at the precision of double. The multiplications when computing r2 and s2 are done at the precision of long double.

If you are computing results that are destined to be stored in a wider result type than the type of the operands, as are result1 and result2 in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:

result2=(long double)var3*(long double)var4;

Without this conversion (if you write var3 * var4), if the compiler's definition of FLT_EVAL_METHOD is 0 or 1, the product will be computed in the precision of double, which is a shame, since it is destined to be stored in a long double.

If the compiler defines FLT_EVAL_METHOD as 2, then the conversions in (long double)var3*(long double)var4 are not necessary, but they do not hurt either: the expression means exactly the same thing with and without them.

Paradoxically, for a single operation, rounding only once to the target precision is best. The only effect of computing a single multiplication in extended precision is that the result is rounded first to extended precision and then to double precision, which makes it less accurate. In other words, with FLT_EVAL_METHOD 0 or 1, the result r2 above is sometimes less accurate than r1 because of double rounding and, if the compiler uses IEEE 754 floating-point, it is never more accurate.

The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses FLT_EVAL_METHOD == 2. This question and its accepted answer show that, when the intermediate computations are done in 80-bit extended precision for binary64 IEEE 754 arguments and results, the interpolation formula u2 * (1.0 - u1) + u1 * u3 always yields a result between u2 and u3 for u1 between 0 and 1. This property may not hold when the intermediate computations are done in binary64 precision, because the rounding errors are then larger.
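The two ways of evaluating the interpolation formula can be written side by side. This is a minimal sketch with hypothetical function names. The extended variant only helps where long double is actually wider than double, and the bound quoted above is stated for the 80-bit format specifically.

```cpp
// Interpolation with the intermediates in the operands' type
// (binary64 when FLT_EVAL_METHOD is 0 or 1).
double lerp_double(double u1, double u2, double u3) {
    return u2 * (1.0 - u1) + u1 * u3;
}

// The same formula with every intermediate explicitly held in long double;
// the result is rounded to double once, at the very end.
double lerp_extended(double u1, double u2, double u3) {
    long double t = u1;
    long double lo = u2;
    long double hi = u3;
    long double result = lo * (1.0L - t) + t * hi;
    return static_cast<double>(result);
}
```

With FLT_EVAL_METHOD == 2 the compiler already evaluates lerp_double the way lerp_extended does, so the explicit conversions matter only when the macro is 0 or 1.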
