Simple floating-point numbers lose precision

Problem description

I'm using Delphi XE2 Update 3. There are precision issues with even the simplest of floating-point numbers (like 3.7). Given this code (a 32-bit console app):

program Project1;

{$APPTYPE CONSOLE}
{$R *.res}

uses System.SysUtils;

var s: Single; d: Double; x: Extended;
begin
  Write('Size of Single  -----  ');  Writeln(SizeOf(Single));
  Write('Size of Double  -----  ');  Writeln(SizeOf(Double));
  Write('Size of Extended  ---  ');  Writeln(SizeOf(Extended));  Writeln;

  s := 3.7;  d := 3.7;  x := 3.7;

  Write('"s" is ');                  Writeln(s);
  Write('"d" is ');                  Writeln(d);
  Write('"x" is ');                  Writeln(x);                 Writeln;

  Writeln('Single Comparison');
  Write('"s > 3.7"  is  ');          Writeln(s > 3.7);
  Write('"s = 3.7"  is  ');          Writeln(s = 3.7);
  Write('"s < 3.7"  is  ');          Writeln(s < 3.7);           Writeln;

  Writeln('Double Comparison');
  Write('"d > 3.7"  is  ');          Writeln(d > 3.7);
  Write('"d = 3.7"  is  ');          Writeln(d = 3.7);
  Write('"d < 3.7"  is  ');          Writeln(d < 3.7);           Writeln;

  Writeln('Extended Comparison');
  Write('"x > 3.7"  is  ');          Writeln(x > 3.7);
  Write('"x = 3.7"  is  ');          Writeln(x = 3.7);
  Write('"x < 3.7"  is  ');          Writeln(x < 3.7);           Readln;
end.

I get this output:

Size of Single  -----  4
Size of Double  -----  8
Size of Extended  ---  10

"s" is  3.70000004768372E+0000
"d" is  3.70000000000000E+0000
"x" is  3.70000000000000E+0000

Single Comparison
"s > 3.7"  is  TRUE
"s = 3.7"  is  FALSE
"s < 3.7"  is  FALSE

Double Comparison
"d > 3.7"  is  TRUE
"d = 3.7"  is  FALSE
"d < 3.7"  is  FALSE

Extended Comparison
"x > 3.7"  is  FALSE
"x = 3.7"  is  TRUE
"x < 3.7"  is  FALSE

You can see that extended is the only type that evaluates correctly. I thought precision was only an issue with a complex floating-point number like 3.14159265358979323846, not with something as simple as 3.7. The issue with single kind of makes sense. But why doesn't double work?

Solution

Required reading: What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg.

The issue is not one of precision. Rather, the issue is one of representability. First of all, let us recap that floating point numbers are used to represent real numbers. There is an infinite quantity of real numbers. Of course, the same can be said of integers. But the difference here is that within a particular range there is a finite number of integers but an infinite number of real numbers. Indeed, as was originally shown by Cantor, any finite interval of real numbers contains uncountably many real values.

So it is clear that we cannot represent all real numbers on a finite machine. So, which numbers can we represent? Well, that depends on the data type. Delphi floating point data types use binary representation. The single (32 bit) and double (64 bit) types adhere to the IEEE-754 standard. The extended (80 bit) type is an Intel specific type. In binary floating point a representable number has the form k·2^n, where k and n are integers. Note that I am not claiming that all numbers of this form are representable. That is not possible because there is an infinite quantity of such numbers. Rather, my point is that all representable numbers are of this form.

Some examples of representable binary floating point numbers include: 1, 0.5, 0.25, 0.75, 1.25, 0.125, 0.375. Your value, 3.7, is not representable as a binary floating point value.
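The k·2^n form can be checked directly. The following is a minimal sketch in Python rather than Delphi, chosen only because its standard `fractions` module can show the exact value stored in a double. The helper name `is_power_of_two` is made up for this illustration. A value of the form k·2^n with negative n is a fraction whose reduced denominator is a power of two, which is what the sketch tests.

```python
from fractions import Fraction


def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and n & (n - 1) == 0


# Fraction(float) converts the stored double exactly, with no rounding.
for value in (1.0, 0.5, 0.25, 0.75, 1.25, 0.125, 0.375):
    exact = Fraction(value)
    # Each of these is k / 2**m, so the reduced denominator is a power of two.
    assert is_power_of_two(exact.denominator)
    print(value, "=", exact)

stored = Fraction(3.7)      # the double nearest to 3.7
wanted = Fraction(37, 10)   # the real number 3.7

print(is_power_of_two(wanted.denominator))  # False: 10 has a factor of 5
print(stored == wanted)                     # False: the double is not 3.7
```

The real number 3.7 is 37/10, and no power of two is divisible by 5, so 3.7 cannot be written as k·2^n at any precision.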

What this means in relation to your code is that none of it is doing what you expect it to do. You are hoping to compare against the value 3.7. But instead you are comparing against the nearest exactly representable value to 3.7. As a matter of implementation detail, this nearest exactly representable value is determined in the context of extended precision, which is why it appears that the version using extended does what you expect. However, do not take this to mean that your variable x is equal to 3.7. In fact it is equal to the nearest representable extended precision value to 3.7.
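The same effect can be reproduced outside Delphi. This is a rough analogue in Python, not a translation of the program above: Python floats are doubles and the literal 3.7 is rounded to double, whereas in the 32-bit Delphi program the literal is handled at extended precision. So here only the single shows a mismatch. The helper `to_single` is a name invented for this sketch.

```python
import struct


def to_single(value: float) -> float:
    """Round a double to the nearest single, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


s = to_single(3.7)  # plays the role of the Single variable
d = 3.7             # Python floats are doubles

# Mixed precision: the single rounding and the double rounding differ.
print(s == d)               # False
print(s > d)                # True

# Same precision on both sides: the two roundings agree.
print(s == to_single(3.7))  # True
print(d == 3.7)             # True
```

Equality holds only when both sides were rounded at the same precision. Neither side is ever the real number 3.7.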

Rob Kennedy's most useful website can show you the closest representable values to a specific number. In the case of 3.7 these are:

3.7 = + 3.70000 00000 00000 00004 33680 86899 42017 73602 98112 03479 76684 57031 25
3.7 = + 3.70000 00000 00000 17763 56839 40025 04646 77810 66894 53125
3.7 = + 3.70000 00476 83715 82031 25

These are presented in the order extended, double, single. In other words these are the values of your variables x, d and s respectively.

If you look at these values, and compare the single and double values with the closest extended value to 3.7, you will see why your program produces the output that it does. Both the single and double precision values here are greater than the extended value, which is what your program told you.
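The ordering of the three values can be recomputed with exact rational arithmetic. This is a sketch in Python under stated assumptions: the helper `nearest` is hypothetical, it only handles values between 2 and 4 (which covers 3.7), and it uses the significand widths of the three formats (24 bits for single, 53 for double, 64 for the x87 extended type).

```python
from fractions import Fraction


def nearest(value: Fraction, significand_bits: int) -> Fraction:
    """Nearest binary value with the given significand width.

    Assumes 2 <= value < 4, so the leading bit weighs 2 and the last
    significand bit weighs 2**(2 - significand_bits).
    """
    scale = 2 ** (significand_bits - 2)
    return Fraction(round(value * scale), scale)


target = Fraction(37, 10)   # the real number 3.7
x = nearest(target, 64)     # extended
d = nearest(target, 53)     # double
s = nearest(target, 24)     # single

# Show how far each stored value lies above the real number 3.7.
for name, v in (("x", x), ("d", d), ("s", s)):
    print(name, "- 3.7 =", float(v - target))

print(s > x, d > x)  # True True
```

All three stored values lie slightly above 3.7, with the extended value closest and the single value furthest away, which matches the TRUE results for `s > 3.7` and `d > 3.7` in the program output.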

I don't want to make any blanket recommendations as to how to compare floating point values. The best way to do that always depends very critically on the specific problem. No blanket advice can be usefully given.
