Floating point accuracy and order of operations


Problem description

I'm writing a unit test for a class of 3D vector objects and its algebra (dot product, cross product, etc.) and just observed a behavior I can somehow understand, but not to its full extent.

What I actually do is generate two pseudorandom vectors, b and c, and a pseudorandom scalar, s, and subsequently check the results of different operations on those vectors.

b's components are generated in the range [-1, 1], while c's components are generated in the range [-1e6, 1e6], since in my use case I'll encounter similar situations, which could cause a significant loss of information in the mantissa. s is generated in the range [-1, 1] as well.

I created an MWE in Python (using numpy) just to expose my question better (but I'm actually coding in C++, and the question in itself is language-agnostic):

b = np.array([0.4383006177615909, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.26386435505, 159388.59511391702, -720098.47337336652])
s = -0.19796489160874975

Then I define

d = s*np.cross(b,c)
e = np.cross(b,c)

and finally compute

In [7]: np.dot(d,c)
Out[7]: -1.9073486328125e-06

In [8]: np.dot(e,c)
Out[8]: 0.0

In [9]: s*np.dot(e,c)
Out[9]: -0.0

Since d and e are both perpendicular to b and c, the scalar products computed above should all give 0 (algebraically).

Now, it's clear to me that on a real computer this can only be achieved within the limits of floating point arithmetic. I would however like to better understand how this error arises.

What actually surprised me a bit is the poor accuracy of the first of the three results.

I'll try to expose my thoughts in the following:

  • np.cross(b, c) is basically [b[1]*c[2]-b[2]*c[1], b[2]*c[0]-b[0]*c[2], ...], which involves multiplying a large number by a small one and then subtracting. e (the cross product b x c) itself keeps relatively large components, i.e. array([-76475.97678585, 215845.00681978, 66695.77300175]).
  • So, to get d you still multiply pretty large components once by a number < 1. This will of course lead to some truncation error.
  • When taking the dot product e . c the result is correct, while d . c is off by almost 2e-6. Can this last multiplication by s lead to such a big difference? A naïve thought would be that, given my machine epsilon of 2.22045e-16 and the magnitude of the components of d, the error should be around 4e-11.
  • Is information in the mantissa lost in the subtractions performed inside the cross product?
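To see that the float64 result really is pure rounding noise, one can redo the same computation in exact rational arithmetic on the very same binary64 inputs; a minimal sketch using Python's fractions module and the values above:

```python
import numpy as np
from fractions import Fraction

# The values from the question.
b = np.array([0.4383006177615909, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.26386435505, 159388.59511391702, -720098.47337336652])
s = -0.19796489160874975

d = s * np.cross(b, c)
err_float = float(np.dot(d, c))  # algebraically zero, numerically not

# Exact rational arithmetic on the same binary64 inputs: no rounding at
# any step, so the identity s * (b x c) . c == 0 holds exactly.
bf = [Fraction(float(x)) for x in b]
cf = [Fraction(float(x)) for x in c]
sf = Fraction(s)
cross = [bf[1] * cf[2] - bf[2] * cf[1],
         bf[2] * cf[0] - bf[0] * cf[2],
         bf[0] * cf[1] - bf[1] * cf[0]]
err_exact = sum(sf * x * y for x, y in zip(cross, cf))

print(err_float)  # small but nonzero rounding error
print(err_exact)  # exactly 0
```

So the nonzero result is entirely an artifact of rounding during evaluation, not of the inputs themselves.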

To check that last thought, I did:

In [10]: d = np.cross(s*b,c)                                                    

In [11]: np.dot(d,c)                                                            
Out[11]: 0.0

In [12]: d = np.cross(b,s*c)                                                    

In [13]: np.dot(d,c)                                                            
Out[13]: 0.0

And it indeed appears that in the subtraction I lose much more information. Is that correct? How can it be explained in terms of floating point approximation?

Also, does that mean that, regardless of the input (i.e., no matter whether the two vectors are of similar magnitude or completely different), it is better to always perform first all the operations involving multiplication (and division?), and then those involving addition/subtraction?

Answer

The big loss of information most likely happens in the dot product, not in the cross product. In the cross product, the results you get are still close to the order of magnitude of the entries in c. That means you may have lost around one digit of precision, but the relative error should still be around 10^-15. (In a subtraction a-b, the relative error of the operands is amplified by a factor of roughly (|a|+|b|)/|a-b|.)

The dot product is the only operation involving a subtraction of two numbers that are very close to each other. This leads to an enormous increase in the relative error, because the amplification factor (|a|+|b|)/|a-b| blows up when the difference is ~0.
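The cancellation in the dot product is easy to make visible by looking at the three products it sums; a small sketch using the question's values:

```python
import numpy as np

# The values from the question.
b = np.array([0.4383006177615909, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.26386435505, 159388.59511391702, -720098.47337336652])
s = -0.19796489160874975

d = s * np.cross(b, c)
terms = d * c          # the three products that np.dot(d, c) adds up
total = terms.sum()

# Each term has magnitude ~1e9-1e10, yet the sum is ~1e-6: essentially
# all significant digits cancel, leaving only rounding noise.
print(terms, total)
```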

Now, on to your example: the error you get (~10^-6) is actually what you would expect considering the quantities you have: c, e and d have magnitudes of ~10^5, which means the absolute error is around 10^-11 at best. I don't account for s because it is basically of order 1.

The absolute error when you multiply a*b is approximately |a|*|err_b| + |b|*|err_a| (worst case, where the errors don't cancel out). Now, in the dot product you multiply two quantities of magnitude ~10^5, so the error should be in the range of 10^5*10^-11 + 10^5*10^-11 = 2*10^-6 (multiplied by 3, because you do this three times, once per component).
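This back-of-the-envelope budget can be checked numerically. Multiplying the summed magnitudes of the products by the machine epsilon gives a heuristic estimate (not a rigorous bound) of the same order as the observed error:

```python
import numpy as np

# The values from the question.
b = np.array([0.4383006177615909, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.26386435505, 159388.59511391702, -720098.47337336652])
s = -0.19796489160874975

d = s * np.cross(b, c)
eps = np.finfo(np.float64).eps  # ~2.22e-16

# Heuristic budget: each product d[i]*c[i] (~1e10) carries rounding
# error of roughly |d[i]*c[i]| * eps, accumulated over the three terms.
estimate = float(np.abs(d * c).sum() * eps)
actual = float(abs(np.dot(d, c)))
print(estimate, actual)  # both of order 1e-6
```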

Then, if 10^-6 is the expected error, how can I explain your results? Well, you were lucky: using these values (I changed b[0] and c[0])

b = np.array([0.4231830061776159, -0.017762134447941058, 0.56005552104818945])
c = np.array([-178151.28386435505, 159388.59511391702, -720098.47337336652])
s = -0.19796489160874975

I get (in order):

-1.9073486328125e-06
7.62939453125e-06
-1.5103522614192943e-06

-1.9073486328125e-06
-1.9073486328125e-06

Also, when you look at the relative error, it's doing a pretty good job:

In [10]: np.dot(d,c)
Out[10]: -1.9073486328125e-06

In [11]: np.dot(d,c) / (np.linalg.norm(e)*np.linalg.norm(c))
Out[11]: -1.1025045691772927e-17

Regarding the order of operations, I don't think it matters much, as long as you are not subtracting two numbers that are very close to each other. If you do need to subtract two very close numbers, I guess it would be better to do that at the end (so it doesn't contaminate everything else), but don't quote me on that.
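A minimal scalar illustration (not from the question, just a standard textbook example) of how regrouping an algebraically identical expression moves the damage done by absorption and cancellation:

```python
# Two algebraically identical expressions; where the near-cancelling
# subtraction happens decides whether the small term survives.
# 2.0**53 is the first float64 whose spacing (ulp) exceeds 1.
a = 2.0 ** 53
x = (a + 1.0) - a  # 1.0 is absorbed into a before the subtraction
y = 1.0 + (a - a)  # the big terms cancel first, so 1.0 survives
print(x, y)  # 0.0 1.0
```

Both expressions equal 1 in exact arithmetic; in float64 only the second grouping gives 1.0.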

