Rules-of-thumb for minimising floating-point errors in C?


Problem description


Regarding minimising the error in floating-point operations, if I have an operation such as the following in C:

float a = 123.456;
float b = 456.789;
float r = 0.12345;
a = a - (r * b);

Will the result of the calculation change if I split the multiplication and subtraction steps out, i.e.:

float c = r * b;
a = a - c;

I am wondering whether a CPU would then treat these calculations differently and thereby the error may be smaller in one case?

If not, which I presume anyway, are there any good rules-of-thumb to mitigate against floating-point error? Can I massage data in a way that will help?

Please don't just say "use higher precision" - that's not what I'm after.

EDIT

For information about the data, in the general sense errors seem to be worse when the operation results in a very large number like 123456789. Small numbers, such as 1.23456789, seem to yield more accurate results after operations. Am I imagining this, or would scaling larger numbers help accuracy?

Solution

Note: this answer starts with a lengthy discussion of the distinction between a = a - (r * b); and float c = r * b; a = a - c; with a C99-compliant compiler. The part of the question about the goal of improving accuracy while avoiding extended precision is covered at the end.

Extended floating-point precision for intermediate results

If your C99 compiler defines FLT_EVAL_METHOD as 0, then the two computations can be expected to produce exactly the same result. If the compiler defines FLT_EVAL_METHOD to 1 or 2, then a = a - (r * b); will be more precise for some values of a, r and b, because all intermediate computations will be done at an extended precision (double for the value 1 and long double for the value 2).

The program cannot set FLT_EVAL_METHOD, but you can use command-line options to change the way your compiler computes with floating-point, and that will make it change its definition accordingly.
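As an illustrative example of such options (GCC-specific flags shown here; other compilers use different spellings, and prog.c is a placeholder for your source file):

```shell
# x87 floating point: intermediates are kept in 80-bit registers,
# so FLT_EVAL_METHOD is typically 2
gcc -m32 -mfpmath=387 prog.c

# SSE floating point: a float operation produces a float,
# so FLT_EVAL_METHOD is typically 0
gcc -mfpmath=sse -msse2 prog.c
```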

Contraction of some intermediate results

Depending on whether you use #pragma STDC FP_CONTRACT in your program and on your compiler's default value for this pragma, some compound floating-point expressions can be contracted into single instructions that behave as if the intermediate result was computed with infinite precision. This happens to be a possibility for your example when targeting a modern processor, as the fused multiply-add instruction will compute a directly and as accurately as allowed by the floating-point type.

However, you should bear in mind that contraction only takes place at the compiler's option, without any guarantees. The compiler uses the FMA instruction to optimize speed, not accuracy, so the transformation may not take place at lower optimization levels. Sometimes several transformations are possible (e.g. a * b + c * d can be computed either as fmaf(c, d, a*b) or as fmaf(a, b, c*d)) and the compiler may choose one or the other.

In short, the contraction of floating-point computations is not intended to help you achieve accuracy. You might as well make sure it is disabled if you like reproducible results.

However, in the particular case of the fused multiply-add compound operation, you can use the C99 standard function fmaf() to tell the compiler to compute the multiplication and addition in a single step with a single rounding. If you do this, then the compiler will not be allowed to produce anything other than the best result for a.


     float fmaf(float x, float y, float z);

DESCRIPTION
     The fma() functions compute (x*y)+z, rounded as one ternary operation:
     they compute the value (as if) to infinite precision and round once to
     the result format, according to the current rounding mode.

Note that if the FMA instruction is not available, your compiler's implementation of the function fmaf() will at best just use higher precision. If this happens on your compilation platform, you might just as well use the type double for the accumulator: it will be faster and more accurate than using fmaf(). In the worst case, a flawed implementation of fmaf() will be provided.

Improving accuracy while only using single-precision

Use Kahan summation if your computation involves a long chain of additions. Some accuracy can be gained by simply summing the r*b terms computed as single-precision products, assuming there are many of them. If you wish to gain more accuracy, you might want to compute r*b itself exactly as the sum of two single-precision numbers, but if you do this you might as well switch to double-single arithmetic entirely. Double-single arithmetic would be the same as the double-double technique succinctly described here, but with single-precision numbers instead.
