如何避免单元测试中的浮点舍入错误? [英] How to avoid floating point round off error in unit tests?

查看:79
本文介绍了如何避免单元测试中的浮点舍入错误?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试为一些简单的矢量数学函数编写单元测试,这些函数可对单精度浮点数的数组进行操作.这些函数使用SSE内在函数,并且在32位系统(测试通过64位)上运行时,我得到了误报(至少在我看来).随着操作遍历数组,我累积了越来越多的舍入错误.这是单元测试代码和输出的片段(我的实际问题如下):

I'm trying to write unit tests for some simple vector math functions that operate on arrays of single precision floating point numbers. The functions use SSE intrinsics and I'm getting false positives (at least I think) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round off error. Here is a snippet of unit test code and output (my actual question(s) follow):

测试设置:

static const int N = 1024;
static const float MSCALAR = 42.42f;

static void setup(void) {
    input = _mm_malloc(sizeof(*input) * N, 16);
    ainput = _mm_malloc(sizeof(*ainput) * N, 16);
    output = _mm_malloc(sizeof(*output) * N, 16);
    expected = _mm_malloc(sizeof(*expected) * N, 16);

    memset(output, 0, sizeof(*output) * N);

    for (int i = 0; i < N; i++) {
        input[i] = i * 0.4f;
        ainput[i] = i * 2.1f;
        expected[i] = (input[i] * MSCALAR) + ainput[i];
    }
}

然后,我的主要测试代码调用要测试的函数(进行与生成expected数组相同的计算),并对照上面生成的expected数组检查其输出.该检查是针对紧密度(在0.0001内)而不是相等的.

My main test code then calls the function to be tested (which does the same calculation used to generate the expected array) and checks its output against the expected array generated above. The check is for closeness (within 0.0001) not equality.

示例输出:

0.000000    0.000000    delta: 0.000000
44.419998   44.419998   delta: 0.000000
...snip 100 or so lines...
2043.319946 2043.319946 delta: 0.000000
2087.739746 2087.739990 delta: 0.000244
...snip 100 or so lines...
4086.639893 4086.639893 delta: 0.000000
4131.059570 4131.060059 delta: 0.000488
4175.479492 4175.479980 delta: 0.000488
...etc, etc...

我知道我有两个问题:

  1. 在32位计算机上,387和SSE浮点算术单元之间的差异.我相信387将更多位用于中间值.
  2. 我用于生成期望值的42.42值的非精确表示.

所以我的问题是,为浮点数据的数学运算编写有意义的便携式单元测试的正确方法是什么?

So my question is, what is the proper way to write meaningful and portable unit tests for math operations on floating point data?

*通过便携式,我的意思是应该同时传递32位和64位体系结构.

*By portable I mean should pass on both 32 and 64 bit architectures.

推荐答案

通过注释,我们看到要测试的函数本质上是:

Per a comment, we see that the function being tested is essentially:

for (int i = 0; i < N; ++i)
    D[i] = A[i] * b + C[i];

其中A[i]bC[i]D[i]都具有类型float.当引用单个迭代的数据时,对于A[i]C[i]D[i],我将使用acd.

where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].

下面是对在测试此功能时可以用于容错的内容的分析.不过,我首先要指出的是,我们可以设计测试,以确保没有错误.我们可以选择A[i]bC[i]D[i]的值,以便最终结果和中间结果的所有结果都可以精确表示,并且没有舍入误差.显然,这不会测试浮点算法,但这不是目标.目的是测试功能代码:它是否执行计算所需功能的指令?只需选择将揭示使用正确数据,添加,乘法或存储到正确位置失败的任何值,就足以揭示函数中的错误.我们相信硬件可以正确执行浮点运算,并且不会对其进行测试;我们只想测试该函数是否正确编写.为此,例如,可以将b设置为2的幂,将A[i]设置为各种小整数,并且将C[i]设置为各种小整数乘以b.如果需要,我可以更精确地详细说明这些值的限制.这样所有的结果都是准确的,而允许比较的公差将消失.

Below is an analysis of what we could use for an error tolerance when testing this function. First, though, I want to point out that we can design the test so that there is no error. We can choose the values of A[i], b, C[i], and D[i] so that all the results, both final and intermediate results, are exactly representable and there is no rounding error. Obviously, this will not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: Does it execute instructions that compute the desired function? Simply choosing values that would reveal any failures to use the right data, to add, to multiply, or to store to the right location will suffice to reveal bugs in the function. We trust that the hardware performs floating-point correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, A[i] to various small integers, and C[i] to various small integers multiplied by b. I could detail limits on these values more precisely if desired. Then all results would be exact, and any need to allow for a tolerance in comparison would vanish.

此外,让我们继续进行错误分析.

That aside, let us proceed to error analysis.

目标是发现功能实现中的错误.为此,我们可以忽略浮点运算中的小错误,因为我们正在寻找的各种错误几乎总是会导致大错误:使用错误的操作,使用错误的数据或结果没有存储在所需位置,因此实际结果几乎总是与预期结果有很大差异.

The goal is to find bugs in the implementation of the function. To do this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are seeking almost always cause large errors: The wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always very different from the expected result.

现在的问题是,我们应该容忍多少错误?由于错误通常会导致较大的错误,因此我们可以将公差设置得很高.但是,在浮点数中,高"仍然是相对的.与数万亿的值相比,一百万的误差很小,但是当输入值在数万亿的值中时,发现误差就太大了.因此,我们应该至少进行一些分析来确定级别.

Now the question is how much error should we tolerate? Because bugs will generally cause large errors, we can set the tolerance quite high. However, in floating-point, "high" is still relative; an error of one million is small compared to values in the trillions, but it is too high to discover errors when the input values are in the ones. So we ought to do at least some analysis to decide the level.

要测试的功能将使用SSE内在函数.这意味着它将对上面循环中的每个i执行浮点乘法和浮点加法,或者将执行融合的浮点乘法加法.后者中的潜在错误是前者的一个子集,因此我将使用前者. a*b+c的浮点运算会进行一些舍入,以便它们计算出大约a•b + c的结果(解释为精确的数学表达式,而不是浮点).我们可以写出一些误差e0和e1的精确值计算为(a•b•(1+e0)+c)•(1+e1),这些误差的大小最大为2 -24 ,条件是所有值都在浮点格式的正常范围内. (2 -24 是在IEEE-754 32位二进制浮点格式的舍入到最近模式下任何正确舍入的基本浮点运算中可能发生的最大相对误差.在最近舍入模式下,数学值最多更改有效位中最低有效位的值的一半,该值比最高有效位低23位.)

The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2-24, provided all the values are in the normal range of the floating-point format. (2-24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)

接下来,我们考虑测试程序为其预期值产生的值.它使用C代码d = a*b + c;. (我已将问题中的长名称转换为短名称.)理想情况下,这还将在IEEE-754 32位二进制浮点中计算乘和加.如果是这样,则结果将与被测试的功能相同,并且不需要在比较中留出任何公差.但是,C标准允许实现在执行浮点算术时有一定的灵活性,并且有些不一致的实现要比标准允许更多的自由.

Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also calculate a multiply and an add in IEEE-754 32-bit binary floating-point. If it did, then the result would be identical to the function being tested, and there would be no need to allow for any tolerance in comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.

一个常见的行为是要比其名义类型更精确地计算表达式.某些编译器可能使用doublelong double算术来计算a*b + c. C标准要求将结果转换为强制类型转换或赋值形式;必须放弃额外的精度.如果C实现使用的是额外的精度,则计算将继续:a*b的计算具有额外的精度,因此精确地产生a•b,因为doublelong double具有足够的精度以精确表示任意两个值.然后,C实现可能会将结果四舍五入到float.这不太可能,但无论如何我都允许.但是,我也忽略了它,因为它使预期结果更接近被测试函数的结果,并且我们只需要知道可能发生的最大错误即可.因此,在更糟(更遥远)的情况下,我将继续说,到目前为止的结果是a•b.然后添加c,得出一些e2的(a•b + c)•(1 + e2)的大小最大为2 -53 (在64-位二进制格式).最后,将此值转换为float以分配给d,对于某些幅度最大为2 -24的e3产生(a•b + c)•(1 + e2)•(1 + e3) .

A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2-53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2-24.

现在,我们有了由正确的操作函数(a•b•(1 + e0)+ c)•(1 + e1)计算出的精确结果的表达式,以及由测试代码计算出的精确结果的表达式( a•b + c)•(1 + e2)•(1 + e3),我们可以计算出它们可以相差多少的界限.简单的代数告诉我们确切的差是a•b•(e0 + e1 + e0•e1-e2-e3-e2•e3)+ c•(e1-e2-e3-e2•e3).这是e0,e1,e2和e3的简单函数,我们可以看到其极端情况出现在e0,e1,e2和e3的电势值的端点处.由于值的符号的可能性之间的相互作用,因此会带来一些复杂性,但是在最坏的情况下,我们可以简单地允许一些额外的错误.差异最大幅度的界为| a•b |•(3•2 -24 +2 -53 +2 -48 )+ | c |•(2•2 -24 +2 -53 +2 -77 ).

Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1-e2-e3-e2•e3)+c•(e1-e2-e3-e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2-24+2-53+2-48)+|c|•(2•2-24+2-53+2-77).

因为我们有足够的空间,所以我们可以简化它,只要我们朝着增大值的方向进行即可.例如,使用| a•b |•3.001•2 -24 + | c |•2.001•2 -24 可能会很方便.该表达式应足以在检测几乎所有实现错误的同时舍入浮点计算.

Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2-24+|c|•2.001•2-24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.

请注意,表达式与最终值a*b+c不成正比,最终值由被测试函数或测试程序计算得出.这通常意味着使用相对于被测函数或测试程序计算出的最终值的容差进行的测试是错误的.测试的正确形式应该是这样的:

Note that the expression is not proportional to the final value, a*b+c, as calculated either by the function being tested or by the test program. This means that, in general, tests using a tolerance relative to the final values calculated by the function being tested or by the test program are wrong. The proper form of a test should be something like this:

double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
   // Report error here.

总而言之,这给我们带来的容忍度要大于由于浮点舍入而引起的任何可能的差异,因此,它绝对不能给我们带来误报(报告测试功能未损坏的情况).但是,与由我们要检测的错误引起的错误相比,它很小,因此它很少会给我们带来假阴性(无法报告实际错误).

In summary, this gives us a tolerance that is larger than any possible differences due to floating-point rounding, so it should never give us a false positive (report the test function is broken when it is not). However, it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (fail to report an actual bug).

(请注意,还有一些舍入误差计算公差,但是它们小于我在系数中使用.001所允许的斜率,因此我们可以忽略它们.)

(Note that there are also rounding errors computing the tolerance, but they are smaller than the slop I have allowed for in using .001 in the coefficients, so we can ignore them.)

(还请注意,! (difference < tolerance)不等同于difference >= tolerance.如果该函数由于错误而产生NaN,则任何比较都会得出false:difference < tolerancedifference >= tolerance都得出false,但是! (difference < tolerance)会得出正确的结果.)

(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN, due to a bug, any comparison yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)

这篇关于如何避免单元测试中的浮点舍入错误?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆