SSE版本算法中累积的计算误差的平方差之和 [英] An accumulated computing error in SSE version of algorithm of the sum of squared differences

查看:438
本文介绍了SSE版本算法中累积的计算误差的平方差之和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图优化以下代码(两个数组的平方差之和):

I was trying to optimize following code (sum of squared differences for two arrays):

inline float Square(float value)
{
    return value*value;
}

float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
    float sum = 0;
    for(size_t i = 0; i < size; ++i)
        sum += Square(a[i] - b[i]);
    return sum;
}

所以我使用CPU的SSE指令进行优化:

So I performed optimization with using of SSE instructions of CPU:

inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
{
    __m128 _a = _mm_loadu_ps(a + i);
    __m128 _b = _mm_loadu_ps(b + i);
    __m128 _d = _mm_sub_ps(_a, _b);
    sum = _mm_add_ps(sum, _mm_mul_ps(_d, _d));
}

inline float ExtractSum(__m128 a)
{
    float _a[4];
    _mm_storeu_ps(_a, a);
    return _a[0] + _a[1] + _a[2] + _a[3];
}

float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
    size_t i = 0, alignedSize = size/4*4;
    __m128 sums = _mm_setzero_ps();
    for(; i < alignedSize; i += 4)
        SquaredDifferenceSum(a, b, i, sums);
    float sum = ExtractSum(sums);
    for(; i < size; ++i)
        sum += Square(a[i] - b[i]);
    return sum;
}

如果数组的大小不太大,
但是如果大小足够大,则在基函数和其优化版本给出的结果之间存在大的计算误差。
因此我有一个问题:这里的是在SSE优化的代码中的错误,这导致计算错误。

This code works fine if the size of the arrays is not too large. But if the size is big enough then there is a large computing error between results given by base function and its optimized version. And so I have a question: Where is here a bug in SSE optimized code, which leads to the computing error.

推荐答案

错误遵循有限精度浮点数。
每两个浮点数的相加都有一个与它们之间的差值成正比的计算误差。
在你的标量版本的算法中,结果的总和比每个项(如果数组的大小足够大)。
因此它导致大计算错误的累积。

The error follows from finite precision floating point numbers. Each addition of two floating point numbers is has an computing error proportional to difference between them. In your scalar version of algorithm the resulting sum is much greater then each term (if size of arrays is big enough of course). So it leads to accumulation of big computing error.

在SSE版本的算法中,实际上有四个结果累加和。并且这些和和每个项之间的差相对于标量码在四倍中较小。
这样会导致较小的计算错误。

In the SSE version of algorithm actually there is four sums for results accumulation. And difference between these sums and each term is lesser in four times relative to scalar code. So this leads to the lesser computing error.

有两种方法可以解决此错误:

There are two ways to solve this error:

1)使用双精度的浮点数来累加和。

1) Using of floating point numbers of double precision for accumulating sum.

2)使用Kahan求和算法(也称为补偿求和)通过添加有限精度浮点数序列获得的总数中的数值误差,与明显的方法相比。

2) Using of the the Kahan summation algorithm (also known as compensated summation) which significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach.

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

使用Kahan求和算法,您的标量代码将如下所示:

With using of Kahan summation algorithm your scalar code will look like:

inline void KahanSum(float value, float & sum, float & correction)
{
    float term = value - correction;
    float temp = sum + term;
    correction = (temp - sum) - term;
    sum = temp; 
}

float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
    float sum = 0, correction = 0;
    for(size_t i = 0; i < size; ++i)
        KahanSum(Square(a[i] - b[i]), sum, correction);
    return sum;
}

并且SSE优化代码如下:

And SSE optimized code will look as follow:

inline void SquaredDifferenceKahanSum(const float * a, const float * b, size_t i, 
                                      __m128 & sum, __m128 & correction)
{
    __m128 _a = _mm_loadu_ps(a + i);
    __m128 _b = _mm_loadu_ps(b + i);
    __m128 _d = _mm_sub_ps(_a, _b);
    __m128 term = _mm_sub_ps(_mm_mul_ps(_d, _d), correction);
    __m128 temp = _mm_add_ps(sum, term);
    correction = _mm_sub_ps(_mm_sub_ps(temp, sum), term);
    sum = temp; 
}

float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
    size_t i = 0, alignedSize = size/4*4;
    __m128 sums = _mm_setzero_ps(), corrections = _mm_setzero_ps();
    for(; i < alignedSize; i += 4)
        SquaredDifferenceKahanSum(a, b, i, sums, corrections);
    float sum = ExtractSum(sums), correction = 0;
    for(; i < size; ++i)
        KahanSum(Square(a[i] - b[i]), sum, correction);
    return sum;
}

这篇关于SSE版本算法中累积的计算误差的平方差之和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆