Differences in calculation accuracy between single-threaded and multi-threaded (OpenMP) modes

Question

Can anybody explain the difference between the calculation results in single-threaded and multi-threaded mode?

Here is an example, an approximate calculation of pi:

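#include <cstdio>   // printf
#include <cstdlib>  // system
#include <ctime>    // clock, clock_t
#include <cmath>    // std::sqrt
#include <ppl.h>    // Concurrency::parallel_for, Concurrency::combinable (MSVC only)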

const int itera(1000000000);

int main()
{
    printf("PI calculation \nconst int itera = 1000000000\n\n");

    clock_t start, stop;

    //Single thread
    start = clock();
    double summ_single(0);
    for (int n = 1; n < itera; n++)
    {
        summ_single += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
    };
    stop = clock();
    printf("Time single thread             %f\n", (double)(stop - start) / CLOCKS_PER_SEC);


    //Multithread with OMP
    //Activate OMP in Project settings, C++, Language
    start = clock();
    double summ_omp(0);
#pragma omp parallel for reduction(+:summ_omp)
    for (int n = 1; n < itera; n++)
    {
        summ_omp += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
    };
    stop = clock();
    printf("Time OMP parallel              %f\n", (double)(stop - start) / CLOCKS_PER_SEC);


    //Multithread with Concurrency::parallel_for
    start = clock();
    Concurrency::combinable<double> piParts;
    Concurrency::parallel_for(1, itera, [&piParts](int n)
    {
        piParts.local() += 6.0 / (static_cast<double>(n)* static_cast<double>(n)); 
    }); 

    double summ_Conparall(0);
    piParts.combine_each([&summ_Conparall](double locali)
    {
        summ_Conparall += locali;
    });
    stop = clock();
    printf("Time Concurrency::parallel_for %f\n", (double)(stop - start) / CLOCKS_PER_SEC);

    printf("\n");
    printf("pi single = %15.12f\n", std::sqrt(summ_single));
    printf("pi omp    = %15.12f\n", std::sqrt(summ_omp));
    printf("pi comb   = %15.12f\n", std::sqrt(summ_Conparall));
    printf("\n");

    system("PAUSE");

}

Results:

PI calculation VS2010 Win32
Time single thread 5.330000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 11.103000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592651497


PI calculation VS2013 Win32
Time single thread 5.200000
Time OMP parallel 1.291000
Time Concurrency::parallel_for 7.413000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592647841


PI calculation VS2010 x64
Time single thread 5.190000
Time OMP parallel 1.036000
Time Concurrency::parallel_for 7.120000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592649319


PI calculation VS2013 x64
Time single thread 5.230000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 5.326000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592648489

The tests were run on AMD and Intel CPUs under Windows 7 x64.

What is the reason for the difference between the pi calculation in single-threaded and multi-threaded mode? And why is the result of the calculation with Concurrency::parallel_for not constant across different builds (compiler version, 32/64-bit platform)?

P.S. Visual Studio Express doesn't support OpenMP.

Recommended answer

Floating-point addition is not associative because of round-off errors, so the order of operations matters. It is normal for a parallel program to give a different result than the serial version, and understanding and dealing with that is part of the art of writing (portable) parallel code. The effect is more pronounced when comparing 32-bit and 64-bit builds: in 32-bit mode the VS compiler uses x87 instructions, and the x87 FPU performs all operations with an internal precision of 80 bits, whereas in 64-bit mode SSE math is used.
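As a small illustration of the non-associativity described above, the following sketch adds the same three values with two different groupings. The function names are made up for this example. On a build that evaluates in plain 64-bit double arithmetic (for example SSE2), the first grouping should yield 1 and the second 0; a build that keeps intermediates in 80-bit x87 registers may behave differently, which is the 32-bit versus 64-bit effect mentioned above.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: groups the additions as (a + b) + c.
double sum_left(double a, double b, double c) { return (a + b) + c; }

// Hypothetical helper: groups the additions as a + (b + c).
double sum_right(double a, double b, double c) { return a + (b + c); }

// Prints both groupings for values chosen so that c is lost when it is
// added to b first: near 1e16 neighbouring doubles are 2 apart, so
// -1e16 + 1.0 rounds back to -1e16.
void show_grouping_effect() {
    const double a = 1e16, b = -1e16, c = 1.0;
    assert(a + b == 0.0);  // exact cancellation, independent of grouping
    std::printf("(a + b) + c = %.17g\n", sum_left(a, b, c));
    std::printf("a + (b + c) = %.17g\n", sum_right(a, b, c));
}
```

A parallel reduction changes the grouping of the additions in the same way, only over many more terms.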

In the serial case, one thread computes s_1 + s_2 + ... + s_N, where N is the number of terms in the expansion.

In the OpenMP case there are n partial sums, where n is the number of OpenMP threads. Which terms go into each partial sum depends on how the iterations are distributed among the threads. The default for many OpenMP implementations is static scheduling, which means that thread 0 (the main thread) computes ps_0 = s_1 + s_2 + ... + s_(N/n); thread 1 computes ps_1 = s_(N/n+1) + s_(N/n+2) + ... + s_(2N/n); and so on. At the end, the reduction combines those partial sums in some implementation-specific way.
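The static distribution described above can be imitated on a single thread, which makes the change in summation order easy to inspect. This is only a sketch under assumptions: the function names are invented, the block boundaries and the order in which the partial sums are combined are one plausible choice, and a real OpenMP runtime is free to do either differently.

```cpp
#include <cassert>
#include <vector>

// One term of the series from the question: 6 / n^2.
double term(int n) {
    return 6.0 / (static_cast<double>(n) * static_cast<double>(n));
}

// Serial order: s_1 + s_2 + ... + s_N accumulated in one running sum.
double serial_sum(int N) {
    double sum = 0.0;
    for (int n = 1; n <= N; ++n) sum += term(n);
    return sum;
}

// Imitates a static schedule: 1..N is cut into `threads` contiguous blocks,
// each block is accumulated into its own partial sum, and the partial sums
// are then added in thread order (an assumed combination order).
double blocked_sum(int N, int threads) {
    assert(threads > 0 && N >= threads);
    std::vector<double> partial(threads, 0.0);
    const int block = N / threads;
    for (int t = 0; t < threads; ++t) {
        const int first = 1 + t * block;
        // The last block absorbs any remainder of the division.
        const int last = (t == threads - 1) ? N : (t + 1) * block;
        for (int n = first; n <= last; ++n) partial[t] += term(n);
    }
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

With one block, `blocked_sum` performs exactly the additions of `serial_sum`; with several blocks the two results can differ in the last few digits, because the small late terms are added to small partial sums instead of to an already large running total.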

The parallel_for case is very similar to the OpenMP one. The difference is that by default the iterations are distributed dynamically (see the documentation for auto_partitioner), so each partial sum contains a more or less random selection of terms. This not only gives a slightly different result than OpenMP, it can also give a slightly different result on each execution: two consecutive parallel_for runs with the same number of threads might differ a bit. If you replace the partitioner with an instance of simple_partitioner and set the chunk size to itera / number-of-threads, you should get the same result as in the OpenMP case, provided the reduction is performed in the same way.
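To illustrate why a dynamic distribution makes the result depend on the run, the sketch below imitates it on a single thread: chunks of the iteration range are handed to pseudo-randomly chosen workers, with a seed standing in for the timing differences between runs. It does not use the PPL at all; the function name, the chunking scheme and the seed parameter are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <random>
#include <vector>

// One term of the series from the question: 6 / n^2.
double term(int n) {
    return 6.0 / (static_cast<double>(n) * static_cast<double>(n));
}

// Imitates dynamic work distribution: the range 1..N is cut into chunks of
// `chunk` iterations and every chunk is added to the partial sum of a
// pseudo-randomly picked worker. Different seeds model different runs.
double dynamic_like_sum(int N, int workers, int chunk, unsigned seed) {
    assert(N > 0 && workers > 0 && chunk > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, workers - 1);
    std::vector<double> partial(workers, 0.0);
    for (int first = 1; first <= N; first += chunk) {
        // The final chunk may be shorter than `chunk`.
        const int last = (N - first < chunk) ? N : first + chunk - 1;
        double& acc = partial[pick(rng)];
        for (int n = first; n <= last; ++n) acc += term(n);
    }
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

The same seed always reproduces the same sum, while two different seeds may give sums that differ in the trailing digits, which mirrors the run-to-run variation described above.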

You could use Kahan summation for the per-thread sums and implement your own reduction that also uses Kahan summation. The parallel code should then produce the same (or at least a much more similar) result as the serial one.
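A minimal sketch of that suggestion follows, again run on a single thread so that it stays self-contained. `KahanSum`, `kahan_serial` and `kahan_blocked` are hypothetical names, and the way the partial results are merged (adding each partial sum and then its negated compensation) is one possible choice rather than a prescribed one. Note that aggressive floating-point optimisation settings such as fast-math modes may remove the compensation step.

```cpp
#include <cassert>
#include <vector>

// Kahan (compensated) accumulator: `comp` tracks the low-order bits that
// were lost when earlier values were added to `sum`.
struct KahanSum {
    double sum = 0.0;
    double comp = 0.0;
    void add(double x) {
        const double y = x - comp;   // apply the pending correction
        const double t = sum + y;    // low-order bits of y may be lost here
        comp = (t - sum) - y;        // recover what was actually lost
        sum = t;
    }
};

// One term of the series from the question: 6 / n^2.
double term(int n) {
    return 6.0 / (static_cast<double>(n) * static_cast<double>(n));
}

// Compensated version of the serial loop.
double kahan_serial(int N) {
    KahanSum acc;
    for (int n = 1; n <= N; ++n) acc.add(term(n));
    return acc.sum;
}

// Compensated partial sums over contiguous blocks, merged with a
// compensated reduction as well.
double kahan_blocked(int N, int threads) {
    assert(threads > 0 && N >= threads);
    std::vector<KahanSum> partial(threads);
    const int block = N / threads;
    for (int t = 0; t < threads; ++t) {
        const int first = 1 + t * block;
        const int last = (t == threads - 1) ? N : (t + 1) * block;
        for (int n = first; n <= last; ++n) partial[t].add(term(n));
    }
    KahanSum total;
    for (const KahanSum& p : partial) {
        total.add(p.sum);
        total.add(-p.comp);  // carry over the block's pending correction
    }
    return total.sum;
}
```

Because each compensated sum stays within a few units in the last place of the exact value, the serial and blocked variants should agree far more closely than their uncompensated counterparts, although bit-for-bit equality is still not guaranteed.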
