Monte Carlo simulation runs significantly slower than sequential

Problem description

I'm new to concurrent and parallel programming in general. I'm trying to calculate Pi using the Monte Carlo method in C. Here is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(void)
{
    long points;
    long m = 0;
    double coordinates[2];
    double distance;
    printf("Enter the number of points: ");
    scanf("%ld", &points);

    srand((unsigned long) time(NULL));
    for(long i = 0; i < points; i++)
    {
        coordinates[0] = ((double) rand() / (RAND_MAX));
        coordinates[1] = ((double) rand() / (RAND_MAX));
        distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));
        if(distance <= 1)
            m++;
    }

    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

When I try to make this program parallel using the OpenMP API, it runs almost 4 times slower.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{

    long total_points;              // Total number of random points which is given by the user
    volatile long total_m = 0;      // Total number of random points which are inside of the circle
    int threads = get_nprocs();     // This is needed so each thread knows how many random points it should generate
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel
    {
       double coordinates[2];          // Contains the x and y of each random point
       long m = 0;                     // Number of points that are in the circle for any particular thread
       long points = total_points / threads;   // Number of random points that each thread should generate
       double distance;                // Distance of the random point from the center of the circle, if greater than 1 then the point is outside of the circle
       srand((unsigned long) time(NULL));

        for(long i = 0; i < points; i++)
        {
           coordinates[0] = ((double) rand() / (RAND_MAX));    // Random x
           coordinates[1] = ((double) rand() / (RAND_MAX));    // Random y
           distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));   // Calculate the distance
          if(distance <= 1)
              m++;
       }

       #pragma omp critical
       {
           total_m += m;
       }
    }

    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);
}

I tried looking up the reason, but there were different answers for different algorithms.

Solution

There are two sources of overhead in your code: the critical region and the calls to rand(). rand() keeps one hidden state that is shared by every thread, so concurrent calls contend for it. Instead of rand(), use rand_r:

I think you're looking for rand_r(), which explicitly takes the current RNG state as a parameter. Then each thread should have its own copy of seed data (whether you want each thread to start off with the same seed or different ones depends on what you're doing; here you want them to be different or you'd get the same row again and again).
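As a minimal sketch of what "its own copy of seed data" can look like, the helper below keeps the RNG state in a local variable and passes it to rand_r() on every call. The name count_hits and its parameters are illustrative assumptions, not part of the original code.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes rand_r() when compiling with -std=c99 */
#include <stdlib.h>

/* Count how many of n pseudo-random points land inside the unit quarter
   circle. The RNG state lives in a local variable, so a thread calling
   this function never touches state owned by another thread. */
long count_hits(unsigned int seed, long n)
{
    unsigned int state = seed;   /* private RNG state, advanced by rand_r */
    long hits = 0;
    for (long i = 0; i < n; i++) {
        double x = (double) rand_r(&state) / RAND_MAX;
        double y = (double) rand_r(&state) / RAND_MAX;
        if (x * x + y * y <= 1.0)   /* same test as distance <= 1, without sqrt */
            hits++;
    }
    return hits;
}
```

Calling it twice with the same seed reproduces the same count, which is why each thread would need a different seed, for example one derived from omp_get_thread_num().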

The critical region can be removed by using the OpenMP reduction clause. Moreover, you need neither to call sqrt nor to divide the points among the threads manually (i.e., long points = total_points / threads;); you can use #pragma omp for for that. So your code would look like the following:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    long total_points; 
    long total_m = 0;
    int threads = get_nprocs();   
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel
    {
        unsigned int myseed = omp_get_thread_num();
        #pragma omp for reduction (+: total_m)
        for(long i = 0; i < total_points; i++){
            if(pow((double) rand_r(&myseed) / (RAND_MAX), 2) + pow((double) rand_r(&myseed) / (RAND_MAX), 2) <= 1)
                total_m++;
        }
    }
    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);
}

A quick test on my machine for an input of 1000000000:

sequential : 16.282835 seconds 
2 threads  :  8.206498 seconds  (1.98x faster)
4 threads  :  4.107366 seconds  (3.96x faster)
8 threads  :  2.728513 seconds  (5.96x faster)
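The "x faster" figures are the sequential time divided by the parallel time. A small sketch of that arithmetic follows; the function names speedup and efficiency are illustrative and do not come from the original answer.

```c
/* Speedup: how many times faster the parallel run is than the sequential one. */
double speedup(double t_sequential, double t_parallel)
{
    return t_sequential / t_parallel;
}

/* Parallel efficiency: speedup per thread; 1.0 would mean perfect scaling. */
double efficiency(double t_sequential, double t_parallel, int threads)
{
    return speedup(t_sequential, t_parallel) / threads;
}
```

With the timings above, speedup(16.282835, 4.107366) is about 3.96, while efficiency(16.282835, 2.728513, 8) is only about 0.75, which is consistent with running 8 threads on a machine that has 4 cores.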

Bear in mind that my machine has only 4 cores. Nevertheless, for a more meaningful comparison, one should optimize the sequential code as much as possible and then compare it with the parallel versions. Naturally, if the sequential version is as optimized as possible, the speedups of the parallel version may drop. For instance, testing the current parallel version, without modifications, against the sequential version of the code provided by @user3666197 yields the following results:

sequential :  9.343118 seconds 
2 threads  :  8.206498 seconds  (1.13x faster)
4 threads  :  4.107366 seconds  (2.27x faster)
8 threads  :  2.728513 seconds  (3.42x faster)

However, one could also improve the parallel version, and so on and so forth. For instance, if one takes @user3666197's version, fixes the race condition in the update of coordinates (which is shared among threads), and adds the OpenMP #pragma omp for, we have the following code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    double start = omp_get_wtime();
    long points = 1000000000; //....................................... INPUT AVOIDED
    long m = 0;
    unsigned long HAUSNUMERO = 1;
    double DIV1byMAXbyMAX = 1. / RAND_MAX / RAND_MAX;

    int threads = get_nprocs();
    omp_set_num_threads(threads);
    #pragma omp parallel reduction (+: m )
    {
        unsigned int aThreadSpecificSEED_x = HAUSNUMERO + 1 + omp_get_thread_num();
        unsigned int aThreadSpecificSEED_y = HAUSNUMERO - 1 + omp_get_thread_num();
        #pragma omp for nowait
        for(long i = 0; i < points; i++)
        {
            double x = rand_r( &aThreadSpecificSEED_x );
            double y = rand_r( &aThreadSpecificSEED_y );
            m += (1  >= ( x * x + y * y ) * DIV1byMAXbyMAX);
        }
    }
    double end = omp_get_wtime();
    printf("%f\n",end-start);
    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

which yields the following results:

sequential :  9.160571 seconds 
2 threads  :  4.769141 seconds  (1.92x faster)
4 threads  :  2.456783 seconds  (3.72x faster)
8 threads  :  2.203758 seconds  (4.15x faster)

I am compiling with the flags -O3 -std=c99 -fopenmp, and using the gcc version 4.9.3 (MacPorts gcc49 4.9.3_0).
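The tables list runs with 2, 4 and 8 threads, whereas the code takes its thread count from get_nprocs(); the answer does not say how the count was varied. One possible way, assumed here, is to pass it through the OpenMP num_threads clause. The function name estimate_pi and its parameters are hypothetical, and the _OPENMP guards are only there so the sketch also builds serially when -fopenmp is omitted.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes rand_r() when compiling with -std=c99 */
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Estimate Pi from `points` random samples using `threads` OpenMP threads.
   Each thread owns its RNG state, and the per-thread hit counts are
   combined by the reduction clause instead of a critical region. */
double estimate_pi(long points, int threads)
{
    long m = 0;
    #pragma omp parallel num_threads(threads) reduction(+: m)
    {
        unsigned int seed = 1u;   /* assumed base seed, offset per thread below */
#ifdef _OPENMP
        seed += (unsigned int) omp_get_thread_num();
#endif
        #pragma omp for
        for (long i = 0; i < points; i++) {
            double x = (double) rand_r(&seed) / RAND_MAX;
            double y = (double) rand_r(&seed) / RAND_MAX;
            m += (x * x + y * y <= 1.0);   /* adds 1 when the point is inside */
        }
    }
    return 4.0 * (double) m / (double) points;
}
```

Calling estimate_pi(1000000000, 2), then again with 4 and 8, and timing each call with omp_get_wtime() would give one row per thread count.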
