OpenMP and GSL RNG - Performance Issue - 4 threads implementation 10x slower than pure sequential one (quadcore CPU)


Problem description

I am trying to turn a C project of mine from sequential into parallel programming. Although most of the code has now been redesigned from scratch for this purpose, the generation of random numbers is still at its core, so poor performance of the random number generator (RNG) badly hurts the overall performance of the program.

I wrote a short program (see below) that shows the problem I am facing with as little code as possible.

The problem is the following: every time the number of threads nt increases, the performance gets significantly worse. On this workstation (Linux kernel 2.6.33.4; gcc 4.4.4; Intel quad-core CPU) the parallel for-loop takes roughly 10x longer to finish with nt=4 than with nt=1, regardless of the number of iterations n.

This situation seems to be described here, but the focus there is mainly on Fortran, a language I know very little about, so I would very much appreciate some help.

I tried to follow their idea of creating a separate RNG (with a different seed) for each thread, but the performance is still very bad. Actually, seeding each thread differently bugs me as well, because I cannot see how one can guarantee the quality of the generated numbers in the end (lack of correlations, etc.).

I have already thought of dropping GSL altogether and implementing a random generator algorithm (such as Mersenne-Twister) myself but I suspect I would just bump into the same issue later on.

Thank you very much in advance for your answers and advice. Please do ask anything important I may have forgotten to mention.

EDIT: Implemented corrections suggested by lucas1024 (pragma for-loop declaration) and JonathanDursi (seeding; setting "a" as a private variable). Performance is still very sluggish in multithread-mode.

EDIT 2: Implemented solution suggested by Jonathan Dursi (see comments).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    double d_t (struct timespec t1, struct timespec t2){

        return (t2.tv_sec-t1.tv_sec)+(double)(t2.tv_nsec-t1.tv_nsec)/1000000000.0;
    }

    int main (int argc, char *argv[]){

        double a, b;

        int i,j,k;

        int n=atoi(argv[1]), seed=atoi(argv[2]), nt=atoi(argv[3]);

        printf("\nn\t= %d", n);
        printf("\nseed\t= %d", seed);
        printf("\nnt\t= %d", nt);

        struct timespec t1, t2, t3, t4;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

        //initialize gsl random number generator
        const gsl_rng_type *rng_t;
        gsl_rng **rng;
        gsl_rng_env_setup();
        rng_t = gsl_rng_default;

        rng = (gsl_rng **) malloc(nt * sizeof(gsl_rng *));

        #pragma omp parallel for num_threads(nt)
        for(i=0;i<nt;i++){
            rng[i] = gsl_rng_alloc (rng_t);
            gsl_rng_set(rng[i],seed*i);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2);

        for (i=0;i<n;i++){
            a = gsl_rng_uniform(rng[0]);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3);

        omp_set_num_threads(nt);
        #pragma omp parallel private(j,a)
        {
            j = omp_get_thread_num();
            #pragma omp for
            for(i=0;i<n;i++){
                a = gsl_rng_uniform(rng[j]);
            }
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t4);

        printf("\n\ninitializing:\t\tt1 = %f seconds", d_t(t1,t2));
        printf("\nsequential for loop:\tt2 = %f seconds", d_t(t2,t3));
        printf("\nparallel for loop:\tt3 = %f seconds (%f * t2)", d_t(t3,t4), (double)d_t(t3,t4)/(double)d_t(t2,t3));
        printf("\nnumber of threads:\tnt = %d\n", nt);

        //free random number generator
        for (i=0;i<nt;i++)
            gsl_rng_free(rng[i]);
        free(rng);

        return 0;

    }

Solution

The problem is in the second #pragma omp line. The first #pragma omp spawns 4 threads. After that you are supposed to simply say #pragma omp for - not #pragma omp parallel for.

With the current code, depending on your OpenMP nesting settings, you are creating 4 x 4 threads that all do the same work and access the same data.
