C++ OpenMP working really slow on matrix-vector product


Question

I'm computing a matrix-vector product with OpenMP, but I've noticed it runs really slowly. After some time trying to figure out what's wrong, I deleted all the code in the parallel section, and it is still slow. What could the problem be? (n = 1000)

Here are the timing results for 1, 2 and 4 cores.

seq_method time = 0.001047194215062

parrallel_method (1) time = 0.001050273191140
seq - par = -0.000003078976079
seq/par = 0.997068404578433

parrallel_method (2) time = 0.001961992426004
seq - par = -0.000914798210943
seq/par = 0.533740192460558

parrallel_method (4) time = 0.004448095121916
seq - par = -0.003400900906854
seq/par = 0.235425319459132

Even when I delete the code from the parallel section, it doesn't change much.

void parallel_method(float A[n][n], float B[n], float C[n], int thr_num)
{
    double t1, t2;
    float tmp = 0;
    int i, j;
    t1 = omp_get_wtime();


    omp_set_dynamic(0);
    omp_set_num_threads(thr_num);
#pragma omp parallel for private(tmp, j, i)
    for (i = 0; i < n; i++) {
        tmp = 0;
        for (j = 0; j < n; j++) {
            tmp += A[i][j] * B[j];
        }
#pragma omp atomic
        C[i] += tmp;
    }

    //////
    t2 = omp_get_wtime();
    if (show_c) print_vector(C);
    par = t2 - t1;
    printf("\nparrallel_method (%d) time = %.15f", thr_num, par);
    printf("\nseq - par = %.15f", seq - par);
    printf("\nseq/par = %.15f\n", seq / par);
}

Code: https://pastebin.com/Q20t5DLk

Recommended answer

I tried to reproduce your problem but was not able to. I get completely consistent behavior.

n=100
sequential_method (0) time = 0.000023339001928
parallel_method (1) time = 0.000023508997401
parallel_method (2) time = 0.000013864002540
parallel_method (4) time = 0.000008979986887

n=1000
sequential_method (0) time = 0.001439775005565
parallel_method (1) time = 0.001437967992388
parallel_method (2) time = 0.000701391996699
parallel_method (4) time = 0.000372130998080

n=10000
sequential_method (0) time = 0.140988592000213
parallel_method (1) time = 0.133375317003811
parallel_method (2) time = 0.077803490007180
parallel_method (4) time = 0.044142695999355

Except for small sizes, where thread overhead is significant, the results are more or less what is expected.

What I did:

  • all measurements are done in the same run

  • I run all functions once without timing them, to warm up the caches

For real code measurements, I would also have:

  • timed several successive executions of the same function, especially if the time is short, in order to reduce small variations

  • run several experiments and kept the smallest time to suppress outliers (I prefer the minimum, but you can also compute the average).
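The measurement steps listed above (untimed warm-up, several successive executions per measurement, minimum over several experiments) could be sketched roughly as follows. This is only an illustration: the helper name `min_time_seconds` and its default counts are hypothetical, and it uses `std::chrono` rather than `omp_get_wtime()` so that it also builds without OpenMP.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper (not from the original post): returns the smallest
// observed time, in seconds, for a single call of f.
template <typename F>
double min_time_seconds(F&& f, int experiments = 10, int reps = 100) {
    assert(experiments > 0 && reps > 0);
    using clock = std::chrono::steady_clock;

    f();  // untimed warm-up call, so the caches are populated before timing

    double best = std::numeric_limits<double>::infinity();
    for (int e = 0; e < experiments; ++e) {
        const auto t0 = clock::now();
        for (int r = 0; r < reps; ++r) {
            f();  // several successive executions reduce small variations
        }
        const auto t1 = clock::now();
        const double per_call =
            std::chrono::duration<double>(t1 - t0).count() / reps;
        best = std::min(best, per_call);  // keep the minimum to suppress outliers
    }
    return best;
}
```

The function under test would be passed as a lambda that captures its arguments. If the timed call has no observable side effect, an optimizing compiler may remove it, so the result should be used somewhere after timing.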

You should have posted all your code, and I do not know what your methodology is. But I think that your measurements were done in different runs and without warming up the caches. For this code, cache impact is very important, and every core has to hold a copy of the same data (B). And the problem is not large enough to benefit from the larger combined L1/L2 cache capacity of several cores. These multiple loads may explain the worse performance of the parallel code.

One last remark on your code. Every thread will have its own values of i. Hence C[i] can be accessed by only one thread, and the atomic pragma is useless.
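To illustrate that remark, here is a minimal sketch of the loop without the atomic pragma. It is not the asker's exact code: the function name `matvec_parallel` is hypothetical, the matrix is assumed to be stored row-major in a `std::vector<float>`, and the result is written to a fresh vector instead of being accumulated into an existing C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch under assumptions: A is an n x n matrix in row-major order,
// B has n entries, and the result vector is returned by value.
std::vector<float> matvec_parallel(const std::vector<float>& A,
                                   const std::vector<float>& B) {
    const std::size_t n = B.size();
    assert(A.size() == n * n);
    std::vector<float> C(n, 0.0f);

    // Some compilers only accept a signed loop index in "omp for" loops.
    const long rows = static_cast<long>(n);
#pragma omp parallel for
    for (long i = 0; i < rows; ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        float tmp = 0.0f;  // declared inside the loop, so private to each thread
        for (std::size_t j = 0; j < n; ++j) {
            tmp += A[row * n + j] * B[j];
        }
        C[row] = tmp;  // each row is handled by exactly one thread: no atomic
    }
    return C;
}
```

Declaring `tmp` and the inner index inside the loop body makes the `private(tmp, j, i)` clause unnecessary. The thread count can still be chosen with `omp_set_num_threads`, as in the question. Built without OpenMP support, the pragma is ignored and the loop simply runs sequentially.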
