Not sure how to explain some of the performance results of my parallelized matrix multiplication code

Question

I'm running this code in OpenMP for matrix multiplication and I measured its results:

#pragma omp for schedule(static)
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

There are different versions of the code depending on where I put the #pragma omp directive: before the j loop, the k loop, or the i loop. For each of those variants I also ran versions with default static scheduling, static scheduling with chunk sizes 1 and 10, and dynamic scheduling with the same chunk sizes. I also measured the number of DC accesses, DC misses, CPU clocks, retired instructions, and other performance indicators in CodeXL. Here are the results for matrices of size 1000x1000 on an AMD Phenom I X4 945:
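To make the variants described above concrete, here is a minimal sketch of two of them. It is an illustration under assumptions, not the asker's actual code: the enclosing `#pragma omp parallel` regions, the 64x64 size, the `double` element type, the helper functions, and the name `multiply_matrices_3_static_10` (extrapolated from the naming scheme mentioned in the question) are all assumed.

```c
/* Sketch only: sizes, element type, helpers and the second function name
 * are assumptions made for illustration. */
#define ROWS 64      /* assumed small size; the question used 1000 */
#define COLUMNS 64

double matrix_a[ROWS][COLUMNS];
double matrix_b[COLUMNS][COLUMNS];
double matrix_r[ROWS][COLUMNS];

/* Directive before the j loop, dynamic scheduling with chunk 1:
 * each thread repeatedly grabs one column j of the result. */
void multiply_matrices_1_dynamic_1(void)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic, 1)
        for (int j = 0; j < COLUMNS; j++)
            for (int k = 0; k < COLUMNS; k++)
                for (int i = 0; i < ROWS; i++)
                    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}

/* Directive before the i loop, static scheduling with chunk 10:
 * every thread walks the j and k loops, and the rows i are split among
 * the threads for each (j, k) pair, with an implicit barrier after each
 * work-shared loop. */
void multiply_matrices_3_static_10(void)
{
    #pragma omp parallel
    {
        for (int j = 0; j < COLUMNS; j++)
            for (int k = 0; k < COLUMNS; k++) {
                #pragma omp for schedule(static, 10)
                for (int i = 0; i < ROWS; i++)
                    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
    }
}

/* Hypothetical helper: fill the inputs with small integer values so that
 * all sums are exact in double precision. */
void fill_inputs(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int k = 0; k < COLUMNS; k++)
            matrix_a[i][k] = (double)((i + k) % 7);
    for (int k = 0; k < COLUMNS; k++)
        for (int j = 0; j < COLUMNS; j++)
            matrix_b[k][j] = (double)((k * j) % 5);
}

/* Hypothetical helper: zero the result, since the kernels accumulate with +=. */
void clear_result(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLUMNS; j++)
            matrix_r[i][j] = 0.0;
}

/* Hypothetical helper: compare matrix_r with a serial dot-product reference.
 * Returns 1 if every element matches, 0 otherwise. */
int result_matches_reference(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLUMNS; j++) {
            double sum = 0.0;
            for (int k = 0; k < COLUMNS; k++)
                sum += matrix_a[i][k] * matrix_b[k][j];
            if (matrix_r[i][j] != sum)
                return 0;
        }
    return 1;
}
```

With GCC or Clang this would be compiled with `-fopenmp`; without that flag the pragmas are ignored and the functions run serially. A variant with the directive before the k loop is left out of the sketch because several threads would then update the same `matrix_r[i][j]`, so it would need an atomic update or a reduction to give correct results.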

[Figure: results of the performance evaluation]

Here multiply_matrices_1_dynamic_1 is the function with #pragma omp before the first loop and dynamic scheduling with chunk 1, and so on. Here are some things I don't quite understand about the results and would appreciate help with:
