Not sure how to explain some of the performance results of my parallelized matrix multiplication code
Question
I'm running this code in OpenMP for matrix multiplication and I measured its results:
#pragma omp for schedule(static)
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
There are different versions of the code based on where I put the #pragma omp directive - before the j loop, the k loop, or the i loop. Also, for each of those variants I ran different versions with the default static scheduling, static scheduling with chunk sizes 1 and 10, and dynamic scheduling with the same chunk sizes. I also measured the number of DC accesses, DC misses, CPU clocks, retired instructions, and other performance indicators in CodeXL. Here are the results for a matrix of size 1000x1000 on an AMD Phenom I X4 945:
Where multiply_matrices_1_dynamic_1 is the function with #pragma omp before the first loop and dynamic scheduling with chunk size 1, and so on. Here are some things I don't quite understand about the results and would appreciate help with:
- The default static version with parallelization before the inner loop runs in 2,512s, while the sequential version runs in 31,683s - despite the fact that it ran on a 4-core machine, so I assumed the biggest possible speedup would be 4x. Is it logically possible for that to occur, or is this some error? How can I explain it?
- CodeXL says that the 3rd version with static scheduling has a much lower number of DC accesses (and misses) than the other versions. Why is that? Isn't it largely because all the parallel threads operate on the same cell of the b matrix? Is that it?
- I know that the 2nd version is incorrect, because the threads might operate on the same cell of the R matrix, which can cause race conditions. But why does it have lower performance? Does the incorrectness somehow cause it? (See the sketch after this list.)
- I realize that dynamic scheduling causes overhead, which is why the code slows down when using it. Also, with the 3rd version (with the pragma before the i loop) the scheduling is performed many more times (a billion times with a 1000x1000 matrix), so the algorithm is much less performant. However, this scheduling doesn't seem to cause a slowdown in the 2nd version (with the pragma before the 2nd loop). Why is that?
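For reference, the way I would expect to make the 2nd version correct is to protect the shared update, roughly like this (a sketch, not the code I actually measured; it removes the race but typically makes the loop much slower):

#pragma omp parallel
for (int j = 0; j < COLUMNS; j++) {
    #pragma omp for schedule(static)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++) {
            /* atomic update prevents concurrent writes to matrix_r[i][j]
               by threads working on different values of k */
            #pragma omp atomic
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
}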
Also, I'm confused about the relation of TLB misses to cache misses. When exactly is the DTLB used? My professor's document says that every DC access is a DTLB request, but I don't understand how that works - the number of TLB misses is often bigger than the number of DC accesses. How do I compute the TLB miss ratio? My professor says it's TLB misses / DC accesses. He also says I can measure temporal locality with the cache hit ratio and spatial locality with the TLB hit ratio. How does that work exactly?
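If I understand his formula correctly, the arithmetic would be something like this (the counter values below are made up, just to illustrate; dc_accesses, dc_misses and dtlb_misses stand in for whatever CodeXL reports):

#include <stdio.h>

int main(void)
{
    /* made-up counter values standing in for profiler output */
    double dc_accesses = 4.0e9;  /* data cache accesses */
    double dc_misses   = 2.5e8;  /* data cache misses   */
    double dtlb_misses = 5.0e7;  /* DTLB misses         */

    double cache_hit_ratio = 1.0 - dc_misses / dc_accesses; /* temporal locality, per the professor */
    double tlb_miss_ratio  = dtlb_misses / dc_accesses;     /* TLB misses / DC accesses */
    double tlb_hit_ratio   = 1.0 - tlb_miss_ratio;          /* spatial locality, per the professor */

    printf("cache hit ratio: %f\n", cache_hit_ratio);
    printf("TLB miss ratio:  %f\n", tlb_miss_ratio);
    printf("TLB hit ratio:   %f\n", tlb_hit_ratio);
    return 0;
}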
Answer
Gilles has the right idea that your code is cache unfriendly, but his solution still has a similar problem because it does the reduction over k on matrix_b[k][j]. One solution is to calculate a transpose of matrix_b; then you can run over matrix_bT[j][k] over k, which is cache friendly. The transpose goes as O(n^2) and matrix multiplication as O(n^3), so the cost of the transpose goes as 1/n, i.e. for large n it becomes negligible.
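A sketch of what that transpose approach could look like (matrix_bT is just an illustrative name, its allocation is omitted, and the matrices are assumed to hold doubles; this is not exact code):

#pragma omp parallel
{
    /* build the transpose once: O(n^2) work */
    #pragma omp for schedule(static)
    for (int k = 0; k < COLUMNS; k++)
        for (int j = 0; j < COLUMNS; j++)
            matrix_bT[j][k] = matrix_b[k][j];

    /* O(n^3) multiplication: both operands are now read row-wise */
    #pragma omp for schedule(static)
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLUMNS; j++) {
            double sum = 0.0;
            for (int k = 0; k < COLUMNS; k++)
                sum += matrix_a[i][k] * matrix_bT[j][k];
            matrix_r[i][j] += sum;
        }
}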
But there is an even easier solution than using a transpose. Do the reduction over j like this:

#pragma omp for schedule(static)
for (int i = 0; i < ROWS; i++) {
    for (int k = 0; k < COLUMNS; k++) {
        for (int j = 0; j < COLUMNS; j++) {
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
    }
}
Gilles' method requires two reads from memory per iteration, whereas this solution requires two reads and a write to memory per iteration, but it's much more cache friendly, which more than makes up for the write to memory.
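For completeness, since #pragma omp for needs an enclosing parallel region to actually share work, the loop above would typically be used along these lines (a sketch; the wrapper name is hypothetical, and ROWS, COLUMNS and the matrices are assumed to be declared elsewhere, as in the question):

/* hypothetical wrapper around the i-k-j loop shown above */
void multiply_matrices_ikj(void)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < ROWS; i++)
            for (int k = 0; k < COLUMNS; k++)
                for (int j = 0; j < COLUMNS; j++)
                    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}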