Parallelization of elementwise matrix multiplication


Problem Description

I'm currently optimizing parts of my code and therefore perform some benchmarking.

I have NxN matrices A and T and want to multiply them elementwise and save the result in A again, i.e. A = A*T. As this code is not parallelizable in that form, I expanded the assignment into

!$OMP PARALLEL DO
do j = 1, N
    do i = 1, N
        A(i,j) = T(i,j) * A(i,j)
    end do
end do
!$OMP END PARALLEL DO

(Full minimal working example at http://pastebin.com/RGpwp2KZ.)

The strange thing is that regardless of the number of threads (between 1 and 4) the execution time stays more or less the same (±10%), while the CPU time increases with the number of threads. That made me think that all the threads are doing the same work (because I made a mistake with OpenMP) and therefore take the same amount of time.
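For reference, the two timings can be taken roughly as in the following minimal sketch (this is an assumption about the measurement setup, not the exact benchmark from the pastebin link; it uses omp_get_wtime for the wall time and the CPU_TIME intrinsic for the CPU time):

! timing_sketch.f90 -- a minimal sketch, not the original benchmark
! compile with: gfortran -fopenmp -O2 timing_sketch.f90
program timing_sketch
    use omp_lib
    implicit none
    integer, parameter :: N = 729
    complex(8), allocatable :: A(:,:), T(:,:)
    real(8) :: wall0, wall1, cpu0, cpu1
    integer :: i, j

    allocate(A(N,N), T(N,N))
    A = (1.0d0, 0.5d0)
    T = (0.5d0, 0.25d0)

    wall0 = omp_get_wtime()   ! wall-clock ("execution") time
    call cpu_time(cpu0)       ! CPU time; many implementations report it summed over all threads

    !$OMP PARALLEL DO
    do j = 1, N
        do i = 1, N
            A(i,j) = T(i,j) * A(i,j)
        end do
    end do
    !$OMP END PARALLEL DO

    call cpu_time(cpu1)
    wall1 = omp_get_wtime()
    print *, 'wall time [s]:', wall1 - wall0, '  CPU time [s]:', cpu1 - cpu0
end program timing_sketch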

But on another computer (which has 96 CPU cores available) the program behaves as expected: with an increasing number of threads the execution time decreases. Surprisingly, the CPU time decreases as well (up to ~10 threads, after which it rises again).

It might be that different versions of OpenMP or gfortran are installed. If this could be the cause, it would be great if you could tell me how to find that out.
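One way to check this: gfortran --version prints the compiler version, and the OpenMP specification supported by the installed toolchain can be queried with a tiny program like the one below (a minimal sketch; it assumes gfortran and uses the standard _OPENMP macro, so the preprocessor must be enabled, e.g. via a .F90 file extension):

! check_omp.F90 -- compile with: gfortran -fopenmp check_omp.F90 && ./a.out
! (the .F90 extension enables the preprocessor so the _OPENMP macro is visible)
program check_omp
    use omp_lib
    implicit none
#ifdef _OPENMP
    ! _OPENMP expands to the release date (yyyymm) of the OpenMP specification supported
    print '(a,i0)', 'OpenMP spec date (_OPENMP): ', _OPENMP
#endif
    print '(a,i0)', 'max threads: ', omp_get_max_threads()
end program check_omp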

Solution

You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:

!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)
!$OMP END PARALLEL WORKSHARE

Note that although this is quite standard OpenMP code, some compilers, most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by means of the SINGLE construct, therefore giving no parallel speed-up whatsoever. gfortran, on the other hand, converts this code fragment into an implicit PARALLEL DO loop. Also note that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:), as shown below.
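For clarity, here are the two spellings side by side, with the gfortran behaviour described above noted in the comments (a sketch of that statement, not the result of additional testing):

!$OMP PARALLEL WORKSHARE
A = T * A                    ! gfortran: not parallelised inside the worksharing construct
!$OMP END PARALLEL WORKSHARE

!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)     ! gfortran: converted into an implicit PARALLEL DO
!$OMP END PARALLEL WORKSHARE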

Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU.

Also, the multiplication code has very low compute intensity - it fetches 32 bytes of memory data per iteration and performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), then stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered to be full duplex, i.e. it supports writing while being read, then the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte.

It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth. For example, a typical x86 core can retire two floating-point operations per cycle, and if run at 2 GHz it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, the main memory would have to provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GiB/s. This more or less exceeds the real memory bandwidth of current generation x86 CPUs. And this is only for a single core with non-vectorised code.

Adding more cores and threads would not increase the performance since the memory simply cannot deliver data fast enough to keep the cores busy. Rather, the performance will suffer since having more threads adds more overhead. When run on a multisocket system like the one with 96 cores, the program gets access to more last-level cache and to higher main memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), so the performance increases, but only because there are more sockets and not because there are more cores.
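The arithmetic above can be reproduced with a few lines of code (a back-of-the-envelope sketch; it assumes N = 729 and complex(8) elements, as in the linked example):

! intensity_check.f90 -- reproduces the figures quoted above
program intensity_check
    implicit none
    integer, parameter :: N = 729
    real(8) :: col_bytes, mat_mib, flops, intensity_rw, intensity_duplex, bw_needed

    col_bytes = 16.0d0 * N                    ! one column of complex(8): 11664 bytes
    mat_mib   = col_bytes * N / 2.0d0**20     ! one matrix: about 8.1 MiB
    flops     = 6.0d0                         ! complex multiply: 4 mul + 1 add + 1 sub
    intensity_rw     = flops / 48.0d0         ! 32 B read + 16 B written = 1/8 FLOP/byte
    intensity_duplex = flops / 32.0d0         ! full-duplex view = 3/16 FLOP/byte
    bw_needed = 4.0d0 / intensity_duplex      ! GB/s needed to feed one 4 GFLOP/s core

    print '(a,f6.1,a)', 'one matrix:          ', mat_mib, ' MiB'
    print '(a,f6.3)',   'intensity (R+W):     ', intensity_rw
    print '(a,f6.3)',   'intensity (duplex):  ', intensity_duplex
    print '(a,f6.1,a)', 'bandwidth to feed one 4 GFLOP/s core: ', bw_needed, ' GB/s'
end program intensity_check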
