Matrix multiplication in CUDA


Problem Description



Say I want to multiply two matrices together, 50 by 50. I have two ways to arrange threads and blocks.

a) One thread calculates each element of the result matrix. So each thread has a loop that multiplies one row by one column.

b) One thread does each multiplication. Each element of the result matrix requires 50 threads. After the multiplications are done, I can use a binary reduction to sum the results (sketched below).

I wasn't sure which way to take, so I took (b). It wasn't ideal; in fact it was slow. Any idea why? My guess would be that there are just too many threads and they are waiting for resources most of the time. Is that true?
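For concreteness, a minimal CUDA sketch of what arrangement (b) could look like (the kernel name, the hard-coded sizes, and the 64-thread rounding are illustrative assumptions, not code from the question): one block per output element, one thread per product, then a shared-memory binary reduction.

    // Arrangement (b): one block per output element of C = A * B, N = 50.
    // Each thread computes one product; a binary reduction then sums them.
    #define N 50
    #define THREADS 64  // next power of two >= N, so the reduction halves cleanly

    __global__ void matmul_per_product(const float *A, const float *B, float *C)
    {
        int row = blockIdx.x;   // output row
        int col = blockIdx.y;   // output column
        int k   = threadIdx.x;  // which of the N products this thread computes

        __shared__ float partial[THREADS];
        partial[k] = (k < N) ? A[row * N + k] * B[k * N + col] : 0.0f;
        __syncthreads();

        // Binary reduction: halve the number of active threads each step.
        for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
            if (k < stride)
                partial[k] += partial[k + stride];
            __syncthreads();
        }
        if (k == 0)
            C[row * N + col] = partial[0];
    }

A launch would look like matmul_per_product<<<dim3(N, N), THREADS>>>(dA, dB, dC) - 2,500 blocks, each doing a synchronized reduction just to produce one number. The answer below explains why this arrangement performs poorly.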

Solution

As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.

If you are using one thread to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarithmic number of adds. That's three memory accesses for a multiply and an add and a bit - the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the ratio of memory access to work is poor.

The simple one-thread-per-dot-product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are far fewer threads now, which at least your (b) approach had working for you.
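For comparison, a hedged sketch of that one-thread-per-dot-product kernel, i.e. arrangement (a); the name and launch geometry are illustrative. The two global loads per multiply-add are visible in the inner loop:

    // Arrangement (a): one thread per output element. Each thread walks one
    // row of A and one column of B; two global loads per multiply-add, but
    // only a single store for the whole dot product.
    __global__ void matmul_naive(const float *A, const float *B, float *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];  // 2 loads, 1 FMA
            C[row * n + col] = sum;                      // 1 store total
        }
    }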

Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there is N^3 work to do in the multiplication, but only 3xN^2 elements - for N = 50 that's 125,000 multiply-adds touching only 7,500 numbers - so you should be able to find a way to do far more than one computation per two-ish memory accesses.

The approach taken in the CUDA SDK is the best way - the matrices are broken into tiles, and your (a) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire little sub-matrices from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful one in lots of applications, because getting data - whether it's over a network, from main memory for a CPU, or from off-chip memory for a GPU - often takes much longer than processing it.
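A sketch of that tiled scheme, in the spirit of the SDK's matrixMul example (the TILE constant and the kernel name are assumptions, and n is assumed to be a multiple of TILE for brevity):

    #define TILE 16  // assumed tile width; block is TILE x TILE threads

    // Tiled multiply: each block stages a TILE x TILE sub-matrix of A and of B
    // in fast shared memory, so every number fetched from global memory is
    // reused TILE times before being discarded.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            // Each thread loads one element of each tile from global memory.
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();

            // TILE multiply-adds per pair of global loads made above.
            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = sum;
    }

Each thread now does TILE multiply-adds for every pair of global loads, so the arithmetic intensity grows with the tile size, limited by the available shared memory.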

There are documents on NVidia's CUDA pages (especially http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
