Why can GPU do matrix multiplication faster than CPU?


Problem Description

I've been using a GPU for a while without questioning it, but now I'm curious.

Why can a GPU do matrix multiplication much faster than a CPU? Is it because of parallel processing? But I didn't write any parallel processing code. Does it do it automatically by itself?

Any intuition / high-level explanation would be appreciated!

Answer

How do you parallelize the computations?

GPUs are able to do a lot of parallel computations, far more than a CPU can. Look at this example: vector addition of, say, 1M elements.

Using a CPU, let's say you have a maximum of 100 threads you can run (real CPUs can run more, but let's assume 100 for a moment).

In a typical multi-threading example, let's say you parallelize the additions across all threads.

Here is what I mean:

c[0] = a[0] + b[0]        # let's do it on thread 0
c[1] = a[1] + b[1]        # let's do it on thread 1
# ... and so on, up to c[99] on thread 99 ...
c[100] = a[100] + b[100]  # thread 0 again, second round
c[101] = a[101] + b[101]  # thread 1 again, second round

We are able to do this because the value of c[0] doesn't depend on any values other than a[0] and b[0]. Each addition is independent of the others, so we can easily parallelize the task.

As you can see in the example above, the additions of 100 different elements happen simultaneously, saving you time. This way, it takes 1M/100 = 10,000 steps to add all the elements.
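
To make this concrete, here is a minimal CUDA sketch of such an element-wise kernel (the name vecAdd and the launch parameters are illustrative, not part of the original answer); each GPU thread computes exactly one output element:

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // Each thread computes its own global index and handles one element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may be slightly larger than n
        c[i] = a[i] + b[i];
}

// Host side: launch enough threads to cover all 1M elements,
// 256 threads per block -> (n + 255) / 256 blocks.
vecAdd<<<(1000000 + 255) / 256, 256>>>(d_a, d_b, d_c, 1000000);

Here d_a, d_b, and d_c are assumed to be device pointers already allocated with cudaMalloc.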

Now consider a modern GPU with something like 2048 threads running at once: all of those threads can independently perform 2048 different operations at the same time, hence the speed boost.

In your case of matrix multiplication, you can parallelize the computation because a GPU has many more threads, organized into blocks that each run many threads in parallel. So a large part of the computation runs concurrently, resulting in fast computation.
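
As an illustration, a naive sketch (not how tuned libraries such as cuBLAS actually implement it) assigns one GPU thread per output element:

// Naive CUDA matrix multiply: one thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all stored row-major.
__global__ void matMul(const float *A, const float *B, float *C,
                       int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)   // dot product: row of A with column of B
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

All M * N output elements are independent of one another, so they can all be computed concurrently; production libraries additionally tile the matrices into shared memory for speed.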

But I didn't write any parallel processing for my GTX 1080! Does it do it by itself?

Almost all machine learning frameworks use parallelized implementations of every operation that can be parallelized. This is achieved with CUDA programming, NVIDIA's API for parallel computation on NVIDIA GPUs. You don't write it explicitly; it is all done at a low level, and you never even see it.

That said, it doesn't mean a C++ program you wrote will automatically be parallelized just because you have a GPU. You would need to write it using CUDA; only then would it be parallelized. But most machine learning frameworks already include CUDA implementations, so nothing is required on your end.
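
For intuition, the kind of low-level work a framework hides looks roughly like the following sketch (reusing the illustrative matMul kernel above; h_A, h_B, h_C are assumed host arrays, and error handling is omitted):

// Roughly what a framework does behind the scenes for one GPU matrix multiply:
// allocate device memory, copy inputs over, launch the kernel, copy results back.
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, M * K * sizeof(float));
cudaMalloc(&d_B, K * N * sizeof(float));
cudaMalloc(&d_C, M * N * sizeof(float));
cudaMemcpy(d_A, h_A, M * K * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, K * N * sizeof(float), cudaMemcpyHostToDevice);

dim3 block(16, 16);                       // 256 threads per block
dim3 grid((N + 15) / 16, (M + 15) / 16);  // enough blocks to cover all of C
matMul<<<grid, block>>>(d_A, d_B, d_C, M, N, K);

cudaMemcpy(h_C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

Real frameworks also cache device allocations and dispatch to tuned libraries such as cuBLAS rather than a hand-written kernel, but the division of labor is the same: you call one function, and the parallel work happens on the GPU.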
