Amdahl's law and GPU


Question


I have a couple of doubts regarding the application of Amdahl's law to GPUs. For instance, I have a kernel code that I have launched with a number of threads, say N. So, in Amdahl's law, will the number of processors be N? Also, for any CUDA programming using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code? Thanks

Answer

For instance, I have a kernel code that I have launched with a number of threads, say N. So, in Amdahl's law, will the number of processors be N?

Not exactly. The GPU does not have as many physical cores (K) as the number of threads you can launch (N): usually K is around 10^3, while N is in the range 10^4 – 10^6. However, a significant portion of kernel time is (usually) spent just waiting for data to be read from / written to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K, but actually depends on your kernel function.
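As a rough illustration (my own sketch, not part of the original answer; the values of K and N0 below are made-up), this saturation at N0 can be modeled like so:

```python
# Toy model (assumed numbers): a GPU hides memory latency, so it can keep
# roughly N0 threads in flight even though it has only K physical cores.
def effective_parallelism(n_threads, n0):
    """Threads that actually make progress concurrently in this toy model."""
    return min(n_threads, n0)

def amdahl_speedup(p, n_effective):
    """Classic Amdahl speedup with n_effective concurrent workers."""
    return 1.0 / ((1.0 - p) + p / n_effective)

K = 1024       # physical cores (assumed)
N0 = 8 * K     # threads the device can keep busy (assumed multiple of K)
for n in (1_000, 10_000, 1_000_000):
    s = amdahl_speedup(0.99, effective_parallelism(n, N0))
    print(f"launched N = {n:>9,} threads -> modeled speedup {s:.1f}")
```

Beyond N0, launching more threads no longer increases the modeled speedup, which is why plugging the raw thread count N into Amdahl's law overestimates the gain.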

In my opinion, the best way to determine this N0 is to experimentally measure the performance of your application and then use this data to fit the parameters of Amdahl's law :)
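Such a fit could be sketched in Python like this (my own illustration, not from the answer; the measured speedups below are made-up numbers):

```python
# Fit the parallel fraction p in Amdahl's law to measured speedups
# via a simple least-squares scan over p in (0, 1).
def amdahl(n, p):
    """Predicted speedup with parallel fraction p on n concurrent workers."""
    return 1.0 / ((1.0 - p) + p / n)

def fit_parallel_fraction(threads, speedups, steps=10_000):
    """Scan candidate p values and keep the one minimizing squared error."""
    best_p, best_err = 0.0, float("inf")
    for i in range(1, steps):
        p = i / steps
        err = sum((amdahl(n, p) - s) ** 2 for n, s in zip(threads, speedups))
        if err < best_err:
            best_p, best_err = p, err
    return best_p

# Hypothetical measurements (made-up numbers for illustration):
threads = [1, 2, 4, 8, 16, 32]
speedups = [1.0, 1.9, 3.5, 6.0, 9.3, 14.0]
print(f"fitted parallel fraction p ≈ {fit_parallel_fraction(threads, speedups):.3f}")
```

With real timings in place of the made-up numbers, the fitted p (and the thread count at which the measured curve flattens) tells you both the effective serial fraction and an estimate of N0.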



Also, for any CUDA programming using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code?

This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider the time for the serial part.
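A short sketch (mine, not the answer's) makes the 1/(1-p) ceiling concrete: as N grows, the speedup approaches but never exceeds it.

```python
# Amdahl's speedup approaches the 1/(1-p) ceiling as the worker count N grows.
def amdahl_speedup(n, p):
    return 1.0 / ((1.0 - p) + p / n)

p = 0.99                    # assumed parallel fraction
ceiling = 1.0 / (1.0 - p)   # limit as N -> infinity
for n in (10, 100, 10_000, 1_000_000):
    print(f"N = {n:>9,}: speedup = {amdahl_speedup(n, p):7.2f}  (ceiling {ceiling:.0f})")
```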

E.g. if you compute the sum of two 100-element vectors on a GPU, then device initialization, data copying, kernel launch overhead, etc. (the serial part) take much more time than the kernel execution (the parallel part). However, this is usually not the case.

Also, an individual GPU core does not have the same performance as a CPU core, so you should do some scaling, making Amdahl's law 1 / [(1-p) + k*p/N] (at its simplest, k = Frequency(CPU) / Frequency(GPU); sometimes k is increased even more to take architectural differences into account, like a CPU core having a SIMD block).
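The scaled formula can be written out directly (a sketch: only the formula comes from the answer above; the clock frequencies are hypothetical):

```python
# Scaled Amdahl's law: 1 / [(1-p) + k*p/N], where k rescales per-core
# performance, e.g. k = Frequency(CPU) / Frequency(GPU).
def scaled_amdahl(n, p, k):
    return 1.0 / ((1.0 - p) + k * p / n)

# Hypothetical clocks: 3.5 GHz CPU core vs 1.4 GHz GPU core -> k = 2.5
k = 3.5 / 1.4
print(f"modeled speedup on 1024 cores: {scaled_amdahl(1024, 0.95, k):.1f}")
```

With k = 1 the formula falls back to classic Amdahl's law; k > 1 shrinks the benefit each GPU core contributes.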

I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not capture some non-trivial processes.

First, Amdahl's law assumes that, given an infinite number of cores, the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quantum", or include it in the serial portion of the algorithm, but it somewhat "breaks" the idea.

How to correctly estimate, within Amdahl's law, the effect of barrier synchronization, critical sections, atomic operations, etc. is, to the best of my knowledge, an unresolved mystery. Such operations belong to the parallel part, but the wall time of their execution is at best independent of the number of threads and, at worst, positively dependent on it.

A simple example: broadcasting time between computational nodes in a CPU cluster scales as O(log N). Some initialization can take up to O(N) time.

In simple cases one can roughly estimate the benefit of parallelizing the algorithm, but (as is often the case with CUDA) the static overhead of using parallel processing might take more time than the parallel processing itself saves.

So, in my opinion, it is usually simpler to write the application, measure its performance, and use the measurements to plot the Amdahl curve than to try to correctly estimate a priori all the nuances of the algorithm and hardware. In cases where such estimations can easily be made, they are usually obvious without any "laws".
