Amdahl's law and GPU


Question

I have a couple of doubts regarding the application of Amdahl's law to GPUs. For instance, I have a kernel that I launch with a number of threads, say N. So, in Amdahl's law, will the number of processors be N? Also, for any CUDA program using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code? Thanks

Answer

For instance, I have a kernel that I launch with a number of threads, say N. So, in Amdahl's law, will the number of processors be N?

Not exactly. A GPU does not have as many physical cores (K) as the number of threads you can launch (N): usually K is around 10^3, while N is in the range 10^4 to 10^6. However, a significant portion of kernel time is (usually) spent just waiting for data to be read from or written to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K but actually depends on your kernel function.
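As a rough illustration of why N0 can exceed K, here is a toy latency-hiding model. This is only a sketch: the function name and cycle counts are made up for illustration, not measured GPU figures.

```python
# Toy model of GPU latency hiding (illustrative only, not a real
# occupancy calculator): with K physical cores, and each thread
# alternating `compute_cycles` of arithmetic with `latency_cycles`
# of waiting on memory, roughly K * (compute + latency) / compute
# threads can be kept in flight before they compete for the cores.

def max_hidden_threads(K, compute_cycles, latency_cycles):
    """Threads the device can keep in flight without slowdown."""
    return K * (compute_cycles + latency_cycles) // compute_cycles

# e.g. 1024 cores, 4 cycles of math per 400-cycle memory stall:
print(max_hidden_threads(1024, 4, 400))  # N0 is far larger than K
```

The point of the model is only that a memory-bound kernel leaves each core idle most of the time, so N0 >> K.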

In my opinion, the best way to determine this N0 is to experimentally measure the performance of your application and then use that data to fit the parameters of Amdahl's law :)
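For example, one could fit the parallel fraction p to measured speedups with a simple grid search. This is a hypothetical sketch: the data points are fabricated for illustration, and real numbers would come from timing your kernel at several thread counts.

```python
# Fit the parallel fraction p of Amdahl's law to measured speedups
# by brute-force grid search (a sketch; the data below is made up).

def amdahl_speedup(n, p):
    return 1.0 / ((1.0 - p) + p / n)

def fit_parallel_fraction(measurements):
    """measurements: list of (threads, observed_speedup) pairs."""
    best_p, best_err = 0.0, float("inf")
    for i in range(1, 1000):          # try p = 0.001 .. 0.999
        p = i / 1000.0
        err = sum((amdahl_speedup(n, p) - s) ** 2
                  for n, s in measurements)
        if err < best_err:
            best_p, best_err = p, err
    return best_p

data = [(2, 1.9), (8, 6.0), (64, 15.0), (1024, 19.5)]  # fabricated
print(fit_parallel_fraction(data))  # close to 0.95 for this data
```

A least-squares fit from scipy would do the same job; the grid search just keeps the sketch dependency-free.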

Also, for any CUDA program using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code?

This assumption basically means that you neglect the time for the parallel part of your code (treating it as if it executed infinitely fast) and only consider the time for the serial part.
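A quick numeric check of that limit (the value p = 0.99 is just an illustration):

```python
# Amdahl's speedup 1/((1-p) + p/N) approaches 1/(1-p) as N grows;
# with p = 0.99 the ceiling is 100x no matter how many threads run.

def amdahl_speedup(n, p=0.99):
    return 1.0 / ((1.0 - p) + p / n)

for n in (10, 100, 10_000, 1_000_000):
    print(n, round(amdahl_speedup(n), 2))
# the values climb toward, but never reach, 1 / (1 - 0.99) = 100
```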

E.g. if you compute the sum of two 100-element vectors on a GPU, then initializing the device, copying data, kernel launch overhead, etc. (the serial part) take much more time than the kernel execution itself (the parallel part). However, usually this is not the case.

Also, an individual GPU core does not have the same performance as a CPU core, so you should do some scaling, making Amdahl's law 1 / [(1-p) + k*p/N] (at its simplest, k = Frequency(CPU) / Frequency(GPU); sometimes k is increased even more to take into account architectural differences, such as a CPU core having a SIMD block).
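A small sketch of the scaled formula. The values of p and k below are placeholder assumptions, not measurements of any real hardware:

```python
# Scaled variant from the answer: 1 / ((1-p) + k*p/N), where k
# accounts for a single GPU core being slower than a CPU core.

def scaled_amdahl(n, p, k):
    return 1.0 / ((1.0 - p) + k * p / n)

p, k = 0.99, 8.0   # assume a GPU core is ~8x slower (illustrative)
print(scaled_amdahl(1024, p, 1.0))  # unscaled speedup
print(scaled_amdahl(1024, p, k))    # scaled: noticeably smaller
```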

I could also argue against applying Amdahl's law literally to real systems. Sure, it shows the general trend, but it does not capture some non-trivial effects.

First, Amdahl's law assumes that, given an infinite number of cores, the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quantum", or include it in the serial portion of the algorithm, but it somewhat "breaks" the idea.

How to correctly estimate within Amdahl's law the effect of barrier synchronization, critical sections, atomic operations, etc. is, to the best of my knowledge, an unresolved mystery. Such operations belong to the parallel part, but the walltime of their execution is at best independent of the number of threads and, at worst, grows with it.

A simple example: the broadcast time between computational nodes in a CPU cluster scales as O(log N). Some initial setup can take up to O(N) time.
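One way to see the consequence is to add a hypothetical O(log N) communication term to the denominator of Amdahl's law. The constant c below is made up for illustration:

```python
import math

# Sketch extending Amdahl's law with an O(log N) communication term,
# as with the cluster-broadcast example above:
#   time(N) = (1-p) + p/N + c*log2(N)
# so the speedup eventually peaks and then *drops* as nodes are added.

def speedup_with_broadcast(n, p=0.99, c=0.002):
    return 1.0 / ((1.0 - p) + p / n + c * math.log2(n))

best = max(range(1, 4097), key=speedup_with_broadcast)
print(best, round(speedup_with_broadcast(best), 2))   # the peak
print(round(speedup_with_broadcast(4096), 2))         # past it: worse
```

Plain Amdahl's law never predicts such a peak; every added processor always helps, which is exactly what this term breaks.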

In simple cases one can roughly estimate the benefit of parallelizing an algorithm, but (as is often the case with CUDA) the static overhead of using parallel processing can cost more time than the parallel processing itself saves.

So, in my opinion, it is usually simpler to write the application, measure its performance, and use that to plot an Amdahl curve than to try to estimate a priori all the nuances of the algorithm and hardware correctly. In cases where such estimates can be made easily, they are usually obvious without any "laws".
