OpenCL CPU Device vs GPU Device


Question


Consider a simple example: vector addition.

If I build a program for CL_DEVICE_TYPE_GPU, and I build the same program for CL_DEVICE_TYPE_CPU, what is the difference between them (except that the "CPU program" is running on the CPU, and the "GPU program" is running on the GPU)?

Thanks for your help.

Solution

There are a few differences between the device types. The simple answer to your vector question is: use a gpu for large vectors, and a cpu for smaller workloads.

1) Memory copying. GPUs rely on the data you are working on being passed in to them, and the results are later read back to the host. This is done over PCI-e, which yields about 5 GB/s for version 2.0/2.1. CPUs can use buffers 'in place' - in DDR3 - using either of the CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR flags. See here: clCreateBuffer. This is one of the big bottlenecks for many kernels.

2) Clock speed. cpus currently have a big lead over gpus in clock speed: 2 GHz at the low end for most cpus, vs about 1 GHz at the top end for most gpus these days. This is one factor that really helps the cpu 'win' over a gpu for small workloads.

3) Concurrent 'threads'. High-end gpus usually have more compute units than their cpu counterparts. For example, the 6970 gpu (Cayman) has 24 opencl compute units, each of these is divided into 16 SIMD units. Most of the top desktop cpus have 8 cores, and server cpus currently stop at 16 cores. (cpu cores map 1:1 to compute unit count) A compute unit in opencl is a portion of the device which can do work that is different from the rest of the device.

4) Thread types. gpus have a SIMD architecture, with many graphics-oriented instructions. cpus devote a lot of their area to branch prediction and general computation. A cpu may have a SIMD unit and/or floating point unit in every core, but the Cayman chip I mentioned above has 1536 units with the gpu instruction set available to each one. AMD calls them stream processors, and there are 4 in each of the SIMD units mentioned above (24x16x4 = 1536). No cpu will have that many sin(x)- or dot-product-capable units unless the manufacturer wants to cut out some cache memory or branch prediction hardware. The SIMD layout of the gpus is probably the largest 'win' for large vector addition situations. That they also do other specialized functions is a big bonus.

5) Memory bandwidth. cpus with DDR3: ~17 GB/s. High-end gpus: >100 GB/s, and speeds over 200 GB/s are becoming common lately. If your algorithm is not PCI-e limited (see #1), the gpu will outpace the cpu in raw memory access. The scheduling units in a gpu can hide memory latency further by running only tasks that aren't waiting on memory access. AMD calls this a wavefront, Nvidia calls it a warp. cpus have a large and complicated caching system to help hide their memory access times when a program reuses data. For your vector add problem, you will likely be limited more by the PCI-e bus, since the vectors are generally used only once or twice each.

6) Power efficiency. A gpu (used properly) will usually be more electrically efficient than a cpu. Because cpus dominate in clock speed, one of the only ways to really reduce power consumption is to down-clock the chip. This obviously leads to longer compute times. Many of the top systems on the Green 500 list are heavily gpu-accelerated. See here: green500.org

