OpenCL local memory size and number of compute units


Problem Description

Each GPU device (AMD, NVIDIA, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So, one has (Compute Units) x (VertexShaders/compute unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16 KB or 32 KB) available per MultiProcessor. Hence, the exact number of these multiprocessors matters.

Now my questions:

  • (a) How can I know the number of multiprocessors on a device? Is this the same as CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
  • (b) How can I know how much __local memory per MP there is available on a GPU before buying it? Of course I can request CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don't see how I can deduce it from even an individual detailed specifications sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3?
  • (c) What is the card with currently the largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn't really matter, but 64KB (or larger) would give a clear benefit for the application I'm writing, since my algorithm is completely parallelizable, but also highly memory-intensive with random access pattern within each MP (iterating over edges of graphs).

Recommended Answer

  1. CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of Compute Units; otherwise you can glean it from the appropriate manuals (the AMD OpenCL programming guide and the NVIDIA OpenCL programming guide).
  2. The linked guide for AMD contains information about the available local memory per Compute Unit (generally 32 KB/CU). For NVIDIA, a quick Google search revealed this document, which gives the local memory size as 16 KB/CU for G80- and G200-based GPUs. For Fermi-based cards (GF100) there are 64 KB of on-chip memory available, which can be configured as either 48 KB local memory and 16 KB L1 cache, or 16 KB local memory and 48 KB L1 cache. Furthermore, Fermi-based cards have an L2 cache of up to 768 KB (768 KB for GF100 and GF110, 512 KB for GF104 and GF114, 384 KB for GF106 and GF116, and none for GF108 and GF118, according to Wikipedia).
  3. From the information above it would seem that current NVIDIA cards have the most local memory per Compute Unit. Furthermore, they are the only ones with a general L2 cache, from my understanding.
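Both device properties from point 1 can be read at runtime with the standard clGetDeviceInfo call; a minimal host-side sketch (error checking omitted for brevity, needs an OpenCL SDK and linking with -lOpenCL):

```c
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;

    /* Grab the first platform and its first GPU device. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_uint compute_units;
    cl_ulong local_mem;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);

    printf("Compute units:            %u\n", compute_units);
    printf("Local memory per CU (B):  %lu\n", (unsigned long)local_mem);
    return 0;
}
```

Note that CL_DEVICE_LOCAL_MEM_SIZE reports the local memory available to a single work-group, which on GPUs corresponds to the per-CU on-chip memory discussed above.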

For your usage of local memory you should however remember that local memory is allocated per workgroup (and only accessible to that workgroup), while a Compute Unit can typically sustain more than one workgroup. So if your algorithm allocates the whole local memory to one workgroup, you will not be able to achieve the maximum amount of parallelism. Also note that since local memory is banked, random access will lead to a lot of bank conflicts and warp serialization. So your algorithm might not parallelize quite as well as you think it will (or maybe it will, just mentioning the possibility).

With a Fermi-based card your best bet might be to count on the caches instead of explicit local memory, if all your workgroups operate on the same data (I don't know how to switch the L1/local memory configuration though).
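For what it's worth, OpenCL itself exposes no knob for the Fermi L1/shared-memory split, but NVIDIA's CUDA runtime does; a hedged sketch (requires the CUDA toolkit's cuda_runtime.h and an NVIDIA GPU, so it won't apply to the OpenCL path directly):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* Prefer 48 KB L1 cache / 16 KB shared memory for subsequent kernels;
       use cudaFuncCachePreferShared for the opposite 48/16 split. */
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaDeviceSetCacheConfig failed: %s\n",
                cudaGetErrorString(err));
    return 0;
}
```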
