OpenCL local memory size and number of compute units
Question
Each GPU device (AMD, NVIDIA, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So one has (Compute Units) x (VertexShaders per Compute Unit) simultaneous processors to compute with, but only a small, fixed amount of __local memory (usually 16 KB or 32 KB) is available per MultiProcessor. Hence the exact number of these multiprocessors matters.
Now my questions:
- (a) How can I know the number of multiprocessors on a device? Is this the same as CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
- (b) How can I know how much __local memory per MP is available on a GPU before buying it? Of course I can query CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don't see how I can deduce it even from an individual detailed specification sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3
- (c) What is the card with currently the largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn't really matter, but 64 KB (or larger) would give a clear benefit for the application I'm writing, since my algorithm is completely parallelizable but also highly memory-intensive, with a random access pattern within each MP (iterating over edges of graphs).
Answer
- CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of Compute Units; otherwise you can glean it from the appropriate manuals (the AMD OpenCL programming guide and the NVIDIA OpenCL programming guide).
- The linked guide for AMD contains information about the available local memory per compute unit (generally 32 KB/CU). For NVIDIA, a quick Google search revealed this document, which gives the local memory size as 16 KB/CU for G80- and G200-based GPUs. For Fermi-based cards (GF100) there are 64 KB of on-chip memory available, which can be configured as either 48 KB local memory and 16 KB L1 cache or 16 KB local memory and 48 KB L1 cache. Furthermore, Fermi-based cards have an L2 cache of up to 768 KB (768 KB for GF100 and GF110, 512 KB for GF104 and GF114, 384 KB for GF106 and GF116, and none for GF108 and GF118, according to Wikipedia).
- From the information above, it would seem that current NVIDIA cards have the most local memory per compute unit. Furthermore, from my understanding it is the only vendor with a general L2 cache.
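For questions (a) and (b), both values can be queried programmatically with clGetDeviceInfo on any machine with an OpenCL runtime installed. A minimal sketch (error handling omitted; assumes at least one platform with a GPU device is present, so it only illustrates the query calls rather than serving as robust code):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;

    /* Grab the first platform and its first GPU device. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_uint compute_units;
    cl_ulong local_mem_size;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_size), &local_mem_size, NULL);

    printf("Compute units:     %u\n", compute_units);
    printf("Local memory size: %lu bytes\n", (unsigned long)local_mem_size);
    return 0;
}
```

Note that CL_DEVICE_LOCAL_MEM_SIZE reports the local memory available to a single work-group, which on current hardware corresponds to the per-Compute-Unit figure discussed above.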
For your usage of local memory, however, you should remember that local memory is allocated per work-group (and is only accessible to that work-group), while a Compute Unit can typically sustain more than one work-group. So if your algorithm allocates the whole local memory to one work-group, you will not be able to achieve the maximum amount of parallelism. Also note that since local memory is banked, random access will lead to a lot of bank conflicts and warp serialization. So your algorithm might not parallelize quite as well as you think it will (or maybe it will; just mentioning the possibility).
With a Fermi-based card, if all your work-groups operate on the same data, your best bet might be to count on the caches instead of explicit local memory (I don't know how to switch the L1/local-memory configuration, though).