Arranging memory for OpenCL


Problem description


I have about 10 numpy arrays of n items each. An OpenCL work-item with global id i only looks at the i-th element of each array. How should I arrange the memory?


I was thinking of interleaving the arrays on the graphics card, but I'm not sure whether this would yield any performance gain, since I don't understand the work-group memory access pattern.

Recommended answer


I'm not familiar with numpy; however, if:

  • the work-item with global id i looks at the i-th element (as you mentioned),
  • the data type has proper memory alignment (4, 8, or 16 bytes), and
  • each work-item reads 32, 64, or 128 bits at once,


you should be able to achieve optimal memory throughput thanks to coalesced memory access. In that case, interleaving won't bring any performance gain.
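As a minimal sketch of that access pattern (the kernel name, argument names, and the use of float buffers are placeholders for illustration, not from the question): work-item i reads element i of each buffer, so neighbouring work-items touch neighbouring addresses and the reads coalesce.

__kernel void process(__global const float* a0,
                      __global const float* a1,
                      __global float* out)
{
    size_t i = get_global_id(0);
    // Work-items i and i+1 read adjacent 32-bit elements of a0 and a1,
    // so the hardware can combine the reads into wide transactions.
    out[i] = a0[i] + a1[i];
}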


If one of the last two points is not fulfilled, and you can achieve it by interleaving, you could see a performance gain.

Edit: Struct of Arrays (SoA) vs. Array of Structs (AoS)


This point comes up quite often in the literature. I'll keep it short:


Why is an SoA preferable to an AoS? Imagine you have 10 arrays of a 32-bit data type. The AoS solution would be as follows:

struct Data
{
   float a0;
   float a1; 
   ...
   float a9;
}; // 10 x 32 bits = 320 bits

struct Data array[512];


What will a memory read look like? Without any changes, the memory is poorly aligned and the memory transfers cannot be coalesced. However, the code needed to read an element is quite short:

struct Data a = array[i];


With some luck the compiler is smart enough to merge at least some of the read instructions. One option would be explicit memory alignment. This will cost you global memory, which is very limited on GPUs.
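For illustration, explicit alignment could look like the following sketch (the 64-byte target and the padding field are assumptions, not from the original answer). The padding spends 24 extra bytes per element, which is the global-memory cost mentioned above:

struct Data
{
    float a0, a1, a2, a3, a4;
    float a5, a6, a7, a8, a9;  // 10 x 32 bits = 320 bits of payload
    float pad[6];              // pad up to 64 bytes (512 bits)
} __attribute__((aligned(64)));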


Now the SoA solution:

struct Data
{
    float a0[512];
    float a1[512]; 
    ...
    float a9[512];
};

struct Data array;


Accessing the memory is a little more complex, but every access can be combined into a coalesced read, and no memory alignment is needed. You can also forget about the struct entirely and use each array as-is, without any performance penalty.
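A sketch of a kernel reading through the SoA layout (reusing struct Data from above, with the elided members filled in on the device side; the kernel and argument names are placeholders):

__kernel void process(__global const struct Data* d,
                      __global float* out)
{
    size_t i = get_global_id(0);  // i in [0, 512)
    // Consecutive work-items read consecutive elements of a0 (and of
    // a9), so each member array is accessed as one coalesced stream.
    out[i] = d->a0[i] + d->a9[i];
}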


Another thing you could use is vectorized data types (if your numpy arrays allow this). You can use float2 or float4 (or vector forms of other simple data types like int or double) to exploit combined memory transfers; e.g. every read from a float4 array is coalesced into a 128-bit memory transfer, maximizing memory throughput.
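A sketch of the vectorized variant (names are placeholders; it assumes n is a multiple of 4, and each work-item now handles four consecutive floats, so only n/4 work-items are launched):

__kernel void process4(__global const float4* a0,
                       __global float4* out)
{
    size_t i = get_global_id(0);
    // One float4 load per work-item = one 128-bit memory transaction.
    float4 v = a0[i];
    out[i] = v * 2.0f;
}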

