What's the advantage of the local memory in OpenCL?

Problem Description

I'm wondering what the advantage of local memory is. Since global memory can be accessed by every work-item separately and freely, can't we just use global memory?

For example, we have a 1000*1000 image and we want to add 1 to every pixel value. We can just use a 1000*1000 global memory buffer, right?

Will it be faster if we use local memory and split the 1000*1000 image into 100 parts of 100*100 each?

I would really appreciate it if you could give me a simple code example that uses local memory.

Recommended Answer


Can't we just use global memory?

Of course you can. First write actual working code. Then optimize.


Since global memory can be accessed by every work-item separately and freely

I'm not sure whether all architectures have a broadcasting ability, but I am sure that if memory is accessed randomly by all threads, it gets very slow. Ray tracing is an example: each ray refracts/reflects to different distances and therefore touches different memory areas, which is a performance hit. If every thread accessed global memory in a uniform way, it would be much faster.
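
As an illustration (not from the original answer), the difference is purely in the indexing pattern; the second kernel below, which gathers through an arbitrary index table the way ray hits scatter over memory, is the kind of access that becomes slow:

// coalesced: consecutive work-items read consecutive addresses
__kernel void copyCoalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];
}

// non-uniform: consecutive work-items jump to unrelated addresses
__kernel void copyGather(__global const float *in, __global const int *idx,
                         __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[idx[i]];   // scattered reads, poor coalescing/cache behaviour
}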


We can use a 1000*1000 global memory buffer, right?

The maximum buffer size has a guaranteed minimum value, which can be around 128 MB or 1/4 of the device memory. The combined size of all buffers varies with platforms/devices, in the range of several GB.
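
You can query the actual limits of your device at run time; a minimal host-side sketch (error handling and device selection omitted):

#include <stdio.h>
#include <CL/cl.h>

void print_mem_limits(cl_device_id dev)
{
    cl_ulong max_alloc = 0, global_size = 0;
    // largest single buffer allocation the device supports
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    // total global memory on the device
    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_size), &global_size, NULL);
    printf("max single buffer: %llu MB, total global mem: %llu MB\n",
           (unsigned long long)(max_alloc >> 20),
           (unsigned long long)(global_size >> 20));
}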


Will it be faster if we use local memory and split the 1000*1000 image into 100 parts of 100*100 each?

That depends on the data re-use ratio and on how coalesced the access pattern is. Random (non-coalesced) access to local memory is much faster than random (non-coalesced) access to global memory. If you use too much local memory or too large a private register file, it can even be slower, because higher local memory consumption leads to lower occupancy, less memory-latency hiding and more register spilling to global memory. Try to balance it by using private registers as well, or use a compression technique to fit more data into local memory.
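
To see how much local and private memory a compiled kernel actually consumes (and therefore how it may limit occupancy), you can ask the runtime; a sketch assuming the kernel and device handles already exist:

#include <stdio.h>
#include <CL/cl.h>

void print_kernel_usage(cl_kernel kernel, cl_device_id dev)
{
    cl_ulong local_bytes = 0, private_bytes = 0;
    size_t wg_size = 0;
    // local memory used by this kernel (explicit __local plus compiler-generated)
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_bytes), &local_bytes, NULL);
    // private (register/spill) memory per work-item, as reported by the implementation
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_bytes), &private_bytes, NULL);
    // largest work-group size the device can run this kernel with
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wg_size), &wg_size, NULL);
    printf("local: %llu B, private: %llu B, max work-group size: %zu\n",
           (unsigned long long)local_bytes, (unsigned long long)private_bytes, wg_size);
}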

If you re-use each data element, say, 256 times, then local memory access will be around 10-20x faster than global memory access.

Here is a very simple 2D n-body kernel for force calculations:

// global memory access is only 257 times per item, 1 for private save
//                                                  256 for global broadcast
//                                                  for global-to-local copy
// unoptimized version accesses 65537 times per item.
__kernel void nBodyF(__global float *x, __global float *y,
                     __global float *vx, __global float *vy,
                     __global float *fx, __global float *fy)
{
    int N=65536; // this is total number of masses for this example
    int LN=256;  // this is length of each chunk in local memory,
                 // means 256 masses per compute unit
    int i=get_global_id(0);  // global thread id keys 0....65535
    int L=get_local_id(0);   // local thread id keys 0...255 for each group
    float2 Fi=(float2)(0,0); // init
    float xi=x[i]; float yi=y[i]; // re-use for 65536 times
    __local float xL[256]; __local float yL[256]; // declare local mem arrays with constant length


    for(int k=0;k<N/LN;k++) // number of chunks to fetch from global to local
    {
        barrier(CLK_LOCAL_MEM_FENCE);  //synchronization
        xL[L]=x[k*LN+L]; yL[L]=y[k*LN+L]; //get 256-element chunks into local mem
        barrier(CLK_LOCAL_MEM_FENCE);  //synchronization
        for(int j=0;j<LN;j++)          //start processing local/private variables
        {
            float2 F=(float2)(0,0);          // private force vector init
            float2 r1=(float2)(xi,yi);       // private vector
            float2 r2=(float2)(xL[j],yL[j]); // use local mem to get r2 vector
            float2 dr=r1-r2;                 // private displacement
            F=dr/(0.01f+dot(dr,dr));         // private force calc.
            Fi.x-=F.x; Fi.y-=F.y;            // private force add to private
        }
    }
    fx[i]=Fi.x; fy[i]=Fi.y; // write result to global mem only once
}
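
To run this kernel, it would be launched with a global size of 65536 and a local size of 256, so that each work-group matches the LN-sized chunks; a hedged host-side fragment (buffers and kernel arguments assumed to be set up already, error checks omitted):

#include <CL/cl.h>

void run_nbody(cl_command_queue queue, cl_kernel nbody_kernel)
{
    size_t global_size = 65536;   // N in the kernel: one work-item per mass
    size_t local_size  = 256;     // LN in the kernel: must divide global_size
    clEnqueueNDRangeKernel(queue, nbody_kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);              // wait for completion
}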

The example above is poor in terms of local memory re-use ratio, but half of the variables are in private memory and are re-used 64k times.

Worst case:

  1) A big portion of the items cannot fit in the GPU cache.
  2) Only global memory accesses are done.
  3) Data is not re-used.
  4) Memory is accessed in a very non-uniform way.
  This will make it very slow.
  When the data doesn't fit in the cache and is not re-used, you should use
  __read_only for the necessary buffers (__write_only for writing); a short
  signature sketch follows below.
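
For completeness, a minimal sketch of what those qualifiers look like in a kernel signature (strictly, __read_only/__write_only are image access qualifiers; for plain buffers the nearest hint is const __global ... restrict). The inversion kernel itself is only my illustration, not code from the answer:

// read_only/write_only image qualifiers let the implementation use the texture path
__kernel void invert(__read_only image2d_t src, __write_only image2d_t dst)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 v = read_imagef(src, smp, pos);      // cached, read-only fetch
    write_imagef(dst, pos, (float4)(1.0f) - v); // write-only store
}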

If you do a convolution (or some anti-aliasing, or edge detection), the data re-use will be 4 to 20, and local memory optimization gives at least 3-4x performance.
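
A sketch of that idea for a simple 1D 3-tap blur (the kernel name, the tile size of 256 and the border handling are my assumptions, not code from the answer); every value loaded into local memory is re-used by up to three work-items:

// 1D 3-point average using a local tile; run with a local size of 256
__kernel void blur3(__global const float *in, __global float *out, int n)
{
    __local float tile[256 + 2];                          // 256 pixels + 1 halo cell per side
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;           // bulk load, coalesced
    if (lid == 0)                                         // left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (lid == get_local_size(0) - 1)                     // right halo
        tile[lid + 2] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < n)                                          // each loaded value is re-used ~3 times
        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
}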

If your GPU has 300 GB/s of global memory bandwidth, then its local memory bandwidth will be around 3-4 TB/s. You can optimize for private registers too! Then it could be 15-20 TB/s, but that type of optimization has fewer use cases.

Edit: if you are reading single bytes and these bytes differ from each other only by a small value (e.g. at most 16), then you can pack multiple values into single bytes and unpack them in local memory. Example:

  Global memory(copied to local mem): 
  Reference_byte   Byte0  byte1        byte2         byte3  
  128              +3,-5  +24,+50      -25,-63      0, +2

  Unpacking in local memory:
  Reference_byte   Byte0  byte1 byte2 byte3 Byte4  byte5 byte6 byte7      
  128              131    126   150   200   175    112   112   114

  Computing results on the array
  Reference_byte   Byte0  byte1 byte2 byte3 Byte4  byte5 byte6 byte7 
  128              120    130   140   150   150    150   100   110

  Packing results in local memory:
  Reference_byte   Byte0  byte1        byte2         byte3  
  128              -8,+10 +10,+10      0,0           -50, +10

  Global memory(copied from local mem): 
  Reference_byte   Byte0  byte1        byte2         byte3  
  128              -8,+10 +10,+10      0,0           -50, +10

  //Maybe a coordinate compression for a voxel rendering.
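
A possible sketch of that unpack step in a kernel. The 4-bit bias-8 delta layout, the chunk size of 256 packed bytes per work-group and all names below are my own assumptions for illustration; the answer does not fix an exact packing format (and the larger deltas in the table above would need more than 4 bits):

// each packed byte holds two chained signed deltas, applied starting from a reference byte
__kernel void unpackChunk(__global const uchar *packed, __global uchar *result,
                          uchar reference)
{
    __local uchar unpacked[512];                      // 256 packed bytes -> 512 values
    int lid = get_local_id(0);                        // local size assumed to be 256
    int group = get_group_id(0);

    if (lid == 0)                                     // sequential unpack of the chained deltas
    {
        uchar current = reference;
        for (int j = 0; j < 256; j++)
        {
            uchar p = packed[group * 256 + j];
            current = (uchar)(current + ((char)(p >> 4) - 8));    // upper nibble delta
            unpacked[2 * j] = current;
            current = (uchar)(current + ((char)(p & 0x0F) - 8));  // lower nibble delta
            unpacked[2 * j + 1] = current;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... all 256 work-items can now compute on 'unpacked' in local memory,
    //     re-pack the results the same way and write them back ...
    result[group * 512 + 2 * lid]     = unpacked[2 * lid];
    result[group * 512 + 2 * lid + 1] = unpacked[2 * lid + 1];
}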

Use a profiler that gives you cache-line usage info.
