CUDA coalesced memory load behavior


Problem description

I am working with an array of structures, and I want each block to load one cell of the array into shared memory. For example, block 0 will load array[0] into shared memory and block 1 will load array[1].

In order to do that, I cast the array of structures to float* to try to coalesce the memory accesses.

I have two versions of the code.

Version 1

__global__ 
void load_structure(float * label){

  __shared__ float shared_label[48*16];
  __shared__ struct LABEL_2D* self_label;


  shared_label[threadIdx.x*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) +threadIdx.x*16+threadIdx.y];
  shared_label[(threadIdx.x+16)*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) + (threadIdx.x+16)*16+threadIdx.y];
  if((threadIdx.x+32)*16+threadIdx.y < sizeof(struct LABEL_2D)/sizeof(float))  {
    shared_label[(threadIdx.x+32)*16+threadIdx.y] = 
          label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float) +(threadIdx.x+32)*16+threadIdx.y];
   }

  if(threadIdx.x == 0){
    self_label = (struct LABEL_2D *) shared_label;
  }
  __syncthreads();
  return;
}

...

dim3 dimBlock(16,16);
load_structure<<<2000,dimBlock>>>((float*)d_Label);

Computation time: 0.740032 ms

Version 2

__global__ 
void load_structure(float * label){

  __shared__ float shared_label[32*32];
  __shared__ struct LABEL_2D* self_label;

  if(threadIdx.x*32+threadIdx.y < sizeof(struct LABEL_2D)/sizeof(float))
    shared_label[threadIdx.x*32+threadIdx.y] = 
              label[blockIdx.x*sizeof(struct LABEL_2D)/sizeof(float)+threadIdx.x*32+threadIdx.y];


  if(threadIdx.x == 0){
      self_label = (struct LABEL_2D *) shared_label;
    }
  __syncthreads();
  return;
}

dim3 dimBlock(32,32);
load_structure<<<2000,dimBlock>>>((float*)d_Label);

Computation time: 2.559264 ms

In both versions, the NVIDIA profiler reports a global load efficiency of 8%.

I have two questions: 1 - I don't understand why there is a difference in timings. 2 - Are my loads coalesced?

I am using a video card with compute capability 2.1 (32 threads/warp).

Answer

I solved my problem: the memory access pattern was not correct in the previous versions. After reading section 6.2.1 of the CUDA Best Practices Guide, I discovered that accesses are faster if they are aligned.

To align my access pattern, I added a "fake" padding variable to the structure so that the structure size becomes divisible by 128 (the cache line size).

With this strategy I obtained good performance: loading 2000 structures into 2000 blocks took only 0.16 ms.

Here is this version of the code:

struct TEST_ALIGNED{
  float data[745];
  float aligned[23];
}; 


__global__
void load_structure_v4(float * structure){

  // Shared structure within a block
  __shared__ float s_structure[768];
  __shared__ struct TEST_ALIGNED * shared_structure;

  s_structure[threadIdx.x] = 
    structure[blockIdx.x*sizeof(struct TEST_ALIGNED)/sizeof(float) + threadIdx.x];
  s_structure[threadIdx.x + 256] = 
    structure[blockIdx.x*sizeof(struct TEST_ALIGNED)/sizeof(float) + threadIdx.x + 256];
  if(threadIdx.x < 745)
    s_structure[threadIdx.x + 512] = 
      structure[blockIdx.x*sizeof(struct TEST_ALIGNED)/sizeof(float) + threadIdx.x + 512];
  if(threadIdx.x == 0)
       shared_structure = (struct TEST_ALIGNED*) s_structure;

  __syncthreads();

  return;
}

dim3 dimBlock(256);
load_structure_v4<<<2000,dimBlock>>>((float*)d_test_aligned);

I am still looking for optimizations, and I will post here if I find any.
