Is 1D texture memory access faster than 1D global memory access?


Question

I am measuring the difference between standard and 1D texture access to global memory. To do so I have created two kernels:

__global__ void texture1D(float* doarray,int size)
{
  int index;
  //calculate each thread global index
  index=blockIdx.x*blockDim.x+threadIdx.x;
  //fetch global memory through texture reference
  doarray[index]=tex1Dfetch(texreference,index);
  return;
}
__global__ void standard1D(float* diarray, float* doarray, int size)
{
  int index;
  //calculate each thread global index
  index=blockIdx.x*blockDim.x+threadIdx.x;
  //fetch global memory directly
  doarray[index]= diarray[index];
  return;
}
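
The texture1D kernel reads through a texture reference that is not shown in the post. A minimal module-scope declaration, assuming a plain float texture bound to linear memory (the actual declaration may differ), would be:

//texture reference assumed by texture1D(); bound to diarray via cudaBindTexture() below
texture<float, cudaTextureType1D, cudaReadModeElementType> texreference;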

Then, I call each kernel and measure the time it takes:

//copy array from host to device memory
  cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);

  checkCuda( cudaEventCreate(&startEvent) );
  checkCuda( cudaEventCreate(&stopEvent) );
  checkCuda( cudaEventRecord(startEvent, 0) );

  //bind texture reference with linear memory
  cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

  //execute device kernel
  texture1D<<<(int)ceil((float)size/threadSize),threadSize>>>(doarray,size);

  //unbind texture reference to free resource
  cudaUnbindTexture(texreference);

  checkCuda( cudaEventRecord(stopEvent, 0) );
  checkCuda( cudaEventSynchronize(stopEvent) );

  //copy result array from device to host memory
  cudaMemcpy(horray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

  //check result
  checkResult(horray, harray, size);

  cudaEvent_t startEvent2, stopEvent2;
  checkCuda( cudaEventCreate(&startEvent2) );
  checkCuda( cudaEventCreate(&stopEvent2) );
  checkCuda( cudaEventRecord(startEvent2, 0) );
  standard1D<<<(int)ceil((float)size/threadSize),threadSize>>>(diarray,doarray,size);
  checkCuda( cudaEventRecord(stopEvent2, 0) );
  checkCuda( cudaEventSynchronize(stopEvent2) );

  //copy back to CPU
  cudaMemcpy(horray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

and print the results:

  float time,time2;
  checkCuda( cudaEventElapsedTime(&time, startEvent, stopEvent) );
  checkCuda( cudaEventElapsedTime(&time2, startEvent2, stopEvent2) );
  printf("Texture  bandwidth (GB/s): %f\n",bytes * 1e-6 / time);
  printf("Standard bandwidth (GB/s): %f\n",bytes * 1e-6 / time2);

It turns out that, no matter the size of the array I allocate (size), the standard bandwidth is always much higher. Is that how it is supposed to be, or am I screwing it up at some point? My understanding of texture memory access was that it can speed up global memory access.

Answer

I have made a comparison between global memory and texture memory (used for caching purposes only, not for filtering) for the interpolation of a 1D complex-valued function.

I am comparing 4 kernels: 2 using global memory and 2 using texture memory. They are distinguished according to the way complex values are accessed (as 1 float2 or as 2 separate floats) and are reported below. I will post the full Visual Studio 2010 project somewhere, in case someone would like to criticize it or perform their own testing.

//global memory version; complex values accessed as 2 separate floats (two threads per sample)
__global__ void linear_interpolation_kernel_function_GPU(float* __restrict__ result_d, const float* __restrict__ data_d, const float* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if(j<N)
    {
        float reg_x_out = x_out_d[j/2]+M/2;
        int k = __float2int_rz(reg_x_out);
        float a = reg_x_out - __int2float_rz(k);
        float dk = data_d[2*k+(j&1)];
        float dkp1 = data_d[2*k+2+(j&1)];
        result_d[j] = a * dkp1 + (-dk * a + dk);
    } 
}

//global memory version; complex values accessed as 1 float2 per thread
__global__ void linear_interpolation_kernel_function_GPU_alternative(float2* __restrict__ result_d, const float2* __restrict__ data_d, const float* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if(j<N)
    {
        float reg_x_out = x_out_d[j]+M/2;
        int k = __float2int_rz(reg_x_out);
        float a = reg_x_out - __int2float_rz(k);
        float2 dk = data_d[k];
        float2 dkp1 = data_d[k+1];
        result_d[j].x = a * dkp1.x + (-dk.x * a + dk.x);
        result_d[j].y = a * dkp1.y + (-dk.y * a + dk.y);
    } 
}

//texture memory version; complex values fetched as 1 float2 per thread
__global__ void linear_interpolation_kernel_function_GPU_texture(float2* __restrict__ result_d, const float* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if(j<N)
    {
        float reg_x_out = x_out_d[j]+M/2;
        int k = __float2int_rz(reg_x_out);
        float a = reg_x_out - __int2float_rz(k);
        float2 dk = tex1Dfetch(data_d_texture,k);
        float2 dkp1 = tex1Dfetch(data_d_texture,k+1);
        result_d[j].x = a * dkp1.x + (-dk.x * a + dk.x);
        result_d[j].y = a * dkp1.y + (-dk.y * a + dk.y);
    } 
}

//texture memory version; complex values fetched as 2 separate floats
__global__ void linear_interpolation_kernel_function_GPU_texture_alternative(float* __restrict__ result_d, const float* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if(j<N)
    {
        float reg_x_out = x_out_d[j/2]+M/4;
        int k = __float2int_rz(reg_x_out);
        float a = reg_x_out - __int2float_rz(k);
        float dk = tex1Dfetch(data_d_texture2,2*k+(j&1));
        float dkp1 = tex1Dfetch(data_d_texture2,2*k+2+(j&1));
        result_d[j] = a * dkp1 + (-dk * a + dk);
    } 
}
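
For completeness, the two texture kernels rely on module-scope texture references and host-side bindings that are not shown in the answer. A minimal sketch, assuming the complex input lives on the device as an array of float2 (the helper name bind_input_textures and its parameter names are made up for illustration):

//texture references assumed by the two texture kernels (legacy texture reference API)
texture<float2, cudaTextureType1D, cudaReadModeElementType> data_d_texture;
texture<float,  cudaTextureType1D, cudaReadModeElementType> data_d_texture2;

void bind_input_textures(float2* d_data, size_t num_samples)
{
    //bind both references to the same complex-valued device array;
    //tex1Dfetch then reads it either as float2 or as individual floats
    cudaBindTexture(NULL, data_d_texture,  d_data, num_samples * sizeof(float2));
    cudaBindTexture(NULL, data_d_texture2, d_data, num_samples * sizeof(float2));
}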



I have considered 4 different GPUs, namely a GeForce GT540M (cc 2.1), a Tesla C2050 (cc 2.0), a Kepler K20c (cc 3.5) and a GT210 (cc 1.2). The results are reported in the figures below. As can be seen, using textures as a cache improves on the use of global memory for the older compute capabilities, while the two solutions are pretty much equivalent for the newest architecture.
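
(A side note, not part of the original answer:) on cc 3.5 devices the const __restrict__ qualifiers already allow the compiler to service the global-memory loads from the same read-only data cache that backs texture fetches, which is consistent with the two approaches performing similarly on the K20c. The same path can be requested explicitly with __ldg(), for example inside the float2 global-memory kernel:

    //read data_d[k] and data_d[k+1] through the read-only data cache (cc 3.5+);
    //functionally equivalent to the plain loads in the kernel above
    float2 dk   = __ldg(&data_d[k]);
    float2 dkp1 = __ldg(&data_d[k+1]);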

Of course, this example is not exhaustive, and in practice there may be other cases in which one approach or the other should be preferred for a particular application.

P.S. The processing times are in [ms], not in [s] as indicated in the figure labels.



[Figure: results for the GeForce GT540M, https://i.stack.imgur.com/6HHMp.jpg]
