CUDA:__ restrict__标签用法 [英] CUDA: __restrict__ tag usage

查看:147
本文介绍了CUDA:__ restrict__标签用法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不太了解CUDA中__restrict__标记的概念.

我已经了解到,使用__restrict__可以避免指针混叠,特别是如果指向的变量是只读的,则由于已对其进行了缓存,因此优化了变量的读取.

这是代码的简化版本:

__constant__ float M[M_DIM1][M_DIM2];

__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]);

__global__ void kernel_function(const float* __restrict__ N, float *P);

__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]) {

    int IOSize = DIM1 * DIM2 * sizeof(float);
    int ConstSize = M_DIM1* M_DIM2* sizeof(float);
    float* dN, *dP;
    cudaMalloc((void**)&dN, IOSize);
    cudaMemcpy(dN, N, IOSize, cudaMemcpyHostToDevice);

    cudaMemcpyToSymbol(M, h_M, ConstSize);

    cudaMalloc((void**)&dP, IOSize);

    dim3 dimBlock(DIM1, DIM2);
    dim3 dimGrid(1, 1);

    kernel_function << <dimGrid, dimBlock >> >(dN, dP);

    cudaMemcpy(P, dP, IOSize, cudaMemcpyDeviceToHost);

    cudaFree(dN);
    cudaFree(dP);

}

我在N上使用__restrict__标记是正确的方式吗? 此外,我已经读到M上的关键字__constant__表示它是只读的并且是常量,那么两者之间的区别是什么,分配类型是什么?

nvcc使用的

解决方案

__restrict__已记录在标准.

简而言之,__restrict__是您作为程序员与编译器签订的合同,其中大致表示:我将仅使用此指针来引用基础数据".从编译器的角度来看,这成为表格的关键问题之一就是指针别名,这会阻止编译器进行各种优化.

如果您想要关于restrict__restrict__确切定义的更长篇正式论文,请参考我已经给出的链接之一,或者进行一些研究.

因此,出于优化目的,__restrict__通常对支持它的编译器很有用.

对于计算能力为3.5或更高版本的设备,这些设备具有称为 GPU上的硬件资源.有很多区别:

  • __constant__在所有GPU上均可用,只读缓存仅在cc3.5及更高版本上使用
  • 使用__constant__标记分配的
  • 内存(包含在该行中以指定内存分配)被限制为最大64KB.只读缓存没有这样的限制.我们不将__restrict__放在分配内存的行上.它用于装饰指针.
  • 缓存在只读缓存中的
  • 数据具有典型的全局内存访问注意事项-通常,我们希望通过相邻访问和连续访问来最好地合并通过只读缓存读取的全局内存. OTOH是__constant__机制,期望所谓的 uniform 访问具有最快的性能.统一访问本质上是指扭曲中的每个线程都从相同的位置/地址/索引请求数据.

从内核代码的角度来看,__constant__内存和在传递给内核代码的指针上标记有const装饰器的全局内存都是只读的.

无论使用__restrict__还是其他方法,在显示的代码中我都看不到任何明显的问题.我唯一要说的是,为了获得最大利益,您可能想用__restrict__装饰内核声明/原型中的NP指针,以获得最大利益,如果这是您的意图. (显然,您不会用const装饰P.)

I don't quite understand the concept of the __restrict__ tag in CUDA.

I've read that using __restrict__ avoids pointers aliasing and in particular, if the variable pointed at is read-only, the reading of the variable is optimized because it's cached.

This is a simplified version of the code:

__constant__ float M[M_DIM1][M_DIM2];

__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]);

__global__ void kernel_function(const float* __restrict__ N, float *P);

__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]) {

    int IOSize = DIM1 * DIM2 * sizeof(float);
    int ConstSize = M_DIM1* M_DIM2* sizeof(float);
    float* dN, *dP;
    cudaMalloc((void**)&dN, IOSize);
    cudaMemcpy(dN, N, IOSize, cudaMemcpyHostToDevice);

    cudaMemcpyToSymbol(M, h_M, ConstSize);

    cudaMalloc((void**)&dP, IOSize);

    dim3 dimBlock(DIM1, DIM2);
    dim3 dimGrid(1, 1);

    kernel_function << <dimGrid, dimBlock >> >(dN, dP);

    cudaMemcpy(P, dP, IOSize, cudaMemcpyDeviceToHost);

    cudaFree(dN);
    cudaFree(dP);

}

Am I using the __restrict__ tag on N, that is read-only, in the right way? Furthermore, I've read that the keyword __constant__ on M means that is read-only and constant, so what is the difference between the two of them, the type of allocation?

解决方案

__restrict__ as used by nvcc is documented here. (note that various c++ compilers including gnu compilers also have support for this exact keyword, and use it similarly).

It has essentially the same semantics as the C99 restrict keyword, which is an official part of that language standard.

In a nutshell, __restrict__ is a contract that you as a programmer make with the compiler, which says, roughly, "I will only use this pointer to refer to the underlying data". One of the key things that this takes off the table from the compiler's perspective is pointer aliasing, which can prevent the compiler from being able to make various optimizations.

If you'd like a longer formal treatise on the exact definition of restrict or __restrict__, please refer to one of the links I've already given, or do some research.

So, __restrict__ is generally useful to compilers that support it, for optimization purposes.

For compute capability 3.5 or higher devices, these devices have a separate cache called the read only cache which is independent of normal L1 type caching.

If you use both __restrict__ and const to decorate global pointers passed to a kernel, then this is also a strong hint to the compiler, when generating code for cc3.5 and higher devices, to cause those global memory loads to flow through the read-only cache. This can provide application performance benefits, often with little other code refactoring. This doesn't guarantee usage of the read-only cache, and the compiler will often attempt to aggressively use the read only cache if it can satisfy the necessary conditions, even if you don't use these decorators.

__constant__ refers to a different hardware resource on the GPU. There are many differences:

  • __constant__ is available on all GPUs, the read-only cache only on cc3.5 and higher
  • memory allocated using the __constant__ tag (which is included on the line to designate the allocation of memory) is limited to a maximum of 64KB. The read-only cache has no such limit. We don't put __restrict__ on a line that allocates memory; it is used to decorate a pointer.
  • data cached in the read-only cache has the typical global memory access considerations - normally we want adjacent and contiguous access for best coalescing of global memory reads through the read-only cache. The __constant__ mechanism, OTOH, expects so-called uniform access for fastest performance. Uniform access essentially means that every thread in a warp is requesting data from the same location/address/index.

Both __constant__ memory, and global memory marked with const decorator on the pointer passed to kernel code, are read-only from the perspective of kernel code.

I don't see any obvious problems in the code you have shown, whether with use of __restrict__ or anything else. The only comment I would have is that for maximal benefit you may want to decorate both the N and P pointers in your kernel declaration/prototype with __restrict__, for maximal benefit, if that is your intent. (You would not decorate P with const, obviously.)

这篇关于CUDA:__ restrict__标签用法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆