CUDA: __restrict__ tag usage
I don't quite understand the concept of the __restrict__
tag in CUDA.
I've read that using __restrict__
avoids pointer aliasing and, in particular, that if the variable pointed to is read-only, reads of it are optimized because it is cached.
This is a simplified version of the code:
__constant__ float M[M_DIM1][M_DIM2];
__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]);
__global__ void kernel_function(const float* __restrict__ N, float *P);
__host__ void function(float N[][DIM2], float h_M[][M_DIM2], float P[][DIM2]) {
    int IOSize = DIM1 * DIM2 * sizeof(float);
    int ConstSize = M_DIM1 * M_DIM2 * sizeof(float);
    float *dN, *dP;
    cudaMalloc((void**)&dN, IOSize);
    cudaMemcpy(dN, N, IOSize, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(M, h_M, ConstSize);
    cudaMalloc((void**)&dP, IOSize);
    dim3 dimBlock(DIM1, DIM2);
    dim3 dimGrid(1, 1);
    kernel_function<<<dimGrid, dimBlock>>>(dN, dP);
    cudaMemcpy(P, dP, IOSize, cudaMemcpyDeviceToHost);
    cudaFree(dN);
    cudaFree(dP);
}
Am I using the __restrict__
tag on N, which is read-only, in the right way?
Furthermore, I've read that the keyword __constant__
on M means that it is read-only and constant, so what is the difference between the two of them, and between the types of allocation?
__restrict__
as used by nvcc
is documented in the CUDA documentation. (Note that various C++ compilers, including GNU compilers, also support this exact keyword and use it similarly.)
It has essentially the same semantics as the C99 restrict
keyword, which is an official part of that language standard.
In a nutshell, __restrict__
is a contract that you as a programmer make with the compiler, which says, roughly, "I will only use this pointer to refer to the underlying data". One of the key things this takes off the table, from the compiler's perspective, is pointer aliasing, which can otherwise prevent the compiler from making various optimizations.
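As a sketch of what aliasing takes away (hypothetical kernel names, not from the question): without __restrict__, the compiler must assume a store through one pointer may change data reachable through another.

```cuda
// Hypothetical illustration. Without __restrict__, the compiler must assume
// out and in may overlap, so after the first store it has to reload in[i]
// from memory before the second store.
__global__ void duplicate_may_alias(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[2 * i]     = in[i];  // this store could, in principle, hit in[]
        out[2 * i + 1] = in[i];  // so in[i] must be loaded again here
    }
}

// With __restrict__ the programmer promises the ranges do not overlap, so
// in[i] can be loaded once, kept in a register, and stored twice.
__global__ void duplicate_no_alias(const float * __restrict__ in,
                                   float       * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[2 * i]     = in[i];
        out[2 * i + 1] = in[i];
    }
}
```

Note that the contract cuts both ways: launching the __restrict__ version with overlapping in and out ranges would be undefined behavior.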
If you'd like a longer formal treatise on the exact definition of restrict
or __restrict__
, please refer to one of the links I've already given, or do some research.
So, __restrict__
is generally useful to compilers that support it, for optimization purposes.
Devices of compute capability 3.5 or higher have a separate cache, called the read-only cache, which is independent of normal L1-type caching.
If you use both __restrict__
and const
to decorate global pointers passed to a kernel, then this is also a strong hint to the compiler, when generating code for cc3.5 and higher devices, to route those global memory loads through the read-only cache. This can provide application performance benefits, often with little other code refactoring. It doesn't guarantee usage of the read-only cache, and conversely the compiler will often aggressively use the read-only cache if it can satisfy the necessary conditions, even if you don't use these decorators.
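A minimal sketch of such a decorated kernel (a hypothetical saxpy, not from the question):

```cuda
// On cc3.5+ the combination of const and __restrict__ on x hints the
// compiler to issue those loads through the read-only cache.
__global__ void saxpy(float a,
                      const float * __restrict__ x,  // read-only: eligible
                      float       * __restrict__ y)  // written: not eligible
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];
    // An explicit alternative for the load is the __ldg() intrinsic:
    //   y[i] = a * __ldg(&x[i]) + y[i];
}
```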
__constant__
refers to a different hardware resource on the GPU. There are many differences:
- __constant__ is available on all GPUs; the read-only cache is used only on cc3.5 and higher.
- Memory allocated with the __constant__ tag (which is included on the line that designates the allocation of memory) is limited to a maximum of 64KB. The read-only cache has no such limit. We don't put __restrict__ on a line that allocates memory; it is used to decorate a pointer.
- Data cached in the read-only cache is subject to the typical global memory access considerations: normally we want adjacent, contiguous access for best coalescing of global memory reads through the read-only cache. The __constant__ mechanism, OTOH, expects so-called uniform access for fastest performance. Uniform access essentially means that every thread in a warp requests data from the same location/address/index.
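The access-pattern difference can be sketched like this (hypothetical names, assuming a small coefficient table):

```cuda
__constant__ float coeffs[16];  // served by the constant cache

__global__ void apply(const float * __restrict__ in, float *out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Uniform access: every thread in the warp reads coeffs[k], the same
        // address -- the best case for the __constant__ mechanism.
        float c = coeffs[k];
        // Coalesced access: adjacent threads read adjacent in[] elements --
        // the best case for global loads through the read-only cache.
        out[i] = c * in[i];
    }
}
```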
Both __constant__
memory and global memory marked with the const
decorator on a pointer passed to kernel code are read-only from the perspective of kernel code.
I don't see any obvious problems in the code you have shown, whether with the use of __restrict__
or anything else. The only comment I have is that, for maximal benefit, you may want to decorate both the N
and P
pointers in your kernel declaration/prototype with __restrict__
, if that is your intent. (You would not decorate P
with const
, obviously.)
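Applied to the prototype from the question, that suggestion would look like:

```cuda
__global__ void kernel_function(const float * __restrict__ N,
                                float       * __restrict__ P);
```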