CUDA dynamic parallelism and global memory synchronization
Problem description
I can't figure out the following.
If I launch a kernel and consider, for example, thread 0 in block 0, then after a __syncthreads() call, will all the other threads in all the other blocks see the changes made to global memory by thread 0 in block 0?

My guess is no. Indeed, in the Synchronization Functions section of the CUDA C Programming Guide, it is stated that:
void __syncthreads();

waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

However, when talking about global memory consistency in dynamic parallelism, the CUDA C Programming Guide states that:
Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.

So does __syncthreads() also make the changes available across blocks when dynamic parallelism is involved?

Thanks
Solution

The only action performed by __syncthreads() is the one you quoted yourself from the CUDA C Programming Guide. Apart from the naive approach of splitting the execution of a kernel into multiple kernel launches, with all its drawbacks in terms of performance, CUDA has no way to synchronize across blocks. Accordingly, the answer to your first question is, as you also guessed yourself, NO.

In the second part of your post, you are referring to a specific example from the CUDA C Programming Guide, namely:
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();
    }
    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}
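The guide's host_launch wrapper leaves out allocation, transfer back, and checking. A minimal host-side driver for this listing might look as follows; the 256-element buffer size matches the launches above, but everything else (names, the printf check) is an illustrative assumption, not part of the guide's example:

```cuda
#include <cstdio>

// Assumes the child_launch / parent_launch / host_launch definitions
// from the listing above are present in the same file.
int main() {
    int *data;
    cudaMalloc(&data, 256 * sizeof(int));   // device buffer the kernels operate on

    host_launch(data);                      // runs parent_launch<<< 1, 256 >>>(data)
    cudaDeviceSynchronize();                // wait for parent and child grids to finish

    int host_data[256];
    cudaMemcpy(host_data, data, 256 * sizeof(int), cudaMemcpyDeviceToHost);

    // The parent wrote data[i] = i and the child incremented every element,
    // so we expect host_data[i] == i + 1 for all i.
    printf("data[0] = %d, data[255] = %d\n", host_data[0], host_data[255]);

    cudaFree(data);
    return 0;
}
```

Note that dynamic parallelism requires a device of compute capability 3.5 or higher and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true.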
Here, all 256 threads of the parent_launch kernel write something to data. After that, thread 0 invokes child_launch. The first __syncthreads() is needed to ensure that all the memory writes have completed before the child kernel invocation. Quoting the guide on this point:

Due to the first __syncthreads() call, the child will see data[0] = 0, data[1] = 1, ..., data[255] = 255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child).

Regarding the second __syncthreads(), the guide explains that:

When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.

In the guide's example the second __syncthreads() is redundant, since there is an implicit synchronization due to the kernel termination, but it becomes necessary when other operations must be performed after the child kernel launch.

Finally, concerning the sentence you are quoting in your post:
Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.

please note that in the specific example there is only one thread block launched by the host_launch function. This may have somewhat misled you.

There is an interesting discussion (probably even more than one) on the NVIDIA Forum about thread synchronization across blocks, entitled
Synchronize all blocks in CUDA
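As for the cross-block synchronization discussed at the beginning of this answer, the usual workaround is the one already mentioned: split the work into consecutive kernel launches. Kernels issued to the same stream execute in order, so the boundary between two launches acts as a grid-wide barrier. A minimal sketch of that pattern follows; the kernel names and the two-phase split are illustrative assumptions, not from the guide:

```cuda
__global__ void phase1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;                 // every block writes its part
}

__global__ void phase2(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Safe across blocks: phase1 has fully completed before phase2 starts,
    // so the global-memory writes of ALL blocks are visible here.
    if (i < n) out[i] = in[i] + in[n - 1 - i];
}

void run(int *data, int *out, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    phase1<<< blocks, threads >>>(data, n);
    // Same-stream launches execute in order: the gap between the two
    // launches is the grid-wide synchronization point.
    phase2<<< blocks, threads >>>(data, out, n);
    cudaDeviceSynchronize();
}
```

Inside a single launch, by contrast, no ordering between blocks is guaranteed, which is exactly why __syncthreads() cannot be used for this purpose.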