CUDA dynamic parallelism and global memory synchronization


Problem description


I can't figure out the following.

If I launch a kernel and consider, for example, thread 0 in block 0, after a __syncthreads() call, will all the other threads in all the other blocks see the changes made to global memory by thread 0 in block 0?

My guess is no. Indeed, in the synchronization functions section of the CUDA C Programming Guide, it is stated that:

void __syncthreads(); waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

However, when talking about global memory consistency in dynamic parallelism, the CUDA C Programming Guide states that:

Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.

So, when dynamic parallelism is involved, does __syncthreads() also make the changes available across blocks?

Thanks

Solution

The only guarantee provided by __syncthreads() is the one you quoted from the CUDA C Programming Guide. There is no way in CUDA to synchronize across blocks, apart from the naive approach of splitting the kernel's execution into multiple kernel launches, with all the performance drawbacks that entails. Accordingly, the answer to your first question is, as you also guessed yourself, no.
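To illustrate the "multiple kernel launches" workaround mentioned above, here is a minimal sketch (all names are hypothetical, not from the original post): the boundary between two kernel launches on the same stream acts as a device-wide barrier, so the second kernel can safely read what any block wrote in the first.

```cuda
// Sketch: splitting work across two launches to get device-wide
// visibility of global-memory writes. Names are illustrative.
__global__ void phase1(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;                       // every block writes its own slice
}

__global__ void phase2(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Safe: phase1 has fully completed, so writes made by ANY block
    // in phase1 are visible here, even across blocks.
    data[i] += data[(i + 1) % n];
}

void run(int *d_data, int n) {
    phase1<<<n / 256, 256>>>(d_data);
    // phase2 does not begin until phase1 has finished; the launch
    // boundary serves as the cross-block synchronization point.
    phase2<<<n / 256, 256>>>(d_data, n);
}
```

The cost is the kernel launch overhead and the loss of any state held in registers or shared memory between the two phases.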

In the second part of your post, you are referring to a specific example of the CUDA C Programming Guide, namely

__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x]+1;
}

__global__ void parent_launch(int *data) { 
    data[threadIdx.x] = threadIdx.x;

    __syncthreads();

    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();
    }

    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}

Here, all 256 threads of the parent_launch kernel write something in data. After that, thread 0 invokes child_launch. The first __syncthreads() is needed to ensure that all the memory writes have completed before the child kernel invocation. Quoting the guide on this point:

Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, ..., data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child).

Regarding the second __syncthreads(), the Guide explains that

When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.

In that specific example, the second __syncthreads() is redundant, since there is an implicit synchronization due to the kernel termination, but it becomes necessary when other operations must be performed after the child kernel launch.
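A hedged sketch of the case the answer alludes to (my extension of the guide's example, not code from the original post): if the parent keeps working with the child's results, the second __syncthreads() is what makes the child grid's writes visible to parent threads other than thread 0.

```cuda
// Variant of the guide's example where the parent continues after the
// child grid. The trailing __syncthreads() is no longer redundant.
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();                 // child must see all 256 writes

    if (threadIdx.x == 0) {
        child_launch<<<1, 256>>>(data);
        cudaDeviceSynchronize();     // thread 0 waits for the child grid
    }

    __syncthreads();                 // child's writes now visible to ALL
                                     // threads of the parent block

    // Without the second __syncthreads(), only thread 0 could safely
    // read the incremented values at this point.
    data[threadIdx.x] *= 2;
}
```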

Finally, concerning the sentence you are quoting in your post:

Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call

please note that in the specific example only one thread block is launched by the host_launch function. This may have somewhat misled you.
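To make the single-block caveat concrete, here is a hypothetical multi-block variant (not from the guide; the indexing in parent_launch would also need blockIdx.x to be meaningful with several blocks):

```cuda
// With several blocks, each block's second __syncthreads() only
// synchronizes the 256 threads of THAT block. Threads in the other
// blocks still have no guarantee of seeing the modifications made by
// another block's child grid.
void host_launch_multi(int *data) {
    parent_launch<<<4, 256>>>(data);
}
```

In other words, the guide's "other threads of the parent grid" wording works in the example only because the parent grid happens to consist of a single block.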

There is an interesting discussion (probably even more than one) on the NVIDIA Forum on thread synchronization across blocks entitled

Synchronize all blocks in CUDA
