CUDA synchronization and reading global memory


Problem Description



I have something like this:

__global__ void globFunction(int *arr, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // calculating and writing results to arr ...
    __syncthreads();
    // reading values of other threads (e.g. idx + 1)
    int val = arr[idx + 1]; // IT IS GIVING OLD VALUE
}


int main() {
    // declare array, alloc memory, copy memory, etc.
    globFunction<<<4000, 256>>>(arr, N);
    // do something ...
    return 0;
}

Why am I getting the old value when I read arr[idx+1]? I called __syncthreads, so I expect to see the updated value. What did I do wrong? Am I reading a cache or what?

Solution

Using the __syncthreads() function only synchronizes the threads in the current block. In this case this would be the 256 threads per block you created when you launched the kernel. So in your given array, for each index value that crosses over into another block of threads, you'll end up reading a value from global memory that is not synchronized with respect to the threads in the current block.

One thing you can do to circumvent this issue is to create shared thread-local storage using the __shared__ CUDA directive, which allows the threads in a block to share information among themselves but prevents threads from other blocks from accessing the memory allocated for the current block. Once your calculation within the block is complete (and you can use __syncthreads() for this task), you can then copy the values from the shared block-level storage back into globally accessible memory.

Your kernel could look something like:

__global__ void globFunction(int *arr, int N) 
{
    // THREADS_PER_BLOCK must be a compile-time constant equal to the block size
    __shared__ int local_array[THREADS_PER_BLOCK];  // local block memory cache
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // ...calculate results
    int results = 0;  // placeholder for this thread's computed value
    local_array[threadIdx.x] = results;

    // synchronize: wait until every thread in the block has written its value
    __syncthreads();

    // read the result of another thread in the current block
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];

    // write the value back to global memory
    arr[idx] = val;        
}
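For completeness, a matching host-side launch might look like the sketch below. The grid size and the `THREADS_PER_BLOCK` constant are assumptions for illustration; `THREADS_PER_BLOCK` must be a compile-time constant equal to the block size used in the launch configuration:

```cuda
#define THREADS_PER_BLOCK 256

int main() {
    const int numBlocks = 4000;               // assumed grid size
    const int N = numBlocks * THREADS_PER_BLOCK;
    int *d_arr;
    cudaMalloc(&d_arr, N * sizeof(int));      // device allocation
    globFunction<<<numBlocks, THREADS_PER_BLOCK>>>(d_arr, N);
    cudaDeviceSynchronize();                  // wait for the kernel to finish
    // ... cudaMemcpy the results back to the host, then:
    cudaFree(d_arr);
    return 0;
}
```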

If you must synchronize threads across blocks, you should be looking for another way to solve your problem, since the CUDA programming model works most effectively when a problem can be broken down into blocks, and thread synchronization only needs to take place within a block.
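If data really does need to flow between blocks, the standard workaround is to split the work into two kernel launches, because a kernel launch boundary acts as an implicit grid-wide synchronization point: all global-memory writes from one kernel are visible to the next. A minimal sketch under that assumption (the kernel names and the placeholder computation are illustrative, not from the original question):

```cuda
// Phase 1: every thread writes its own result to global memory.
__global__ void computePhase(int *arr, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) arr[idx] = idx * 2;  // placeholder computation
}

// Phase 2: launched afterwards, so it safely sees writes from *all* blocks.
__global__ void readPhase(const int *arr, int *out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx + 1 < N) out[idx] = arr[idx + 1];  // neighbor may be in another block
}

// Host side:
//   computePhase<<<blocks, threads>>>(d_arr, N);
//   readPhase<<<blocks, threads>>>(d_arr, d_out, N);
```

(On newer hardware, CUDA's cooperative groups API also offers a true grid-wide sync, but the two-kernel pattern above works everywhere.)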
