CUDA: streaming the same memory location to all threads


Question


Here's my problem: I have quite a big set of doubles (an array of 77,500 doubles) stored somewhere in CUDA device memory. Now, I need a big set of threads to sequentially perform a bunch of operations on that array. Every thread will have to read the SAME element of that array, perform tasks, store results in shared memory, and then read the next element of the array. Note that every thread will simultaneously have to read (just read) from the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading it many times would be quite wasteful... Any ideas?

Answer


This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:

// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(double *ptr)
{
  __shared__ double window[block_size];

  // cooperate with my block to load block_size elements
  window[threadIdx.x] = ptr[threadIdx.x];

  // wait until the window is full
  __syncthreads();

  // operate on the data
  ...
}


You can iteratively "slide" the window across the array block_size (or maybe some integer factor more) elements at a time to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.
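A minimal sketch of that sliding-window loop might look like the following. This is an illustration, not the answerer's exact code: the array length parameter `n` and the per-element work are assumptions, and the inner processing is left as a placeholder.

```cuda
// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(const double *ptr, unsigned int n)
{
  __shared__ double window[block_size];

  // slide the window across the array, block_size elements at a time
  for (unsigned int base = 0; base < n; base += block_size)
  {
    unsigned int idx = base + threadIdx.x;

    // cooperative load; guard the tail when n is not a multiple of block_size
    if (idx < n)
      window[threadIdx.x] = ptr[idx];

    // wait until the window is full
    __syncthreads();

    // every thread walks the window sequentially; when all threads in a
    // warp read the same shared-memory address, the hardware broadcasts
    // the value, so there are no bank conflicts
    unsigned int valid = min(block_size, n - base);
    for (unsigned int i = 0; i < valid; ++i)
    {
      double x = window[i];
      // ... perform this thread's work with x ...
      (void)x;
    }

    // don't overwrite the window while other threads may still be reading it
    __syncthreads();
  }
}
```

Note the second `__syncthreads()` before the next load: without it, a fast thread could start refilling `window` while slower threads are still consuming the previous contents.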

