While loop fails in CUDA kernel
Question
I am using GPU to do some calculation for processing words. Initially, I used one block (with 500 threads) to process one word. To process 100 words, I have to loop the kernel function 100 times in my main function.
for (int i = 0; i < 100; i++)
    kernel <<< 1, 500 >>> (length_of_word);
My kernel function looks like this:
__global__ void kernel (int *dev_length)
{
int length = *dev_length;
while (length > 4)
{ //do something;
length -=4;
}
}
Now I want to process all 100 words at the same time.
Each block will still have 500 threads, and processes one word (per block).
dev_totalwordarray: stores all characters of the words (one after another)
dev_length_array: stores the length of each word.
dev_accu_length: stores the cumulative length of the words (total characters of all previous words)
dev_salt_: an array of size 500, storing unsigned integers.
Hence, in my main function I have
kernel2 <<< 100, 500 >>> (dev_totalwordarray, dev_length_array, dev_accu_length, dev_salt_);
To fill the cpu array:
for (int i=0; i<wordnumber; i++)
{
int length=0;
while (word_list_ptr_array[i][length]!=0)
{
length++;
}
actualwordlength2[i] = length;
}
To copy from cpu -> gpu:
int* dev_array_of_word_length;
HANDLE_ERROR( cudaMalloc( (void**)&dev_array_of_word_length, 100 * sizeof(int) ) );
HANDLE_ERROR( cudaMemcpy( dev_array_of_word_length, actualwordlength2, 100 * sizeof(int), cudaMemcpyHostToDevice ) );
My function kernel now looks like this:
__global__ void kernel2 (char* dev_totalwordarray, int *dev_length_array, int* dev_accu_length, unsigned int* dev_salt_)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int hash[N];
int length = dev_length_array[blockIdx.x];
while (tid < 50000)
{
const char* itr = &(dev_totalwordarray[dev_accu_length[blockIdx.x]]);
hash[tid] = dev_salt_[threadIdx.x];
unsigned int loop = 0;
while (length > 4)
{ const unsigned int& i1 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
const unsigned int& i2 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
hash[tid] ^= (hash[tid] << 7) ^ i1 * (hash[tid] >> 3) ^ (~((hash[tid] << 11) + (i2 ^ (hash[tid] >> 5))));
length -=4;
}
tid += blockDim.x * gridDim.x;
}
}
However, kernel2 doesn't seem to work at all.
It looks like while (length > 4) is causing it.
Does anyone know why? Thanks.
Answer
I am not sure if the while is the culprit, but I see a few things in your code that worry me:

- Your kernel produces no output. The optimizer will most likely detect this and convert it to an empty kernel.
- In almost no situation do you want arrays allocated per-thread. That will consume a lot of memory. Your hash[N] table will be allocated per-thread and discarded at the end of the kernel. If N is big (and then multiplied by the total number of threads) you may run out of GPU memory. Not to mention that accessing hash will be almost as slow as accessing global memory.
- All threads in a block will have the same itr value. Is it intended?
- Every thread initializes only a single field within its own copy of the hash table.
- I see hash[tid] where tid is a global index. Be aware that even if hash were made global, you may hit concurrency problems. Not all blocks within a grid will run at the same time. While one block initializes a portion of hash, another block might not even start!
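To illustrate the first two points, here is a minimal sketch of how the kernel could be restructured: the per-thread hash[N] array becomes a single register variable, and the result is written to a hypothetical output buffer dev_hash_out (one slot per thread, assumed to be allocated by the caller) so the optimizer cannot discard the work. This is a sketch under those assumptions, not the asker's final code:

```cuda
// Sketch: one word per block, one hash slot per thread.
// dev_hash_out is an assumed output buffer of gridDim.x * blockDim.x
// unsigned ints, cudaMalloc'd by the host.
__global__ void kernel2_sketch (const char* dev_totalwordarray,
                                const int* dev_length_array,
                                const int* dev_accu_length,
                                const unsigned int* dev_salt_,
                                unsigned int* dev_hash_out)
{
    int word = blockIdx.x;                       // this block's word
    const char* itr = &dev_totalwordarray[dev_accu_length[word]];

    unsigned int hash = dev_salt_[threadIdx.x];  // register, not hash[N]
    int length = dev_length_array[word];
    while (length > 4)
    {
        const unsigned int i1 = *reinterpret_cast<const unsigned int*>(itr); itr += sizeof(unsigned int);
        const unsigned int i2 = *reinterpret_cast<const unsigned int*>(itr); itr += sizeof(unsigned int);
        hash ^= (hash << 7) ^ i1 * (hash >> 3)
              ^ (~((hash << 11) + (i2 ^ (hash >> 5))));
        length -= 4;
    }
    // Write the result to global memory so it survives the kernel.
    dev_hash_out[blockIdx.x * blockDim.x + threadIdx.x] = hash;
}
```

Because the kernel now has an observable side effect, the compiler must keep the hashing loop, and each thread's result can be copied back with cudaMemcpy and inspected on the host.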