While loop fails in CUDA kernel


Question

I am using a GPU to do some calculation for processing words. Initially, I used one block (with 500 threads) to process one word. To process 100 words, I have to loop the kernel function 100 times in my main function.

for (int i=0; i<100; i++)
    kernel <<< 1, 500 >>> (length_of_word); 

My kernel function looks like this:

__global__ void kernel (int *dev_length)
{
    int length = *dev_length;
    while (length > 4)
    {
        // do something;
        length -= 4;
    }
}

Now I want to process all 100 words at the same time.

Each block will still have 500 threads, and processes one word (per block).

dev_totalwordarray: stores all characters of the words (one after another)

dev_length_array: stores the length of each word.

dev_accu_length: stores the accumulative length of each word (total characters of all previous words)

dev_salt_: an array of size 500, storing unsigned integers.
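As described, dev_accu_length is an exclusive prefix sum over the per-word lengths. A minimal host-side sketch of how it could be built (the helper name build_accu_length is an assumption, not part of the question):

```cpp
#include <vector>
#include <cstddef>

// Hypothetical helper: given the per-word lengths, build the exclusive
// prefix sums that dev_accu_length is described as holding (total
// characters of all words before word i, i.e. word i's offset inside
// dev_totalwordarray).
std::vector<int> build_accu_length(const std::vector<int>& lengths)
{
    std::vector<int> accu(lengths.size());
    int total = 0;
    for (std::size_t i = 0; i < lengths.size(); ++i) {
        accu[i] = total;      // offset of word i
        total += lengths[i];
    }
    return accu;
}
```

The host array built this way would then be copied to the device the same way as the length array below.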

Hence, in my main function I have

   kernel2 <<< 100, 500 >>> (dev_totalwordarray, dev_length_array, dev_accu_length, dev_salt_);

To fill the cpu array:

    for (int i=0; i<wordnumber; i++)
    {
        int length=0;
        while (word_list_ptr_array[i][length]!=0)
        {
            length++;
        }

        actualwordlength2[i] = length;
    }

To copy from cpu -> gpu:

    int* dev_array_of_word_length;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_array_of_word_length, 100 * sizeof(int) ) );
    HANDLE_ERROR( cudaMemcpy( dev_array_of_word_length, actualwordlength2, 100 * sizeof(int),
                              cudaMemcpyHostToDevice ) );
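The HANDLE_ERROR macro is not defined anywhere in the question; it matches the error-checking pattern popularized by the book "CUDA by Example". A typical definition, shown here as an assumption about what the question's macro does:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed definition: print the CUDA error string with file/line
// context and abort, so a failed cudaMalloc/cudaMemcpy is not ignored.
static void HandleError(cudaError_t err, const char* file, int line)
{
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))
```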

My kernel function now looks like this:

__global__ void kernel2 (char* dev_totalwordarray, int *dev_length_array, int* dev_accu_length, unsigned int* dev_salt_)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int hash[N];

    int length = dev_length_array[blockIdx.x];

    while (tid < 50000)
    {
        const char* itr = &(dev_totalwordarray[dev_accu_length[blockIdx.x]]);
        hash[tid] = dev_salt_[threadIdx.x];
        unsigned int loop = 0;

        while (length > 4)
        {
            const unsigned int& i1 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
            const unsigned int& i2 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
            hash[tid] ^= (hash[tid] << 7) ^ i1 * (hash[tid] >> 3) ^ (~((hash[tid] << 11) + (i2 ^ (hash[tid] >> 5))));
            length -= 4;
        }
        tid += blockDim.x * gridDim.x;
    }
}

However, kernel2 doesn't seem to work at all.

It seems that the while (length > 4) is causing this.

Does anyone know why? Thanks.

Answer

I am not sure if the while is the culprit, but I see a few things in your code that worry me:

  • Your kernel produces no output. The optimizer will most likely detect this and convert it into an empty kernel.
  • In almost no situation do you want arrays allocated per-thread. That will consume a lot of memory. Your hash[N] table will be allocated per-thread and discarded at the end of the kernel. If N is big (and then multiplied by the total number of threads) you may run out of GPU memory. Not to mention that accessing hash will be almost as slow as accessing global memory.
  • All threads in a block will have the same itr value. Is that intended?
  • Every thread initializes only a single field within its own copy of the hash table.
  • I see hash[tid] where tid is a global index. Be aware that even if hash were made global, you could hit concurrency problems. Not all blocks within a grid run at the same time. While one block is initializing a portion of hash, another block might not even have started!
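A minimal sketch of how the points above might be addressed: each thread keeps a scalar hash seeded from its own salt slot (indexed by threadIdx.x, not the global tid), each block reads its word's length and offset itself, and the result is written to an output buffer so the optimizer cannot discard the work. The output parameter dev_hash_out and the kernel name kernel2_fixed are assumptions, not from the question:

```cuda
// Hedged sketch, not the asker's code: one word per block, one hash
// value per thread, results written out so the kernel has an
// observable effect. dev_hash_out is assumed to hold
// gridDim.x * blockDim.x (here 100 * 500) unsigned ints.
__global__ void kernel2_fixed (char* dev_totalwordarray, int* dev_length_array,
                               int* dev_accu_length, unsigned int* dev_salt_,
                               unsigned int* dev_hash_out)
{
    unsigned int hash = dev_salt_[threadIdx.x];   // per-thread scalar, no hash[N] array

    // Each block processes its own word: start at the word's offset
    // and re-read its length, instead of reusing a consumed length.
    const char* itr = &dev_totalwordarray[dev_accu_length[blockIdx.x]];
    int length = dev_length_array[blockIdx.x];

    while (length > 4)
    {
        const unsigned int i1 = *reinterpret_cast<const unsigned int*>(itr); itr += sizeof(unsigned int);
        const unsigned int i2 = *reinterpret_cast<const unsigned int*>(itr); itr += sizeof(unsigned int);
        hash ^= (hash << 7) ^ i1 * (hash >> 3) ^ (~((hash << 11) + (i2 ^ (hash >> 5))));
        length -= 4;
    }

    // Observable output: without this store the compiler may legally
    // reduce the whole kernel to a no-op.
    dev_hash_out[blockIdx.x * blockDim.x + threadIdx.x] = hash;
}
```

Note this keeps the question's launch shape (kernel2_fixed<<<100, 500>>>) and drops the grid-stride while (tid < 50000) loop, since with 100 blocks of 500 threads every (word, salt) pair is already covered by exactly one thread.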
