Coding a CUDA Kernel that has many threads writing to the same index?


Problem Description



I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.

So here is the kernel code, and I'll try to explain the variables a bit more clearly below.

__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if(index_in < cLength)
    {
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}

First off, the number of connections in the network is cLength. For every connection, there is a source neuron and a target neuron, as well as a weight for that connection. sourceTargetArray contains that information: element i of sourceTargetArray holds the source neuron index (.x) and the target neuron index (.y) of connection i. The weightArray contains the weight information (so index i of weightArray corresponds to connection i).

As you can see, sumArray is where I'm storing the sums. So the kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively, for all the incoming connections to a neuron, it sums all of their weights. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.
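For reference, the sum the kernel is meant to produce is just this serial loop (a sketch, assuming host-side copies of the same three arrays):

for (int i = 0; i < cLength; ++i)
    sumArray[sourceTargetArray[i].y] += fabsf(weightArray[i]);   // accumulate |weight| into the target neuron's slot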

The problem is that it's wrong. I've done this serially, and the answer is different. The answers differ, usually by about 12-15x (so when the right answer is 700.0, what I'm getting is something in the 50s range).

You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?

EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, Memory Fence Functions.

Solution

You need to do a reduction. The += on sumArray is a non-atomic read-modify-write, so when many threads update the same target index at once most of the increments overwrite each other; __threadfence()/__threadfence_block() only order memory operations, they don't make them atomic.

Sum the elements assigned to each thread and place the result in a shared-memory array, cache[threadsPerBlock], then call __syncthreads(), as in the sketch below.
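The first phase might look like the following (a sketch of the generic pattern, not the original kernel; inputArray and N are placeholder names for whatever is being summed):

__shared__ float cache[threadsPerBlock];   // threadsPerBlock must be a compile-time constant

// Each thread accumulates a grid-strided subset of the input into a register,
// then parks its partial sum in shared memory for the block-wide reduction.
float temp = 0.0f;
for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N; i += blockDim.x * gridDim.x)
    temp += inputArray[i];

cache[threadIdx.x] = temp;
__syncthreads();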

Now reduce the resulting subtotals by adding successive neighboring subtotals:

int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;
while (i != 0)
{
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];   // fold the upper half onto the lower half
    __syncthreads();                                  // every thread must finish this round before i is halved
    i /= 2;
}
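After the loop, cache[0] holds the block's total; typically thread 0 then writes it out to a per-block result array (blockSums here is a hypothetical output buffer):

if (cacheIndex == 0)
    blockSums[blockIdx.x] = cache[0];   // one thread per block publishes the block's sum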

The following deck explains this in some detail:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

Sample code for this is here:

http://www.nvidia.com/object/cuda_sample_data-parallel.html

It's also very well explained in "CUDA by Example" (which is where the code fragment comes from).

There is one big caveat with this approach. The additions will not occur in the same order they would with serial code. Addition of floats is not associative, so rounding errors may lead to slightly different results.
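A tiny host-side illustration of that ordering sensitivity (plain C; the values are chosen so the rounding is visible):

#include <stdio.h>

int main(void)
{
    /* 1.0f is absorbed when added to -1e8f, but survives when added to 0.0f,
       so the two groupings of the same three numbers print different sums. */
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("%f\n", (a + b) + c);   /* prints 1.000000 */
    printf("%f\n", a + (b + c));   /* prints 0.000000 */
    return 0;
}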
