Cuda Kernel with reduction - logic errors for dot product of 2 matrices

Problem Description

I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices, but I am getting the right answer only for matrices of size 2. For any other matrix size, I get the wrong result.

This is only a test, so I am keeping the matrix size very small (only about 100, so a single block fits it all). Any help would be greatly appreciated. Thanks!

Here is the regular (host) code:

float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b

float dx = hc[0]; // device result (hc[0] is copied back from the GPU in the full code)
float hx = 0;     // host reference result
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];

Here is my CUDA kernel:

__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}

My complete code : http://pastebin.com/zS85URX5
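Since the complete code is only behind that link, here is a minimal sketch of the host-side setup the snippets above imply: a single block of n threads, with da/db/dc as the device copies of ha/hb/hc. The test values are made up for illustration and may differ from the pastebin.

// Host-side harness (a sketch, not the actual pastebin code).
// Assumes n <= max threads per block, matching the single-block design.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_reduce(float* da, float* db, float* dc, int n); // kernel above

int main()
{
    const int n = 100;
    float* ha = new float[n]; // matrix a
    float* hb = new float[n]; // matrix b
    float* hc = new float[1]; // sum of a.b
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = (float)i; }

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float)); // dc needs n floats: the kernel writes dc[tid]
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    sum_reduce<<<1, n>>>(da, db, dc, n); // one block of n threads
    cudaMemcpy(hc, dc, sizeof(float), cudaMemcpyDeviceToHost);
    float dx = hc[0]; // device result

    float hx = 0; // host reference
    for (int i = 0; i < n; i++) hx += ha[i] * hb[i];
    printf("device: %f  host: %f\n", dx, hx);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}

Note that with the posted kernel, da[tid + stride] can also read past the end of the arrays when n is not a power of two (e.g. n = 100), which is one more symptom of the logic problem the answer below addresses.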

Solution

Hopefully you can figure out why it works for the n=2 case, so let's skip that and take a look at why it fails for some other case; let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.

In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.

thread 0:   dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2:   dc[2] += da[2]*db[2] + da[3]*db[3];

So far so good. In the second iteration of your for loop, stride is 2, so the thread that passes the if test is thread 0 (only).

thread 0:   dc[0] += da[0]*db[0] + da[2]*db[2]; 

But this doesn't make sense and is not what we want at all: the second term re-reads the raw products instead of accumulating the partial sums, so da[0]*db[0] is counted twice, dc[2] is never folded in, and da[3]*db[3] is lost. With da = {1, 2, 3, 4} and db all ones, dc[0] ends up 7 instead of the correct 10. What we want is something like:

dc[0] += dc[2];
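To see that concretely, here is a tiny host-side replay of the updates the posted kernel performs for n = 4 (a sketch with made-up data, not part of the original question):

#include <cstdio>

int main()
{
    const int n = 4;
    float da[n] = {1, 2, 3, 4}, db[n] = {1, 1, 1, 1}, dc[n] = {0};
    // replay the posted kernel's per-thread updates sequentially
    for (int stride = 1; stride < n; stride *= 2)
        for (int tid = 0; tid < n; tid++)
            if (tid % (2 * stride) == 0)
                dc[tid] += (da[tid] * db[tid]) + (da[tid + stride] * db[tid + stride]);
    printf("dc[0] = %g (correct dot product is 10)\n", dc[0]); // prints dc[0] = 7
    return 0;
}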

So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n < max thread block size, using a single block):

// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1) {
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}

Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
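As a usage note, a single-block launch along these lines (a sketch; the pointer names follow the question) is what makes blockDim.x equal to n, and dc must still be allocated with n floats because every thread writes dc[tid] in the multiply step:

// Sketch of the launch for the fixed kernel (assumes n <= max threads per block).
sum_reduce<<<1, n>>>(da, db, dc, n);
cudaMemcpy(hc, dc, sizeof(float), cudaMemcpyDeviceToHost); // dot product lands in dc[0]

// For n = 6 the loop runs: stride 6 -> 3 (fold upper half onto lower half),
// then at stride 3 the odd step adds dc[2] into dc[0], stride -> 1,
// and the final fold adds dc[1] into dc[0].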
