Cuda Kernel with reduction - logic errors for dot product of 2 matrices
Question
I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices, but I am getting the right answer only for matrices of size 2. For any other matrix size, I am getting it wrong.
This is only a test, so I am keeping the matrix size very small, only about 100, so a single block fits it all.
Any help would be greatly appreciated. Thanks!
Here is the regular code:
float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b
float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];
Here is my cuda kernel:
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}
My complete code: http://pastebin.com/zS85URX5
Hopefully you can figure out why it works for the n=2 case, so let's skip that and take a look at why it fails for some other case; let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3. In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2:

thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];

So far so good. In the second iteration of the for-loop, stride is 2, so the thread that passes the if test is thread 0 (only):

thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];

But this doesn't make sense and is not what we want at all. What we want is something like:

dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1) {
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
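For completeness, a hedged sketch of how such a single-block launch typically looks on the host (the d_a/d_b/d_c names are placeholders of mine; the allocation and copy calls are the standard CUDA runtime API, but see the pastebin link above for the actual host setup used here):

```cuda
// Hypothetical host-side setup and launch; ha, hb, hc, n as in the question.
float *d_a, *d_b, *d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, n * sizeof(float));
cudaMemcpy(d_a, ha, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, hb, n * sizeof(float), cudaMemcpyHostToDevice);

// One block of n threads, so blockDim.x == n inside the kernel.
sum_reduce<<<1, n>>>(d_a, d_b, d_c, n);

// The reduced dot product ends up in dc[0] on the device.
cudaMemcpy(hc, d_c, sizeof(float), cudaMemcpyDeviceToHost);
```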