Cuda Kernel with reduction - logic errors for dot product of 2 matrices
Question
I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices, but I am getting the right answer only for matrices of size 2. For any other matrix size, I am getting it wrong.
This is only a test, so I am keeping the matrix size very small, only about 100, so a single block fits it all.
Any help would be greatly appreciated. Thanks!
Here is the regular code:
float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b
float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];
Here is my cuda kernel:
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}
My complete code: http://pastebin.com/zS85URX5
Hopefully you can figure out why it works for the n=2 case, so let's skip that and take a look at why it fails for some other case; let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3. In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2:

thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];

So far so good. In the second iteration of the for-loop, stride is 2, so the thread that passes the if test is thread 0 (only):

thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];

But this doesn't make sense and is not what we want at all. What we want is something like:

dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1) {
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
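For completeness, a hedged sketch of how such a single-block launch typically looks on the host (the d_a/d_b/d_c names are placeholders of mine; the allocation and copy calls are the standard CUDA runtime API, but see the pastebin link above for the actual host setup used here):

```cuda
// Hypothetical host-side setup and launch; ha, hb, hc, n as in the question.
float *d_a, *d_b, *d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, n * sizeof(float));
cudaMemcpy(d_a, ha, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, hb, n * sizeof(float), cudaMemcpyHostToDevice);

// One block of n threads, so blockDim.x == n inside the kernel.
sum_reduce<<<1, n>>>(d_a, d_b, d_c, n);

// The reduced dot product ends up in dc[0] on the device.
cudaMemcpy(hc, d_c, sizeof(float), cudaMemcpyDeviceToHost);
```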