CUDA dot product


Question

I was working through a CUDA tutorial in which I have to compute the dot product of two vectors. After implementing the solution provided in the tutorial, I came across some issues that were solved in this Stack Overflow post. Now I get the answer 0 regardless of what I do. Below you can find the code.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "device_atomic_functions.h"
#include <stdio.h>
#include <stdlib.h>
#define N (2048 * 8)
#define THREADS_PER_BLOCK 512

__global__ void dot(int *a, int *b, int *c)
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    __syncthreads();

    if (threadIdx.x == 0)
    {
        int sum = 0;
        for (int i = 0; i < N; i++)
        {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
}

int main()
{
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof(int);

   //allocate space for the variables on the device
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));

   //allocate space for the variables on the host
   a = (int *)malloc(size);
   b = (int *)malloc(size);
   c = (int *)malloc(sizeof(int));

   //this is our ground truth
   int sumTest = 0;
   //generate numbers
   for (int i = 0; i < N; i++)
   {
       a[i] = rand() % 10;
       b[i] = rand() % 10;
       sumTest += a[i] * b[i];
       printf(" %d %d \n",a[i],b[i]);
   }

   *c = 0;

   cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
   cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
   cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

   dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

   cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

   printf("%d ", *c);
   printf("%d ", sumTest);

   free(a);
   free(b);
   free(c);

   cudaFree(a);
   cudaFree(b);
   cudaFree(c);

   system("pause");

   return 0;

 }

Recommended answer

First, just before the kernel launch, the following line copies size bytes (N * sizeof(int)) into dev_c, which was allocated to hold a single int:

cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

It should be:

cudaMemcpy(dev_c, c, sizeof(int), cudaMemcpyHostToDevice);
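A minimal sketch of how checking return codes could surface this kind of mistake. The CHECK macro and the roundtrip_int function are hypothetical helpers introduced here for illustration; they are not part of the question or of the CUDA API. The runtime calls themselves (cudaMalloc, cudaMemcpy, cudaFree, cudaGetErrorString) are the standard ones. The sketch copies exactly sizeof(int) bytes into a one-int allocation, so the copy size matches the allocation size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): abort with a message
// when a CUDA runtime call does not return cudaSuccess.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical function: sends one int to the device and reads it back.
// Every copy uses sizeof(int), the same size that was allocated.
int roundtrip_int(int value)
{
    int *dev_c = NULL;
    int result = -1;

    CHECK(cudaMalloc((void **)&dev_c, sizeof(int)));
    CHECK(cudaMemcpy(dev_c, &value, sizeof(int), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(&result, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dev_c));

    return result;
}
```

Calling roundtrip_int(0) from main mirrors what the question does with *c = 0. Wrapping the original calls in a check like this would likely have reported the oversized copy instead of letting the program continue silently.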

Another error is inside the kernel: the __shared__ array temp is accessed out of bounds in the for loop. The array has THREADS_PER_BLOCK elements, but the loop iterates up to N. Replace N with THREADS_PER_BLOCK in the loop.
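Putting both fixes together, the kernel and its host-side setup might look like the sketch below. The dot_on_device wrapper is a hypothetical name introduced here to keep the example compact; the kernel body is the one from the question with only the loop bound changed. The sketch also passes the device pointers to cudaFree, whereas the posted code passes the host pointers a, b and c. Error checking is omitted for brevity, and N is assumed to be a multiple of THREADS_PER_BLOCK, as it is in the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512

__global__ void dot(const int *a, const int *b, int *c)
{
    // One partial product per thread in this block
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    // Wait until every thread in the block has written its product
    __syncthreads();

    if (threadIdx.x == 0)
    {
        int sum = 0;
        // Loop over the shared array only: THREADS_PER_BLOCK elements, not N
        for (int i = 0; i < THREADS_PER_BLOCK; i++)
        {
            sum += temp[i];
        }
        // Combine the per-block sums into the single global result
        atomicAdd(c, sum);
    }
}

// Hypothetical host wrapper (not from the original post): computes the dot
// product of two N-element host arrays on the device and returns it.
int dot_on_device(const int *a, const int *b)
{
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof(int);
    int c = 0;

    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    // dev_c holds a single int, so copy sizeof(int) bytes, not size
    cudaMemcpy(dev_c, &c, sizeof(int), cudaMemcpyHostToDevice);

    dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    // This blocking copy also waits for the kernel to finish
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    // Free the device allocations through their device pointers
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return c;
}
```

A main function can fill a and b as in the question, call dot_on_device(a, b), and compare the result with the sumTest value accumulated on the host.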
