CUDA kernel supposed to be dynamic crashes depending upon block size

Problem description

I want to do a sparse matrix, dense vector multiplication. Let's assume the only storage format for compressing the entries in the matrix is compressed row storage (CRS).
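For reference, here is a tiny sketch of what CRS storage looks like (the example matrix, its values, and the col_idx array are made up purely for illustration; note that the kernel below never reads column indices, which is part of why vector_elem is not always the correct value):

// Illustrative 3x4 matrix:
//     | 10  0  0 20 |
//     |  0 30  0  0 |
//     | 40  0 50 60 |
float        data_mat[6] = { 10, 20, 30, 40, 50, 60 };  // nonzero values, stored row by row
unsigned int row_ptr[4]  = { 0, 2, 3, 6 };              // row i owns entries [row_ptr[i], row_ptr[i+1])
unsigned int col_idx[6]  = { 0, 3, 1, 0, 2, 3 };        // column of each nonzero (hypothetical here, unused by the kernel below)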

My kernel looks like this:

__global__ void
krnlSpMVmul1(
        float *data_mat,
        int num_nonzeroes,
        unsigned int *row_ptr,
        float *data_vec,
        float *data_result)
{
    extern __shared__ float local_result[];
    local_result[threadIdx.x] = 0;

    float vector_elem = data_vec[blockIdx.x];

    unsigned int start_index = row_ptr[blockIdx.x];
    unsigned int end_index = row_ptr[blockIdx.x + 1];

    for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
        local_result[threadIdx.x] += (data_mat[index] * vector_elem);

    __syncthreads();

    // Reduction

    // Writing accumulated sum into result vector
}

As you can see, the kernel is meant to be as naive as possible, and it even does a few things wrong (e.g. vector_elem is not always the correct value). I am aware of those things.

Now to my problem: suppose I am using a block size of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17), only the first 16 multiplications are done and saved to shared memory. I know that the value at local_result[16], which should hold the result of the 17th multiplication, is just zero. Using a block size of 16 or 128 threads fixes the described problem.

Since I am fairly new to CUDA, I might have overlooked the simplest thing, but I cannot think of any more situations to look at.

Thanks a lot for your help!

Edit regarding the comments:

I printed the value of local_result[16] directly after the computation; it was 0. Nevertheless, here is the missing code:

The reduction part:

int k = blockDim.x / 2;
while (k != 0)
{
    if (threadIdx.x < k)
        local_result[threadIdx.x] += local_result[threadIdx.x + k];
    else
        return;

    __syncthreads();

    k /= 2;
}

And here is how I write the results back to global memory:

data_result[blockIdx.x] = local_result[0];

That is all I have.

Right now I am testing a scenario with a matrix consisting of a single row with 17 elements, all of which are nonzeroes. The buffers look like this in pseudocode:

float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }

And here is an excerpt of my wrapper function:

float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;

// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));

// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
        dev_data_mat,
        data_mat,
        num_nonzeroes * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_row_ptr,
        row_ptr,
        num_row_ptr * sizeof(unsigned int),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_vec,
        data_vec,
        dim_x * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_result,
        data_result,
        dim_y * sizeof(float),
        cudaMemcpyHostToDevice));

// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);

// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
        dev_data_mat,
        num_nonzeroes,
        dev_row_ptr,
        dev_data_vec,
        dev_data_result);

I hope this is straightforward, but I will explain things if there is any interest.
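The excerpt stops right after the launch; a minimal sketch of the copy-back and cleanup that would typically follow, assuming the same buffer names and dimensions, looks like this:

// Copy the result vector back to the host
HANDLE_ERROR(cudaMemcpy(
        data_result,
        dev_data_result,
        dim_y * sizeof(float),
        cudaMemcpyDeviceToHost));

// Release the device buffers
HANDLE_ERROR(cudaFree(dev_data_mat));
HANDLE_ERROR(cudaFree(dev_row_ptr));
HANDLE_ERROR(cudaFree(dev_data_vec));
HANDLE_ERROR(cudaFree(dev_data_result));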

One more thing: I just realized that using a BLOCK_SIZE of 128 with 33 nonzeroes makes the kernel fail as well. Again, only the last value is not being computed.

Answer

Your dynamically allocated shared memory size is incorrect. Right now you are doing this:

krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)

The shared memory size should be given in bytes. In your 64-threads-per-block case, that means you are allocating only enough shared memory for 16 float-sized words, which explains why the magic 17-entries-per-row case results in failure: you have a shared buffer overflow, which triggers a protection fault in the GPU and aborts the kernel.
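To spell out the arithmetic for the 64-thread case (a sketch, assuming BLOCK_SIZE is the same macro used in the launch above):

// <<<grid_dim, block_dim, BLOCK_SIZE>>> with BLOCK_SIZE == 64 requests
// 64 bytes of dynamic shared memory, i.e. 64 / sizeof(float) == 16 floats,
// so local_result[16] (needed for the 17th nonzero) is already out of bounds.
// The kernel needs one float per thread:
size_t shared_mem_bytes = BLOCK_SIZE * sizeof(float);  // 64 * 4 == 256 bytes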

You should be doing this instead:

krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)

That will give you the correct dynamic shared memory size and should eliminate the problem.
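As a side note, launch failures of this kind can be surfaced at runtime by checking the error state right after the launch and again after synchronizing; a minimal sketch, reusing the HANDLE_ERROR macro from the question:

// Launch with the shared memory size given in bytes (one float per thread)
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(
        dev_data_mat,
        num_nonzeroes,
        dev_row_ptr,
        dev_data_vec,
        dev_data_result);

// Catch configuration errors from the launch itself ...
HANDLE_ERROR(cudaGetLastError());

// ... and execution errors such as the fault caused by an overflow
HANDLE_ERROR(cudaDeviceSynchronize());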
