CUDA kernel supposed to be dynamic crashes depending upon block size

Problem description

I want to do a sparse matrix, dense vector multiplication. Let's assume the only storage format for compressing the entries in the matrix is compressed row storage (CRS).
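For reference, here is a tiny sketch of what CRS storage looks like (the example matrix, its values, and the col_idx array are made up purely for illustration; note that the kernel below never reads column indices, which is part of why vector_elem is not always the correct value):

// Illustrative 3x4 matrix:
//     | 10  0  0 20 |
//     |  0 30  0  0 |
//     | 40  0 50 60 |
float        data_mat[6] = { 10, 20, 30, 40, 50, 60 };  // nonzero values, stored row by row
unsigned int row_ptr[4]  = { 0, 2, 3, 6 };              // row i owns entries [row_ptr[i], row_ptr[i+1])
unsigned int col_idx[6]  = { 0, 3, 1, 0, 2, 3 };        // column of each nonzero (hypothetical here, unused by the kernel below)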

My kernel looks like this:

__global__ void
krnlSpMVmul1(
        float *data_mat,
        int num_nonzeroes,
        unsigned int *row_ptr,
        float *data_vec,
        float *data_result)
{
    extern __shared__ float local_result[];
    local_result[threadIdx.x] = 0;

    float vector_elem = data_vec[blockIdx.x];

    unsigned int start_index = row_ptr[blockIdx.x];
    unsigned int end_index = row_ptr[blockIdx.x + 1];

    for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
        local_result[threadIdx.x] += (data_mat[index] * vector_elem);

    __syncthreads();

    // Reduction

    // Writing accumulated sum into result vector
}

As you can see, the kernel is meant to be as naive as possible, and it even does a few things wrong (e.g. vector_elem is not always the correct value). I am aware of those things.

Now to my problem: suppose I am using a block size of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17), only the first 16 multiplications are done and saved to shared memory. I know that the value at local_result[16], which should hold the result of the 17th multiplication, is just zero. Using a block size of 16 or 128 threads fixes the described problem.

Since I am fairly new to CUDA, I might have overlooked the simplest thing, but I cannot think of any more situations to look at.

Thanks a lot for your help!

Edit regarding the comments:

I printed the value of local_result[16] directly after the computation; it was 0. Nevertheless, here is the missing code:

The reduction part:

int k = blockDim.x / 2;
while (k != 0)
{
    if (threadIdx.x < k)
        local_result[threadIdx.x] += local_result[threadIdx.x + k];
    else
        return;

    __syncthreads();

    k /= 2;
}

And here is how I write the results back to global memory:

data_result[blockIdx.x] = local_result[0];

That is all I have.

Right now I am testing a scenario with a matrix consisting of a single row with 17 elements, all of which are nonzeroes. The buffers look like this in pseudocode:

float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }

And here is an excerpt of my wrapper function:

float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;

// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));

// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
        dev_data_mat,
        data_mat,
        num_nonzeroes * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_row_ptr,
        row_ptr,
        num_row_ptr * sizeof(unsigned int),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_vec,
        data_vec,
        dim_x * sizeof(float),
        cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
        dev_data_result,
        data_result,
        dim_y * sizeof(float),
        cudaMemcpyHostToDevice));

// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);

// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
        dev_data_mat,
        num_nonzeroes,
        dev_row_ptr,
        dev_data_vec,
        dev_data_result);

I hope this is straightforward, but I will explain things if there is any interest.
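The excerpt stops right after the launch; a minimal sketch of the copy-back and cleanup that would typically follow, assuming the same buffer names and dimensions, looks like this:

// Copy the result vector back to the host
HANDLE_ERROR(cudaMemcpy(
        data_result,
        dev_data_result,
        dim_y * sizeof(float),
        cudaMemcpyDeviceToHost));

// Release the device buffers
HANDLE_ERROR(cudaFree(dev_data_mat));
HANDLE_ERROR(cudaFree(dev_row_ptr));
HANDLE_ERROR(cudaFree(dev_data_vec));
HANDLE_ERROR(cudaFree(dev_data_result));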

One more thing: I just realized that using a BLOCK_SIZE of 128 with 33 nonzeroes makes the kernel fail as well. Again, only the last value is not being computed.

Answer

Your dynamically allocated shared memory size is incorrect. Right now you are doing this:

krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)

The shared memory size should be given in bytes. In your 64-threads-per-block case, that means you are allocating only enough shared memory for 16 float-sized words, which explains why the magic 17-entries-per-row case results in failure: you have a shared buffer overflow, which triggers a protection fault in the GPU and aborts the kernel.
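To spell out the arithmetic for the 64-thread case (a sketch, assuming BLOCK_SIZE is the same macro used in the launch above):

// <<<grid_dim, block_dim, BLOCK_SIZE>>> with BLOCK_SIZE == 64 requests
// 64 bytes of dynamic shared memory, i.e. 64 / sizeof(float) == 16 floats,
// so local_result[16] (needed for the 17th nonzero) is already out of bounds.
// The kernel needs one float per thread:
size_t shared_mem_bytes = BLOCK_SIZE * sizeof(float);  // 64 * 4 == 256 bytes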

You should be doing this instead:

krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)

That will give you the correct dynamic shared memory size and should eliminate the problem.
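As a side note, launch failures of this kind can be surfaced at runtime by checking the error state right after the launch and again after synchronizing; a minimal sketch, reusing the HANDLE_ERROR macro from the question:

// Launch with the shared memory size given in bytes (one float per thread)
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(
        dev_data_mat,
        num_nonzeroes,
        dev_row_ptr,
        dev_data_vec,
        dev_data_result);

// Catch configuration errors from the launch itself ...
HANDLE_ERROR(cudaGetLastError());

// ... and execution errors such as the fault caused by an overflow
HANDLE_ERROR(cudaDeviceSynchronize());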
