CUDA Matrix Addition Timings, By Row Vs. By Column

Problem Description

I am currently learning CUDA, and am working through some exercises. One of them is to implement kernels that add matrices in 3 different ways: 1 thread per element, 1 thread per row, and 1 thread per column. The matrices are square, and are implemented as 1D vectors that I simply index into with

A[N*row + col]
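
The first kernel (1 thread per element) is not shown below; a minimal sketch of what it might look like, assuming a 2D grid of 2D blocks (the kernel name and launch shape are my own, not from the question):

__global__
void matAddKernel1(float* A, float* B, float* C, int N)
{
    // Each thread computes exactly one output element.
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < N && col < N)
    {
        C[N*row + col] = A[N*row + col] + B[N*row + col];
    }
}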



Intuitively, I expected the first option to be the slowest due to thread overhead, and the second to be the fastest, since a single thread would be working on adjacent data.

On the CPU, with dense 8000 x 8000 matrices, I get:

Adding on CPU - Adding down columns
Compute Time Taken: 2.21e+00 s
Adding on CPU - Adding across rows
Compute Time Taken: 2.52e-01 s
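
For reference, the two CPU versions presumably differ only in loop order; a sketch, given the same row-major A[N*row + col] layout (the exact CPU code is not in the question):

// Adding across rows: the inner loop walks contiguous memory (cache friendly).
for (int row = 0; row < N; row++)
    for (int col = 0; col < N; col++)
        C[N*row + col] = A[N*row + col] + B[N*row + col];

// Adding down columns: the inner loop strides by N floats between
// accesses, so most of them miss the cache.
for (int col = 0; col < N; col++)
    for (int row = 0; row < N; row++)
        C[N*row + col] = A[N*row + col] + B[N*row + col];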

So adding across rows is about an order of magnitude faster, due to many more cache hits. On the GPU with the same matrices I get:

Adding one element per thread 
Compute Time Taken: 7.42e-05 s
Adding one row per thread 
Compute Time Taken: 2.52e-05 s
Adding one column per thread 
Compute Time Taken: 1.57e-05 s

Which is non-intuitive to me. The 30-40% speedup for the last case is consistent above roughly 1000 x 1000 matrices. Note that these timings are only the kernel execution, and don't include the data transfer between host and device. Below are my two kernels for comparison.

__global__
void matAddKernel2(float* A, float* B, float* C, int N)
{
    // One thread per row: each thread sweeps across its own row,
    // touching consecutive addresses from its own point of view.
    int row = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < N)
    {
        for (int j = 0; j < N; j++)
        {
            C[N*row + j] = A[N*row + j] + B[N*row + j];
        }
    }
}



__global__
void matAddKernel3(float* A, float* B, float* C, int N)
{
    // One thread per column: each thread walks down its own column,
    // striding N floats between its own iterations.
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (col < N)
    {
        for (int j = 0; j < N; j++)
        {
            C[col + N*j] = A[col + N*j] + B[col + N*j];
        }
    }
}
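
Neither the launch configuration nor the timing code appears in the question. A minimal sketch of how one might launch and time these kernels with CUDA events to get kernel-only figures like those above (block size and variable names are assumptions):

int threads = 256;
int blocks = (N + threads - 1) / threads;   // enough blocks to cover N threads

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matAddKernel3<<<blocks, threads>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

cudaEventSynchronize(stop);                 // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds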

My question is: why don't GPU threads seem to benefit from working on adjacent data, which would then help them get more cache hits?

Recommended Answer

GPU threads do benefit from working on adjacent data; what you are missing is that GPU threads are not independent the way CPU threads are. They work in groups called warps. A warp groups together 32 threads that execute in lockstep, in much the same way as a single CPU thread executing SIMD instructions with a width of 32.

So in reality the kernel that uses one thread per column is the most efficient, because adjacent threads inside a warp access adjacent data locations in memory, and that is the most efficient way to access global memory (the hardware can coalesce those accesses).
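
To make that concrete, consider which addresses the 32 threads t = 0..31 of one warp touch on a single loop iteration j in each kernel (4-byte floats assumed):

// matAddKernel2 (one thread per row): thread t reads A[N*t + j].
// Consecutive threads are N floats (4*N bytes) apart: a strided
// pattern that cannot be coalesced, costing one transaction per thread.

// matAddKernel3 (one thread per column): thread t reads A[t + N*j].
// Consecutive threads are 4 bytes apart: a contiguous pattern that the
// hardware coalesces into a few wide memory transactions per warp.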

See the CUDA documentation.
