CUDA Matrix Addition Timings, By Row Vs. By Column

Problem Description

I am currently learning CUDA, and am working through some exercises. One of them is to implement kernels that add matrices in 3 different ways: 1 thread per element, 1 thread per row, and 1 thread per column. The matrices are square, and are implemented as 1D vectors that I simply index into with

A[N*row + col]
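
The first kernel (1 thread per element) is not shown below; a minimal sketch of what it might look like, assuming a 2D grid of 2D blocks (the kernel name and launch shape are my own, not from the question):

__global__
void matAddKernel1(float* A, float* B, float* C, int N)
{
    // Each thread computes exactly one output element.
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < N && col < N)
    {
        C[N*row + col] = A[N*row + col] + B[N*row + col];
    }
}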



Intuitively, I expected the first option to be the slowest due to thread overhead, and the second to be the fastest, since a single thread would be working on adjacent data.

On the CPU, with dense 8000 x 8000 matrices, I get:

Adding on CPU - Adding down columns
Compute Time Taken: 2.21e+00 s
Adding on CPU - Adding across rows
Compute Time Taken: 2.52e-01 s
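
For reference, the two CPU versions presumably differ only in loop order; a sketch, given the same row-major A[N*row + col] layout (the exact CPU code is not in the question):

// Adding across rows: the inner loop walks contiguous memory (cache friendly).
for (int row = 0; row < N; row++)
    for (int col = 0; col < N; col++)
        C[N*row + col] = A[N*row + col] + B[N*row + col];

// Adding down columns: the inner loop strides by N floats between
// accesses, so most of them miss the cache.
for (int col = 0; col < N; col++)
    for (int row = 0; row < N; row++)
        C[N*row + col] = A[N*row + col] + B[N*row + col];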

So adding across rows is about an order of magnitude faster, due to many more cache hits. On the GPU with the same matrices I get:

Adding one element per thread 
Compute Time Taken: 7.42e-05 s
Adding one row per thread 
Compute Time Taken: 2.52e-05 s
Adding one column per thread 
Compute Time Taken: 1.57e-05 s

Which is non-intuitive to me. The 30-40% speedup for the last case is consistent above roughly 1000 x 1000 matrices. Note that these timings are only the kernel execution, and don't include the data transfer between host and device. Below are my two kernels for comparison.

__global__
void matAddKernel2(float* A, float* B, float* C, int N)
{
    // One thread per row: each thread sweeps across its own row,
    // touching consecutive addresses from its own point of view.
    int row = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < N)
    {
        for (int j = 0; j < N; j++)
        {
            C[N*row + j] = A[N*row + j] + B[N*row + j];
        }
    }
}



__global__
void matAddKernel3(float* A, float* B, float* C, int N)
{
    // One thread per column: each thread walks down its own column,
    // striding N floats between its own iterations.
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (col < N)
    {
        for (int j = 0; j < N; j++)
        {
            C[col + N*j] = A[col + N*j] + B[col + N*j];
        }
    }
}
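
Neither the launch configuration nor the timing code appears in the question. A minimal sketch of how one might launch and time these kernels with CUDA events to get kernel-only figures like those above (block size and variable names are assumptions):

int threads = 256;
int blocks = (N + threads - 1) / threads;   // enough blocks to cover N threads

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matAddKernel3<<<blocks, threads>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

cudaEventSynchronize(stop);                 // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds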

My question is: why don't GPU threads seem to benefit from working on adjacent data, which would then help them get more cache hits?

Recommended Answer

GPU threads do benefit from working on adjacent data; what you are missing is that GPU threads are not independent the way CPU threads are. They work in groups called warps. A warp groups together 32 threads that execute in lockstep, in much the same way as a single CPU thread executing SIMD instructions with a width of 32.

So in reality the kernel that uses one thread per column is the most efficient, because adjacent threads inside a warp access adjacent data locations in memory, and that is the most efficient way to access global memory (the hardware can coalesce those accesses).
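
To make that concrete, consider which addresses the 32 threads t = 0..31 of one warp touch on a single loop iteration j in each kernel (4-byte floats assumed):

// matAddKernel2 (one thread per row): thread t reads A[N*t + j].
// Consecutive threads are N floats (4*N bytes) apart: a strided
// pattern that cannot be coalesced, costing one transaction per thread.

// matAddKernel3 (one thread per column): thread t reads A[t + N*j].
// Consecutive threads are 4 bytes apart: a contiguous pattern that the
// hardware coalesces into a few wide memory transactions per warp.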

See the CUDA documentation.
