CUDA Matrix multiplication breaks for large matrices


Problem Description


I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows Server 2008 R2 Enterprise with an Nvidia GTX 480. The following code works fine with values of "Width" (the matrix width) up to about 2500 or so.

int size = Width*Width*sizeof(float);
float* Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;

//Allocate Device Memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);

//Copy Matrix from Host Memory to Device Memory
err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

//Setup the execution configuration
dim3 dimBlock(TileWidth, TileWidth, 1);
dim3 dimGrid(ceil((float)(Width)/TileWidth), ceil((float)(Width)/TileWidth), 1);

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

//Free Device Memory
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);

When I set the "Width" to 3000 or greater, I get the following error after a black screen:

I looked online and saw that some people had this issue because the watchdog kills the kernel after it hangs for more than 5 seconds. I tried editing "TdrDelay" in the registry; this delayed the time before the black screen, but the same error appeared, so I concluded this was not my issue.

I debugged into my code and found this line to be the culprit:

err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

This is what I use to return my result set from the device after my matrix multiplication kernel function is called. Everything up until this point seems to run fine. I believe I am allocating memory correctly and cannot figure out why this is happening. I thought maybe I didn't have enough memory on my card for this but then shouldn't cudaMalloc have returned an error? (I confirmed it didn't while debugging).

Any ideas/assistance would be greatly appreciated!... Thanks a lot guys!!

Kernel code:

//Matrix Multiplication Kernel - Multi-Block Implementation
__global__ void MatrixMultiplicationMultiBlock_Kernel (float* Md, float* Nd, float* Pd, int Width) 
{
int TileWidth = blockDim.x;

//Get row and column from block and thread ids
int Row = (TileWidth*blockIdx.y) + threadIdx.y;
int Column = (TileWidth*blockIdx.x) + threadIdx.x;

//Pvalue store the Pd element that is computed by the thread
float Pvalue = 0;

for (int i = 0; i < Width; ++i)
{
    float Mdelement = Md[Row * Width + i];
    float Ndelement = Nd[i * Width + Column];
    Pvalue += Mdelement * Ndelement;
}

//Write the matrix to device memory each thread writes one element
Pd[Row * Width + Column] = Pvalue;
}

I also have this other function that uses shared memory, and it also gives the same error:

Call:

            MatrixMultiplicationSharedMemory_Kernel<<<dimGrid, dimBlock, sizeof(float)*TileWidth*TileWidth*2>>>(Md, Nd, Pd, Width);

Kernel code:

 //Matrix Multiplication Kernel - Shared Memory Implementation
 __global__ void MatrixMultiplicationSharedMemory_Kernel (float* Md, float* Nd, float* Pd, int Width) 
 {
int TileWidth = blockDim.x;

//Initialize shared memory
extern __shared__ float sharedArrays[];
float* Mds = (float*) &sharedArrays;
float* Nds = (float*) &Mds[TileWidth*TileWidth];

int tx = threadIdx.x;
int ty = threadIdx.y;

//Get row and column from block and thread ids
int Row = (TileWidth*blockIdx.y) + ty;
int Column = (TileWidth*blockIdx.x) + tx;
float Pvalue = 0;

//For each tile, load the element into shared memory
for( int i = 0; i < ceil((float)Width/TileWidth); ++i)
{
    Mds[ty*TileWidth+tx] = Md[Row*Width + (i*TileWidth + tx)];
    Nds[ty*TileWidth+tx] = Nd[(ty + (i * TileWidth))*Width + Column]; 

    __syncthreads();

    for( int j = 0; j < TileWidth; ++j)
    {
        Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
    }

    __syncthreads();
}

//Write the matrix to device memory each thread writes one element
Pd[Row * Width + Column] = Pvalue;
}

Solution

Controlling the WDDM Timeout

The problem is actually the kernel, not the cudaMemcpy(). When you launch the kernel, the GPU goes off and does the work asynchronously with the CPU, so it is only when you synchronize with the GPU that you have to wait for the work to finish. cudaMemcpy() involves an implicit synchronization, hence that is where you see the problem.

You could double-check this by calling cudaThreadSynchronize() after the kernel and the problem will appear to be on the cudaThreadSynchronize() instead of the cudaMemcpy().
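A minimal sketch of that check (not from the original post; the variable names are illustrative) would look like this. cudaThreadSynchronize() matches the CUDA 3.2-era API; newer toolkits use cudaDeviceSynchronize() instead:

//Launch as before, then synchronize and inspect the error status explicitly
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

cudaError_t launchErr = cudaGetLastError();      //error detected at launch time
cudaError_t syncErr = cudaThreadSynchronize();   //error detected during execution
if (launchErr != cudaSuccess || syncErr != cudaSuccess)
{
    printf("Kernel failed: %s\n",
           cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
}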

After changing the TDR timeout, did you restart your machine? Unfortunately Windows needs to be restarted to change the TDR settings. This Microsoft document has a fairly good description of the full settings available.

Kernel problems

In this case the problem is not actually the WDDM timeout. There are errors in the kernel which you would need to resolve (for example you should be able to increment i by more than one on each iteration) and checking out the matrixMul sample in the SDK may be useful. Incidentally, I hope this is a learning exercise since in reality you would be better off (for performance) using CUBLAS to perform matrix multiplication.
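For comparison, a sketch of the CUBLAS route for the same square multiply, using the cublasSgemm "v2" API available from CUDA 4.0 onward (CUDA 3.2 ships an older cublas.h interface with the same math). Here d_A, d_B and d_C are placeholder device pointers holding the row-major matrices; because CUBLAS assumes column-major storage, the A and B arguments are swapped to obtain the row-major product C = A * B:

#include <cublas_v2.h>

//Sketch only: Width x Width single-precision multiply, C = A * B
cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 1.0f;
const float beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            Width, Width, Width,
            &alpha,
            d_B, Width,   //B passed first ...
            d_A, Width,   //... then A, to get the row-major product
            &beta,
            d_C, Width);

cublasDestroy(handle);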

The most critical problem in the code is that you are using shared memory without actually allocating any. In your kernel you have:

//Initialize shared memory
extern __shared__ float sharedArrays[];

But when you launch the kernel you do not specify how much shared memory to allocate for each block:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

The <<<>>> syntax actually takes four arguments where the third and fourth are optional. The fourth is the stream index which is used to get overlap between compute and data transfer (and for concurrent kernel execution) but the third argument specifies the amount of shared memory per block. In this case I assume you want to store TileWidth * TileWidth floats in the shared memory, so you would use:

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, dimBlock.x * dimBlock.x * sizeof(float)>>>(Md, Nd, Pd, Width);

The main problem

As you mention in your comment, the actual problem was that your matrix width was not a multiple of the block width (and height, since it is square), meaning that threads beyond the end of the matrix would access beyond the end of the array. The code should either handle the non-multiple case or ensure that the width is a multiple of the block size.
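A sketch of the guard for the non-shared-memory kernel (illustrative, not the poster's code): keep the ceil()-based grid so the whole matrix is covered, and let out-of-range threads do nothing:

//Threads whose Row/Column fall past the matrix edge skip the work entirely,
//so any Width is safe regardless of TileWidth
if (Row < Width && Column < Width)
{
    float Pvalue = 0;
    for (int i = 0; i < Width; ++i)
        Pvalue += Md[Row * Width + i] * Nd[i * Width + Column];
    Pd[Row * Width + Column] = Pvalue;
}

The shared-memory version needs a little more care: every thread in the block must still reach both __syncthreads() calls, so only the global loads and the final store should be guarded, not the synchronization.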

I should have suggested this earlier, but it is often useful to run cuda-memcheck to check for memory access violations like this.
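For example, running the application as cuda-memcheck ./matrixMul (the binary name here is just a placeholder) reports out-of-bounds global reads and writes along with the offending kernel and the block and thread indices involved.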
