CUDA Matrix Addition

Problem description

Can anyone help me with matrix addition in CUDA C? The matrices may be square or non-square, and the block dimensions should be 2D. Here is the code I have written, but it does not work for matrices larger than 2*2. Can anyone help me solve this?

#include <iostream>
#include <cstdio>   // printf, scanf
#include <cstdlib>  // malloc, free
#include <cmath>    // ceil
#include <cuda.h>
#define blocksize 16
    
texture<float, 1> texVecA;
texture<float, 1> texVecB;

__constant__ int x;
__constant__ int y;

     
__global__ void MatrixAdd_d(float *C)
  {
    int N=x*y;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int index = i*N + j;
    
    float flValA = tex1Dfetch(texVecA, index);
    float flValB = tex1Dfetch(texVecB, index);
    
    if(i<N)
    {
		C[index]=flValA +flValB; 
	}
	
  }
     
 int main()
 {
    float *a_h, *b_h, *c_h; // pointers to host memory; a.k.a. CPU
    float *a_d, *b_d, *c_d; // pointers to device memory; a.k.a. GPU
    int n,m, i, j, index;
    printf("Enter dimension of matrix\n");
    scanf("%d%d",&n,&m);
    int N=m*n;
    // allocate arrays on host
    a_h = (float *)malloc(sizeof(float)*n*m);
    b_h = (float *)malloc(sizeof(float)*n*m);
    c_h = (float *)malloc(sizeof(float)*n*m);
    
   
    // allocate arrays on device
    cudaMalloc((void **)&a_d,m*n*sizeof(float));
    cudaMalloc((void **)&b_d,m*n*sizeof(float));
    cudaMalloc((void **)&c_d,m*n*sizeof(float));
   
    // initialize the arrays
   
    
	printf("Enter elements of first Matrix:\n");

	for(int i=0;i<n;i++)
	{
		for(int j=0;j<m;j++)
		{
			scanf("%f",&a_h[i * m + j]);
		}
	}
	printf("Enter elements of second matrix:\n");
	
	for(int i=0;i<n;i++)
    {
		for(int j=0;j<m;j++)
		{
			scanf("%f",&b_h[i * m + j]);

		}
	}


    
    // copy and run the code on the device
    cudaMemcpy(a_d,a_h,N*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(b_d,b_h,N*sizeof(float),cudaMemcpyHostToDevice);
    
    cudaMemcpyToSymbol(x, &m,  sizeof(int), 0);
	cudaMemcpyToSymbol(y, &n,  sizeof(int), 0);
    
    cudaBindTexture(0, texVecA, a_d, (N * sizeof(float)));
	cudaBindTexture(0, texVecB, b_d, (N * sizeof(float)));
	
	dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( ceil(float(n)/float(dimBlock.x)), ceil(float(n)/float(dimBlock.y)) );
    
    MatrixAdd_d<<<dimGrid, dimBlock>>>(c_d);
    cudaMemcpy(c_h,c_d,N*sizeof(float),cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();
    
   
    
	
    // print out the answer
    for(j=0;j<n;j++)
	{
		for(i=0;i<m;i++)
			{
				index = j*m+i;
				printf("A + B = C: %d %d %f + %f = %f\n",i,j,a_h[index],b_h[index],c_h[index]);
			}
    }
    
	cudaUnbindTexture(texVecA);
	cudaUnbindTexture(texVecB);
    
    // cleanup...
    free(a_h);
    free(b_h);
    free(c_h);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return(0);
    }

Recommended answer

The NVIDIA GPU Computing SDK has a few examples of multiplication, which for all intents and purposes is the same as addition.

You shouldn't need texture memory for this. You are essentially accessing the whole chunk of memory in a linear manner, which works fine with normal global memory.
Texture memory is just global memory with an extra bit of hardware between it and the GPU, which adds a slight overhead. If you were using texture memory correctly, the benefits would outweigh this, but you aren't.

Now for some issues with your code:
Your equation for calculating the index is backwards: you have x*size + y, and it should be y*size + x. This gives coalesced memory reads, which can substantially improve performance.

You should call __syncthreads() after your if. Refer to the CUDA programming guide for more information.
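
To make the index-ordering point above concrete, here is a small host-side sketch that prints the flat offsets neighbouring threads would touch under each formula. The helper names `rowMajorIndex` and `xMajorIndex` and the width of 8 are assumptions made for illustration; they do not appear in the original post.

```cuda
#include <cstdio>

// Hypothetical helper (not from the post): flat offset of element (row, col)
// in a row-major matrix that has `cols` columns, i.e. y*size + x.
__host__ __device__ inline int rowMajorIndex(int row, int col, int cols)
{
    return row * cols + col;
}

// Hypothetical helper: the ordering the answer calls backwards, x*size + y.
__host__ __device__ inline int xMajorIndex(int x, int y, int stride)
{
    return x * stride + y;
}

int main()
{
    const int cols = 8;  // assumed matrix width, chosen only for the demo
    const int y = 2;     // one fixed row

    // Neighbouring threads in a warp differ in threadIdx.x, so x varies here.
    for (int x = 0; x < 4; ++x) {
        std::printf("x=%d  y*cols+x=%d  x*cols+y=%d\n",
                    x, rowMajorIndex(y, x, cols), xMajorIndex(x, y, cols));
    }
    return 0;
}
```

With y*cols + x, consecutive values of x map to consecutive offsets, so adjacent threads read adjacent floats. With x*cols + y, each step in x jumps by a full row width.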

I have to go now and I haven't run your code, so there could be other issues. Compare your code to the SDK samples.
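
Putting the suggestions above together, a minimal sketch of the addition using plain global memory and a 2D block might look like the following. It is an illustration under stated assumptions rather than a tested fix of the code in the question: the names `matrixAdd`, `addOnDevice` and `CUDA_CHECK` are hypothetical, and the 37 x 50 test size is chosen only to exercise a non-square matrix whose sides are not multiples of the block size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Illustrative error-check macro (the name is an assumption, not from the post).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// C = A + B for row-major rows x cols matrices held in global memory.
__global__ void matrixAdd(const float* a, const float* b, float* c,
                          int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x selects the column
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y selects the row
    if (row < rows && col < cols) {                   // guard partial edge blocks
        int index = row * cols + col;                 // y * width + x
        c[index] = a[index] + b[index];
    }
    // No __syncthreads() in this sketch: each thread touches only its own element.
}

// Host wrapper: copies the inputs over, launches the kernel, copies the sum back.
void addOnDevice(const float* a, const float* b, float* c, int rows, int cols)
{
    size_t bytes = size_t(rows) * size_t(cols) * sizeof(float);
    float *a_d = nullptr, *b_d = nullptr, *c_d = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&a_d, bytes));
    CUDA_CHECK(cudaMalloc((void**)&b_d, bytes));
    CUDA_CHECK(cudaMalloc((void**)&c_d, bytes));
    CUDA_CHECK(cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice));

    // Grid x covers the columns and grid y covers the rows, rounded up.
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((cols + BLOCK_SIZE - 1) / BLOCK_SIZE,
              (rows + BLOCK_SIZE - 1) / BLOCK_SIZE);
    matrixAdd<<<grid, block>>>(a_d, b_d, c_d, rows, cols);
    CUDA_CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(a_d));
    CUDA_CHECK(cudaFree(b_d));
    CUDA_CHECK(cudaFree(c_d));
}

int main()
{
    const int rows = 37, cols = 50;  // assumed non-square test size
    std::vector<float> a(rows * cols), b(rows * cols), c(rows * cols);
    for (int i = 0; i < rows * cols; ++i) {
        a[i] = float(i);
        b[i] = 2.0f * float(i);
    }

    addOnDevice(a.data(), b.data(), c.data(), rows, cols);

    // Compare against the same sum computed on the host.
    int mismatches = 0;
    for (int i = 0; i < rows * cols; ++i) {
        if (c[i] != a[i] + b[i]) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The grid is sized from the column count in x and the row count in y, and the kernel checks both bounds, which is what lets the same launch handle square and non-square inputs. Passing the dimensions as kernel arguments is a design choice made here for brevity; the `__constant__` variables from the question would serve as well.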

