CUDA: Allocating 1d device memory to copy 2d pointer-to-pointer host array to and from GPU


Problem description

I'm working on a project attempting to parallelize and speed up some statistical/numerical computation scripts designed by other people. Before this project started, I was a complete novice when it came to programming (I'm more the analytical math type), so please forgive me for any ensuing ignorance or complete misunderstanding. They're using the following function to generate matrices:

double ** CreateMatrix(int m, int n)
{
    int i;
    double **A;
    // pointer allocation to rows
    A = (double **) malloc((size_t)((m*n)*sizeof(double)));
    // allocate rows and set pointers
    A[0] = (double *) malloc((size_t)((m*n)*sizeof(double)));
    for(i=1; i<=m; i++){
        A[i]=A[i-1] + n;
    }
    // return the pointer to array of pointers to rows
    return A;
}

I'm not too keen on reworking the basic structure of their matrix objects, as they've designed their entire code around it, so I've been trying to pass these structures to the GPU as 1D linear memory instead, since I've read that allocating memory for, and copying, a pointer-to-array-of-pointers structure on the GPU is too inefficient. I've tried to get this most basic example working:

__global__ void MatrixMult(double *A, double *B, double *C, int N)
{
    int col = blockDim.x*blockIdx.x + threadIdx.x;
    int row = blockDim.y*blockIdx.y + threadIdx.y;

    if( col < N && row < N){
        C[col*N + row] = A[col*N + row] + B[col*N + row]; 
        //C[col][row] = B[col][row] + A[col][row];
    }

}

const int N = 5000;

int main()
{
    double **h_A,**h_B, **h_C;
    h_A = CreateMatrix(N,N);
    h_B = CreateMatrix(N,N);
    h_C = CreateMatrix(N,N);
    for(int i=0; i<N; i++){
        for(int j=0; j<N; j++){
            h_A[i][j]=1;
            h_B[i][j]=6;
            h_C[i][j]=0;
        }
    }

    size_t pitchA,pitchB,pitchC;

    double *d_A,*d_B,*d_C;

    cudaMallocPitch(&d_A, &pitchA, N*sizeof(double), N);
    cudaMallocPitch(&d_B, &pitchB, N*sizeof(double), N);
    cudaMallocPitch(&d_C, &pitchC, N*sizeof(double), N);
    cudaMemcpy2D(d_A, pitchA, h_A, N*sizeof(double), N*sizeof(double), N, cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_B, pitchB, h_B, N*sizeof(double), N*sizeof(double), N, cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_C, pitchC, h_C, N*sizeof(double), N*sizeof(double), N, cudaMemcpyHostToDevice);

    dim3 GridSize(250,250,1);
    dim3 BlockSize(20,20,1);

    MatrixMult<<<GridSize, BlockSize>>>(d_A,d_B,d_C,N);

    cudaMemcpy2D(h_C, N*sizeof(double), d_C,pitchC, N*sizeof(double), N, cudaMemcpyDeviceToHost);
    PrintMatrix(h_C,N,N);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

The problem is I get a segfault when I try to use the PrintMatrix function to check the results:

void PrintMatrix(double **A, int m, int n)
{
    int i, j;
    for(i=0; i<m; i++){
        for(j=0; j<n; j++){
            cout << A[i][j] << "\t";
        }
        cout << "\n";
    }
}

I guess there's some subtle realignment of memory I'm not understanding. I guess my first question is whether it's possible to pass a 2D double** object as a 1D double* to the device, do some computation, then copy it back to its original double** format on the host? If so, can someone tell me what I'm missing?

Recommended answer

I believe your CreateMatrix was broken (it allocates the row-pointer array with the wrong size), but it was probably still functional; the version I have below is slightly different from yours, although yours would probably work. However, your general matrix handling between host and device was broken: despite their names, cudaMemcpy2D and cudaMallocPitch are not actually for handling double-pointer (**) arrays. Review the documentation.
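
For reference, here is a minimal sketch of how those pitched calls are typically used (the kernel name AddPitched, the buffer h_flat, and the 32x32 size are just illustrative, not part of your code): the host side is a single contiguous buffer rather than a double**, the device pitch comes back from cudaMallocPitch in bytes, and the kernel steps between rows using that byte pitch.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: pitch is in bytes, so rows are stepped through with char* arithmetic
__global__ void AddPitched(const double *A, const double *B, double *C, size_t pitch, int N)
{
    int col = blockDim.x*blockIdx.x + threadIdx.x;
    int row = blockDim.y*blockIdx.y + threadIdx.y;
    if (col < N && row < N) {
        const double *rowA = (const double *)((const char *)A + row*pitch);
        const double *rowB = (const double *)((const char *)B + row*pitch);
        double       *rowC = (double *)((char *)C + row*pitch);
        rowC[col] = rowA[col] + rowB[col];
    }
}

int main()
{
    const int N = 32;                                   // small illustrative size
    double *h_flat = (double *)malloc(N*N*sizeof(double));
    for (int i = 0; i < N*N; i++) h_flat[i] = 1.0;

    double *d_A, *d_B, *d_C;
    size_t pitchA, pitchB, pitchC;
    cudaMallocPitch((void **)&d_A, &pitchA, N*sizeof(double), N);
    cudaMallocPitch((void **)&d_B, &pitchB, N*sizeof(double), N);
    cudaMallocPitch((void **)&d_C, &pitchC, N*sizeof(double), N);

    // The host buffer is tightly packed, so its pitch is simply N*sizeof(double)
    cudaMemcpy2D(d_A, pitchA, h_flat, N*sizeof(double), N*sizeof(double), N, cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_B, pitchB, h_flat, N*sizeof(double), N*sizeof(double), N, cudaMemcpyHostToDevice);

    dim3 block(16,16);
    dim3 grid((N+block.x-1)/block.x, (N+block.y-1)/block.y);
    // All three allocations have the same width and height, so their pitches are equal here
    AddPitched<<<grid, block>>>(d_A, d_B, d_C, pitchA, N);

    cudaMemcpy2D(h_flat, N*sizeof(double), d_C, pitchC, N*sizeof(double), N, cudaMemcpyDeviceToHost);
    printf("h_flat[0] = %f\n", h_flat[0]);              // expect 2.0
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_flat);
    return 0;
}

For this particular problem, though, the plain cudaMalloc/cudaMemcpy route below is simpler.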

However, your CreateMatrix (suitably fixed) does allow your code to work correctly with only slight modification. CreateMatrix cleverly allows doubly-subscripted access on the host while at the same time ensuring that the underlying data is contiguous. Therefore we can use A[0] as a pointer directly to the contiguous underlying data in A, which means we can use ordinary cudaMalloc and cudaMemcpy. Here is a fully worked example:

#include <iostream>
#include <cstdio>   // printf is used in ValidateMatrix and main
#define MAT_DIM 32
#define T1_VAL 1
#define T2_VAL 6

double ** CreateMatrix(int m, int n)
{
    int i;
    double **A;
    // pointer allocation to rows
    A = (double **) malloc((size_t)(m*sizeof(double *)));
    // allocate rows and set pointers
    A[0] = (double *) malloc((size_t)((m*n)*sizeof(double)));
    for(i=1; i<m; i++){   // i < m: the pointer array has only m entries
        A[i]=A[i-1] + n;
    }
    // return the pointer to array of pointers to rows
    return A;
}

void PrintMatrix(double **A, int m, int n)
{
    int i, j;
    for(i=0; i<m; i++){
        for(j=0; j<n; j++){
            std::cout << A[i][j] << "\t";
        }
        std::cout << "\n";
    }
}

int ValidateMatrix(double **A, int m, int n)
{
    int i, j;
    for(i=0; i<m; i++)
        for(j=0; j<n; j++)
            if (A[i][j] != (T1_VAL+T2_VAL)) {printf("mismatch at %d, %d, value: %f\n", i,j,A[i][j]); return 0;}
    return 1;
}

__global__ void MatrixMult(double *A, double *B, double *C, int N)
{
    int col = blockDim.x*blockIdx.x + threadIdx.x;
    int row = blockDim.y*blockIdx.y + threadIdx.y;

    if( (col < N) && (row < N)){
        C[col*N + row] = A[col*N + row] + B[col*N + row];
        //C[col][row] = B[col][row] + A[col][row];
    }

}

const int N = MAT_DIM;

int main()
{
    double **h_A,**h_B, **h_C;
    h_A = CreateMatrix(N,N);
    h_B = CreateMatrix(N,N);
    h_C = CreateMatrix(N,N);
    for(int i=0; i<N; i++){
        for(int j=0; j<N; j++){
            h_A[i][j]=T1_VAL;
            h_B[i][j]=T2_VAL;
            h_C[i][j]=0;
        }
    }

    double *d_A,*d_B,*d_C;

    cudaMalloc(&d_A, N*N*sizeof(double));
    cudaMalloc(&d_B, N*N*sizeof(double));
    cudaMalloc(&d_C, N*N*sizeof(double));
    cudaMemcpy(d_A, h_A[0], N*N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B[0], N*N*sizeof(double), cudaMemcpyHostToDevice);

    dim3 BlockSize(16,16);
    dim3 GridSize((N+BlockSize.x-1)/BlockSize.x,(N+BlockSize.y-1)/BlockSize.y);

    MatrixMult<<<GridSize, BlockSize>>>(d_A,d_B,d_C,N);

    cudaMemcpy(h_C[0], d_C,N*N*sizeof(double),cudaMemcpyDeviceToHost);
    //PrintMatrix(h_C,N,N);
    if (!ValidateMatrix(h_C, N, N)) printf("Failure!\n");
    else printf("Success!\n");
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
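
Assuming the file is saved as something like matadd.cu (the name is arbitrary), it should build and run along these lines, printing Success! if the validation passes:

nvcc -o matadd matadd.cu
./matadd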

The proximate reason your PrintMatrix was segfaulting is that the device-to-host cudaMemcpy2D operation was overwriting the array of row pointers that CreateMatrix had set up to index into h_C. This is fixed by copying into the single pointer to the underlying data (h_C[0]), as I have shown.
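
To make that concrete, here is a rough sketch of the two possible copy destinations (reusing the names from the example above); only the second leaves the row pointers created by CreateMatrix intact:

// h_C    -> [ptr to row 0][ptr to row 1] ... [ptr to row N-1]   (array of row pointers)
// h_C[0] -> [c00 c01 ... c0(N-1) c10 ...]                        (contiguous N*N doubles)

// Copying raw bytes to h_C itself (roughly what the original device-to-host
// cudaMemcpy2D into h_C was doing) overwrites the row pointers:
// cudaMemcpy(h_C, d_C, N*N*sizeof(double), cudaMemcpyDeviceToHost);   // clobbers pointers

// Copying to h_C[0] fills the contiguous data block instead:
cudaMemcpy(h_C[0], d_C, N*N*sizeof(double), cudaMemcpyDeviceToHost);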

There is nothing wrong with your PrintMatrix and you should be able to uncomment it if you want to. I just didn't want to look at printout for large matrices.

As an aside, your MatrixMult kernel is actually adding 2 matrices. I'm sure you knew that.
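
If an actual product is eventually wanted, a minimal naive kernel over the same flattened storage might look something like the following (the name MatrixMultReal is just illustrative, and it assumes row-major indexing, i.e. element (row, col) stored at row*N + col); it can be launched with the same GridSize/BlockSize as above:

__global__ void MatrixMultReal(const double *A, const double *B, double *C, int N)
{
    int col = blockDim.x*blockIdx.x + threadIdx.x;
    int row = blockDim.y*blockIdx.y + threadIdx.y;

    if ((col < N) && (row < N)) {
        double sum = 0.0;
        // dot product of row 'row' of A with column 'col' of B
        for (int k = 0; k < N; k++)
            sum += A[row*N + k] * B[k*N + col];
        C[row*N + col] = sum;
    }
}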
