Allocate 2D array with cudaMallocPitch and copying with cudaMemcpy2D

Question

I'm new to CUDA and would appreciate your help.

I need to store multiple elements of a 2D array into a vector and then work with the vector, but my code does not work. While debugging, I found a mistake in how I allocate the 2D array on the device with cudaMallocPitch and copy to that array with cudaMemcpy2D. This is my code:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cmath>

#define maxThreads 96

__global__ void extract(int mSize, float* dev_vector, float* dev_matrix, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    while(idx<N)
    {
        dev_vector[idx] = *(dev_matrix+(mSize*idx+N));
        idx += blockDim.x * gridDim.x;
    }
}

int main()
{
    //CPU variables
    int mSize = 5;
    float* matrix;
    int N = 4; // Vector size
    int i,j;
    float* vector;
    int blocks, threads;

    float* dev_matrix;
    float* dev_vector;

    blocks = 1+((N-1)/maxThreads);
    threads = 1+((N-1)/blocks);

    unsigned long int pitch;
    unsigned long int memsize_vector = N*sizeof(float);
    unsigned long int memsize_matrix = mSize*sizeof(float);


    matrix = new float[memsize_matrix*memsize_matrix];
    vector = new float[memsize_vector];

    //Create 2D array
    for(i=0; i<mSize; i++)
        for(j=0; j<mSize; j++)
        {
            matrix[i+mSize*j] = ((i+1)+(j+1));
        }

    printf("\n");
    for (i=0; i<mSize; i++){
        for(j=0; j<mSize; j++){
            printf("% 1.5f ", matrix[i+mSize*j]);
        }
        printf("\n");
    }
    printf("\n");


    cudaMallocPitch((void **)&dev_matrix, &pitch, memsize_matrix, mSize);
    cudaMalloc((void **)&dev_vector, memsize_vector);

    cudaMemcpy2D(dev_matrix, pitch, matrix, memsize_matrix, memsize_matrix, mSize,
                     cudaMemcpyHostToDevice);

    extract<<<blocks,threads>>>(mSize, dev_vector, dev_matrix, N);
    cudaDeviceSynchronize();

    cudaMemcpy(vector, dev_vector, memsize_vector, cudaMemcpyDeviceToHost);

    printf("Vector values are:\n");
    for(i=0; i<N; i++)
        printf(" % 1.5f ", vector[i]);
    printf("\n");

    cudaFree(dev_matrix);
    cudaFree(dev_vector);

}


Answer

There are lots of problems in this code, including but not limited to: using array sizes in bytes and in elements interchangeably in several places, using incorrect types (note that size_t exists for a very good reason), potential truncation and type-casting problems, and more.
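To illustrate the bytes-versus-elements point: in the question's host code, memsize_matrix is a row width in bytes (5 * sizeof(float)), yet it is also used as an element count in new float[memsize_matrix*memsize_matrix], which allocates far more floats than the 25 the matrix needs. A minimal sketch of keeping the two kinds of size apart might look like the following. The struct and function names are hypothetical and not part of the original post.

```cuda
#include <cstddef>

// Hypothetical helper (not from the original post): keeps element counts
// and byte widths in separate, clearly named size_t fields.
struct MatrixSizes {
    std::size_t rows;       // number of rows, in elements
    std::size_t cols;       // number of columns, in elements
    std::size_t rowBytes;   // width of one row in bytes (what cudaMallocPitch
                            // and cudaMemcpy2D expect as "width")
    std::size_t elemCount;  // total number of floats to allocate on the host
};

inline MatrixSizes makeSizes(std::size_t rows, std::size_t cols)
{
    return MatrixSizes{rows, cols, cols * sizeof(float), rows * cols};
}
```

With sizes tracked this way, a host allocation would use elemCount (for example new float[s.elemCount]), while the CUDA pitched-memory calls would take rowBytes as the width and rows as the height.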

But the core problem is the addressing of pitched memory inside the kernel, to which you never even pass the pitch value. The documentation for cudaMallocPitch gives the correct method for addressing pitched memory inside a kernel. Your kernel might then look like this:

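__global__ void extract(size_t mpitch, float* dev_vector, float* dev_matrix, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    while(idx<N)
    {
        // mpitch is the row stride in bytes, so step to row idx through a char*,
        // then cast back to float* before indexing column N within that row.
        // (Column N must be a valid column index, i.e. less than the row length.)
        float* row = (float*)((char*)dev_matrix + idx * mpitch);
        dev_vector[idx] = row[N];
        idx += stride;
    }
}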

[Disclaimer: never compiled or tested, use at your own risk.]

You will then have to fix all the problems in the host code to reflect whatever kernel changes you make.
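As a rough sketch of what the matching host side could look like, the following wraps the allocation, the 2D copy, the kernel launch, and the copy back in one function. It is an illustration under several assumptions, not the asker's final code: the matrix is stored row-major on the host, one column is extracted across all rows, and the names extractColumn and extractColumnOnDevice are hypothetical. The block size of 96 mirrors maxThreads from the question.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Copies column `col` of a pitched, row-major matrix into dev_vector.
// `pitch` is the row stride in bytes, as returned by cudaMallocPitch.
__global__ void extractColumn(const float* dev_matrix, size_t pitch,
                              float* dev_vector, int rows, int col)
{
    int stride = blockDim.x * gridDim.x;
    for (int r = threadIdx.x + blockIdx.x * blockDim.x; r < rows; r += stride) {
        // Step to row r in bytes first, then index by element within the row.
        const float* row = (const float*)((const char*)dev_matrix + r * pitch);
        dev_vector[r] = row[col];
    }
}

// Hypothetical host wrapper (not from the original post).
// Assumes matrix.size() == rows * cols and 0 <= col < cols.
// Returns true if every CUDA call succeeded.
bool extractColumnOnDevice(const std::vector<float>& matrix, int rows, int cols,
                           int col, std::vector<float>& out)
{
    float* dev_matrix = nullptr;
    float* dev_vector = nullptr;
    size_t pitch = 0;
    const size_t rowBytes = cols * sizeof(float);   // width in bytes
    const size_t vecBytes = rows * sizeof(float);

    bool ok =
        cudaMallocPitch((void**)&dev_matrix, &pitch, rowBytes, rows) == cudaSuccess &&
        cudaMalloc((void**)&dev_vector, vecBytes) == cudaSuccess &&
        // Host rows are tightly packed, so the source pitch equals rowBytes.
        cudaMemcpy2D(dev_matrix, pitch, matrix.data(), rowBytes,
                     rowBytes, rows, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 96;
        const int blocks = (rows + threads - 1) / threads;
        extractColumn<<<blocks, threads>>>(dev_matrix, pitch, dev_vector, rows, col);
        out.resize(rows);
        // The blocking cudaMemcpy waits for the kernel on the default stream.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out.data(), dev_vector, vecBytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dev_matrix);   // cudaFree on a null pointer does nothing
    cudaFree(dev_vector);
    return ok;
}
```

A caller would fill a std::vector<float> of rows * cols elements in row-major order (element (r, c) at index r * cols + c), call extractColumnOnDevice with the desired column, and check the returned bool before using the output vector. Checking each CUDA return value, rather than ignoring it as the question's code does, makes allocation and copy failures visible instead of silently producing garbage.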
