Allocating a 2D array with cudaMallocPitch and copying with cudaMemcpy2D

Question

I'm new to CUDA. I appreciate your help and hope you can help me.

I need to store multiple elements of a 2D array into a vector and then work with the vector, but my code does not work well. When I debug it, I find a mistake in allocating the 2D array on the device with cudaMallocPitch and copying to that array with cudaMemcpy2D. This is my code:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cmath>

#define maxThreads 96

__global__ void extract(int mSize, float* dev_vector, float* dev_matrix, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    while(idx<N)
    {
        dev_vector[idx] = *(dev_matrix+(mSize*idx+N));
        idx += blockDim.x * gridDim.x;
    }
}

int main()
{
    //CPU variables
    int mSize = 5;
    float* matrix;
    int N = 4; // Vector size
    int i,j;
    float* vector;
    int blocks, threads;

    float* dev_matrix;
    float* dev_vector;

    blocks = 1+((N-1)/maxThreads);
    threads = 1+((N-1)/blocks);

    unsigned long int pitch;
    unsigned long int memsize_vector = N*sizeof(float);
    unsigned long int memsize_matrix = mSize*sizeof(float);


    matrix = new float[memsize_matrix*memsize_matrix];
    vector = new float[memsize_vector];

    //Create 2D array
    for(i=0; i<mSize; i++)
        for(j=0; j<mSize; j++)
        {
            matrix[i+mSize*j] = ((i+1)+(j+1));
        }

    printf("\n");
    for (i=0; i<mSize; i++){
        for(j=0; j<mSize; j++){
            printf("% 1.5f ", matrix[i+mSize*j]);
        }
        printf("\n");
    }
    printf("\n");


    cudaMallocPitch((void **)&dev_matrix, &pitch, memsize_matrix, mSize);
    cudaMalloc((void **)&dev_vector, memsize_vector);

    cudaMemcpy2D(dev_matrix, pitch, matrix, memsize_matrix, memsize_matrix, mSize,
                     cudaMemcpyHostToDevice);

    extract<<<blocks,threads>>>(mSize, dev_vector, dev_matrix, N);
    cudaDeviceSynchronize();

    cudaMemcpy(vector, dev_vector, memsize_vector, cudaMemcpyDeviceToHost);

    printf("Vector values are:\n");
    for(i=0; i<N; i++)
        printf(" % 1.5f ", vector[i]);
    printf("\n");

    cudaFree(dev_matrix);
    cudaFree(dev_vector);

}

Recommended answer

There are lots of problems in this code, including but not limited to using array sizes in bytes and sizes in words interchangeably in several places, using incorrect types (note that size_t exists for a very good reason), potential truncation and type casting problems, and more.
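To make the bytes-versus-elements point concrete, here is a minimal host-only sketch. The helper names matrix_elements and row_bytes are hypothetical and exist only for this illustration; the values of mSize and the 400-float figure come from the question's own code, assuming a 4-byte float.

```cuda
#include <cstddef>
#include <cstdio>

// Hypothetical helpers, named only for this sketch.
// Number of float elements in a side x side matrix.
size_t matrix_elements(int side) { return (size_t)side * (size_t)side; }

// Width of one row in bytes: the unit cudaMallocPitch and cudaMemcpy2D expect.
size_t row_bytes(int side) { return (size_t)side * sizeof(float); }

int main()
{
    const int mSize = 5;

    // new[] takes an element count, so 25 floats are enough here.
    float* matrix = new float[matrix_elements(mSize)];

    // The question squares a byte count instead: (5 * 4) * (5 * 4) = 400 floats.
    size_t mistaken = row_bytes(mSize) * row_bytes(mSize);

    printf("elements: %zu, row bytes: %zu, mistaken count: %zu\n",
           matrix_elements(mSize), row_bytes(mSize), mistaken);

    delete[] matrix;
    return 0;
}
```

The over-allocation itself is harmless, but it shows how easily the two units get mixed up. The same care applies to the pitch variable: cudaMallocPitch writes through a size_t*, and unsigned long int is not guaranteed to be the same type (on 64-bit Windows it is only 32 bits wide).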

But the core problem is the addressing of pitched memory inside the kernel, to which you are never even passing the pitch value. Reading the documentation for cudaMallocPitch will give you the correct method for addressing pitched memory inside a kernel. Your kernel might then look like this:

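__global__ void extract(size_t mpitch, float* dev_vector, float* dev_matrix, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    while(idx<N)
    {
        // mpitch is the row stride in bytes, so step to row idx through a char*,
        // then index column N as a float rather than as a byte offset.
        float* row = (float *)((char*)dev_matrix + idx * mpitch);
        dev_vector[idx] = row[N];
        idx += stride;
    }
}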

[disclaimer: never compiled or tested, use at own risk].

You will then have to fix all the problems in the host code to reflect whatever kernel changes you make.
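As a rough sketch of what the repaired host code could look like, the program below allocates the matrix with cudaMallocPitch, uploads it with cudaMemcpy2D, and passes the pitch to the kernel. Several things here are assumptions and not part of the original post: the CUDA_CHECK macro, the extract_column helper, the separate column parameter (the question reuses N as both the element count and the column index), and the row-major host layout.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Assumed helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Copy element `column` of each of the first `count` rows into dev_vector.
// `pitch` is the row stride in BYTES, as returned by cudaMallocPitch.
__global__ void extract(const float* dev_matrix, size_t pitch,
                        float* dev_vector, int count, int column)
{
    int stride = blockDim.x * gridDim.x;
    for (int row = threadIdx.x + blockIdx.x * blockDim.x; row < count; row += stride) {
        // Step to the start of the row in bytes, then index in floats.
        const float* row_ptr = (const float*)((const char*)dev_matrix + row * pitch);
        dev_vector[row] = row_ptr[column];
    }
}

// Hypothetical helper: upload a rows x cols row-major matrix and return
// element `column` of each of the first `count` rows.
std::vector<float> extract_column(const std::vector<float>& matrix,
                                  int rows, int cols, int count, int column)
{
    const size_t row_bytes = cols * sizeof(float);  // row width in bytes
    float* dev_matrix = nullptr;
    float* dev_vector = nullptr;
    size_t pitch = 0;                               // must be size_t

    CUDA_CHECK(cudaMallocPitch((void**)&dev_matrix, &pitch, row_bytes, rows));
    CUDA_CHECK(cudaMalloc((void**)&dev_vector, count * sizeof(float)));

    // Host rows are tightly packed (stride row_bytes); device rows use pitch.
    CUDA_CHECK(cudaMemcpy2D(dev_matrix, pitch, matrix.data(), row_bytes,
                            row_bytes, rows, cudaMemcpyHostToDevice));

    const int threads = 96;
    const int blocks = (count + threads - 1) / threads;
    extract<<<blocks, threads>>>(dev_matrix, pitch, dev_vector, count, column);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    std::vector<float> result(count);
    CUDA_CHECK(cudaMemcpy(result.data(), dev_vector, count * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev_matrix));
    CUDA_CHECK(cudaFree(dev_vector));
    return result;
}

int main()
{
    const int mSize = 5, N = 4, column = 4;

    // Row-major host matrix with the same values as in the question.
    std::vector<float> matrix(mSize * mSize);
    for (int row = 0; row < mSize; row++)
        for (int col = 0; col < mSize; col++)
            matrix[row * mSize + col] = (float)((row + 1) + (col + 1));

    std::vector<float> v = extract_column(matrix, mSize, mSize, N, column);

    printf("Vector values are:\n");
    for (float x : v)
        printf(" % 1.5f ", x);
    printf("\n");
    return 0;
}
```

With these inputs the extracted values should be 6, 7, 8 and 9, since element (row, 4) holds (row + 1) + 5. Checking every runtime call is a design choice for the sketch: a wrong pitch or width argument then surfaces as an explicit error instead of as silently wrong data.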
