CUDA: Copy 1D array from GPU to 2D array on host


Question

int main() {
    char** hMat,* dArr;

    hMat = new char*[10];
    for (int i=0;i<10;i++) {
        hMat[i] = new char[10];
    }
    cudaMalloc((void**)&dArr,100);

    // Copy from dArr to hMat here:

}

I have an array, dArr, on the GPU, and I want to copy it into a 2D array, hMat, on the host, so that the first 10 elements of the GPU array go to the first row of the host matrix, the next 10 elements go to the second row, and so on.

There are some functions in the documentation, namely cudaMemcpy2D and cudaMemcpy2DFromArray, but I'm not quite sure how they should be used.

Recommended answer

Your allocation scheme (an array of pointers, separately allocated) has the potential to create a discontiguous allocation on the host. There are no cudaMemcpy operations of any type (including the ones you mention) that can target an arbitrarily discontiguous area, which your allocation scheme has the potential to create.

In a nutshell, then, your approach is troublesome. It can be made to work, but will require a loop to perform the copying -- essentially one cudaMemcpy operation per "row" of your "2D array". If you choose to do that, presumably you don't need help. It's quite straightforward.
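
For reference, the per-row loop mentioned above might look like the following minimal sketch. It assumes the 10x10 char layout from the question; the cudaMemset call only stands in for whatever actually fills dArr, and the error handling is illustrative rather than complete.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int rows = 10;
    const int cols = 10;

    // Host "2D array" with separately allocated rows, as in the question.
    char** hMat = new char*[rows];
    for (int i = 0; i < rows; i++) hMat[i] = new char[cols];

    char* dArr = nullptr;
    cudaMalloc((void**)&dArr, rows * cols * sizeof(char));
    cudaMemset(dArr, 0, rows * cols * sizeof(char));  // assumption: placeholder data

    // One cudaMemcpy per row: row i receives elements
    // [i*cols, (i+1)*cols) of the flat device array.
    for (int i = 0; i < rows; i++) {
        cudaError_t err = cudaMemcpy(hMat[i], dArr + i * cols,
                                     cols * sizeof(char),
                                     cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) {
            fprintf(stderr, "row %d: %s\n", i, cudaGetErrorString(err));
            return 1;
        }
    }

    cudaFree(dArr);
    for (int i = 0; i < rows; i++) delete[] hMat[i];
    delete[] hMat;
    return 0;
}
```

This issues one transfer per row, so for large matrices the per-call overhead adds up.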

What I will suggest is that you instead modify your host allocation to create an underlying contiguous allocation. Such a region can be handled by a single, ordinary cudaMemcpy call, but you can still treat it as a "2D array" in host code.

The basic idea is to create a single allocation of the correct overall size, then to create a set of pointers to specific places within the single allocation, where each "row" should start. You then reference into this pointer array using your initial double-pointer.

Something like this:

#include <stdio.h>

typedef char mytype;

int main(){

  const int rows = 10;
  const int cols = 10;

  mytype **hMat = new mytype*[rows];
  hMat[0] = new mytype[rows*cols];
  for (int i = 1; i < rows; i++) hMat[i] = hMat[i-1]+cols;

  //initialize "2D array"

  for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
      hMat[i][j] = 0;

  mytype *dArr;
  cudaMalloc(&dArr, rows*cols*sizeof(mytype));

  //copy to device
  cudaMemcpy(dArr, hMat[0], rows*cols*sizeof(mytype), cudaMemcpyHostToDevice);

  //kernel call


  //copy from device
  cudaMemcpy(hMat[0], dArr, rows*cols*sizeof(mytype), cudaMemcpyDeviceToHost);

  //release device and host memory
  cudaFree(dArr);
  delete[] hMat[0];
  delete[] hMat;

  return 0;
}
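
If you would rather not manage the two new[] allocations by hand, the same contiguous layout can be kept in std::vector. The sketch below is host-only C++; make_row_pointers is a hypothetical helper name introduced for illustration, not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds a table of row pointers into one
// contiguous buffer of rows*cols elements.
template <typename T>
std::vector<T*> make_row_pointers(std::vector<T>& flat, int rows, int cols) {
    assert(flat.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<T*> rowPtr(rows);
    for (int i = 0; i < rows; i++)
        rowPtr[i] = flat.data() + static_cast<std::size_t>(i) * cols;
    return rowPtr;
}

// Usage (host side):
//   std::vector<char> flat(rows * cols, 0);
//   std::vector<char*> hMat = make_row_pointers(flat, rows, cols);
//   hMat[2][3] = 'x';   // same element as flat[2 * cols + 3]
```

Here hMat[i][j] aliases flat[i*cols + j], and flat.data() is the pointer to hand to a single cudaMemcpy covering rows*cols*sizeof(T) bytes. The row pointers become invalid if flat is resized.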
