CUDA: Copy 1D array from GPU to 2D array on host
Question
int main() {
    char **hMat, *dArr;
    hMat = new char*[10];
    for (int i = 0; i < 10; i++) {
        hMat[i] = new char[10];
    }
    cudaMalloc((void**)&dArr, 100);
    // Copy from dArr to hMat here:
}
I have an array, dArr, on the GPU, and I want to copy it into a 2D array, hMat, on the host, where the first 10 fields in the GPU array are copied to the first row in the host matrix, the next 10 fields are copied to the second row, and so on.
There are some functions in the documentation, namely cudaMemcpy2D and cudaMemcpy2DFromArray, but I'm not quite sure how they should be used.
Recommended answer
Your allocation scheme (an array of pointers, each row separately allocated) has the potential to create a discontiguous allocation on the host. There are no cudaMemcpy operations of any type (including the ones you mention) that can target an arbitrarily discontiguous area, which your allocation scheme has the potential to create.
In a nutshell, then, your approach is troublesome. It can be made to work, but will require a loop to perform the copying: essentially one cudaMemcpy operation per "row" of your "2D array". If you choose to do that, presumably you don't need help. It's quite straightforward.
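For completeness, the per-row loop could be sketched roughly as follows. This is only an illustration under the question's layout (a flat device array of rows*cols chars and separately allocated host rows); the helper name copyDeviceToHostRows and its parameters are hypothetical, not part of the CUDA API.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper (name and signature are assumptions for illustration):
// copies a flat device array of rows*cols chars into separately allocated
// host rows, issuing one cudaMemcpy per row.
cudaError_t copyDeviceToHostRows(char **hMat, const char *dArr,
                                 std::size_t rows, std::size_t cols) {
    for (std::size_t i = 0; i < rows; ++i) {
        // Row i starts at offset i*cols in the flat device array.
        cudaError_t err = cudaMemcpy(hMat[i], dArr + i * cols,
                                     cols * sizeof(char),
                                     cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) return err;  // stop at the first failure
    }
    return cudaSuccess;
}
```

With the variables from the question, the call would be copyDeviceToHostRows(hMat, dArr, 10, 10). Each row costs a separate transfer, which is one reason to prefer a contiguous host allocation.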
What I will suggest is that you instead modify your host allocation to create an underlying contiguous allocation. Such a region can be handled by a single, ordinary cudaMemcpy call, but you can still treat it as a "2D array" in host code.
The basic idea is to create a single allocation of the correct overall size, then to create a set of pointers to specific places within the single allocation, where each "row" should start. You then reference into this pointer array using your initial double-pointer.
Something like this:
#include <stdio.h>

typedef char mytype;

int main() {
    const int rows = 10;
    const int cols = 10;

    // one array of row pointers, plus one contiguous block for all elements
    mytype **hMat = new mytype*[rows];
    hMat[0] = new mytype[rows*cols];
    // point each row at its offset inside the contiguous block
    for (int i = 1; i < rows; i++) hMat[i] = hMat[i-1] + cols;

    // initialize "2D array"
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            hMat[i][j] = 0;

    mytype *dArr;
    cudaMalloc(&dArr, rows*cols*sizeof(mytype));

    // copy to device
    cudaMemcpy(dArr, hMat[0], rows*cols*sizeof(mytype), cudaMemcpyHostToDevice);

    // kernel call

    // copy from device
    cudaMemcpy(hMat[0], dArr, rows*cols*sizeof(mytype), cudaMemcpyDeviceToHost);

    // release device and host memory
    cudaFree(dArr);
    delete[] hMat[0];
    delete[] hMat;
    return 0;
}
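A single cudaMemcpy suffices here because element (i, j) of the "2D array" lives at offset i*cols + j from hMat[0]. The host-only sketch below isolates that layout so it can be checked without a GPU; the helper names makeContiguous2D and freeContiguous2D are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: builds a table of row pointers over one contiguous
// block of rows*cols chars (zero-initialized).
char **makeContiguous2D(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);          // rowPtrs[0] must exist
    char **rowPtrs = new char*[rows];
    rowPtrs[0] = new char[rows * cols]();  // the single underlying allocation
    for (std::size_t i = 1; i < rows; ++i)
        rowPtrs[i] = rowPtrs[i - 1] + cols; // row i starts cols past row i-1
    return rowPtrs;
}

// Hypothetical helper: releases the block first, then the pointer table.
void freeContiguous2D(char **rowPtrs) {
    delete[] rowPtrs[0];
    delete[] rowPtrs;
}
```

Under this layout, rowPtrs[0] is the pointer to hand to cudaMemcpy, while host code keeps indexing with rowPtrs[i][j].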