CUDA, Using 2D and 3D Arrays


Problem description

There are a lot of questions online about allocating, copying, indexing, etc 2d and 3d arrays on CUDA. I'm getting a lot of conflicting answers so I'm attempting to compile past questions to see if I can ask the right ones.

第一个链接: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/

Problem: Allocating a 2d array of pointers

User solution: use mallocPitch

"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)

"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu

Second link:

Problem: Allocating space on host and passing it to device

Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.

Third link: Allocating 2D array on device memory in CUDA

Problem: Allocating and transferring 2d arrays

User solution: use mallocPitch

Other solution: flatten it

Fourth link: How to use 2D arrays in CUDA?

Problem: Allocate and traverse 2d arrays

Submitted solution: Does not show allocation

Other solution: squash it

There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.

Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row yet the functions mallocPitch and memcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?

The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.

Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double bracket operator.

Also, according to this link: Copying an object to the device?

And the sub link answer: cudaMemcpy segmentation fault

This one is a bit murkier.

The classes I want to use CUDA with all have 2/3d arrays and wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?

I know I've asked a lot but in summary should I get used to squashed arrays as a fact of life or can I use the 2d allocate and copy functions without getting bad overhead like in the solution where alloc and cpy are called in a for loop?

Recommended answer

Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.

cudaMallocPitch/cudaMemcpy2D:

First, the cuda runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted, or doubly dereferenced. Here is one of many questions on this, here is a fully worked example usage, and another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these functions is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
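
As a hedged sketch of what pitched usage looks like (the kernel name, dimensions, and values here are illustrative, and error checking is trimmed for brevity), note that the allocation is a single pointer and that the pitch is in bytes:

```cuda
// Sketch: cudaMallocPitch / cudaMemcpy2D with a single-pointer pitched
// allocation. Names and sizes are illustrative; error checks omitted.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *data, size_t pitch, int rows, int cols)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) {
        // pitch is in bytes, so step between rows through a char* cast
        float *row = (float *)((char *)data + r * pitch);
        row[c] += 1.0f;
    }
}

int main()
{
    const int rows = 4, cols = 3;
    float h[rows][cols] = {};              // contiguous host 2D array

    float *d = nullptr;                    // note: single pointer, not float**
    size_t pitch = 0;
    cudaMallocPitch((void **)&d, &pitch, cols * sizeof(float), rows);
    cudaMemcpy2D(d, pitch, h, cols * sizeof(float),
                 cols * sizeof(float), rows, cudaMemcpyHostToDevice);

    addOne<<<1, dim3(cols, rows)>>>(d, pitch, rows, cols);

    cudaMemcpy2D(h, cols * sizeof(float), d, pitch,
                 cols * sizeof(float), rows, cudaMemcpyDeviceToHost);
    printf("h[0][0] = %f\n", h[0][0]);
    cudaFree(d);
    return 0;
}
```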

General dynamically allocated 2D case:

If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:

  • there are additional, non-trivial complexities
  • the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1

(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
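
A hedged sketch of the general double-pointer mechanics those caveats refer to (names are illustrative, error checking omitted): one device allocation per row, plus a device-resident array of row pointers.

```cuda
// Sketch of the general 2D (double-pointer) approach: per-row device
// allocations plus a device array of row pointers. Illustrative only.
#include <cuda_runtime.h>

__global__ void scale(float **data, int rows, int cols, float f)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        data[r][c] *= f;                   // two dereferences per access
}

int main()
{
    const int rows = 4, cols = 8;
    float *h_rows[rows];                   // host copy of device row pointers
    for (int r = 0; r < rows; ++r)
        cudaMalloc(&h_rows[r], cols * sizeof(float));

    float **d_data = nullptr;              // device array of row pointers
    cudaMalloc(&d_data, rows * sizeof(float *));
    cudaMemcpy(d_data, h_rows, rows * sizeof(float *),
               cudaMemcpyHostToDevice);

    scale<<<1, dim3(cols, rows)>>>(d_data, rows, cols, 2.0f);
    cudaDeviceSynchronize();

    for (int r = 0; r < rows; ++r) cudaFree(h_rows[r]);
    cudaFree(d_data);
    return 0;
}
```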

Flattening:

If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!). However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
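
A minimal sketch of the flattened approach (names and sizes are illustrative): one contiguous allocation, indexed with simulated 2D arithmetic.

```cuda
// Sketch of "flattening": one contiguous device allocation indexed
// with simulated 2D arithmetic. Illustrative only; error checks omitted.
#include <cuda_runtime.h>

__global__ void scale(float *data, int rows, int cols, float f)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        data[r * cols + c] *= f;       // simulated 2D access, one dereference
}

int main()
{
    const int rows = 4, cols = 8;
    float *d = nullptr;
    cudaMalloc(&d, rows * cols * sizeof(float));   // single flat allocation
    scale<<<1, dim3(cols, rows)>>>(d, rows, cols, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```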

General dynamically allocated 3D case:

As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so even less efficient. Here is a fully worked example (2nd code example).
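
The flattened alternative extends naturally to 3D; a hedged kernel sketch (names illustrative) with simulated 3D indexing over a single contiguous allocation:

```cuda
// Sketch: simulated 3D indexing over one flat allocation, avoiding the
// triple pointer chase of the general 3D case. Illustrative only.
__global__ void fill3d(float *data, int nz, int ny, int nx)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz)
        data[(z * ny + y) * nx + x] = 1.0f;  // one dereference, no pointer chase
}
```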

Special case: array width known at compile time:

Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile-time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and in this case we can use doubly-subscripted access with considerably less complexity than the general case, and there is no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The already-mentioned answer here (first code example) gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
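
A hedged 2D sketch of this special case (the constant, kernel name, and sizes are illustrative): casting a flat device allocation to a pointer-to-array type lets the compiler compute the indexing, so data[r][c] costs only one dereference.

```cuda
// Sketch of the compile-time-known-width special case: a pointer-to-array
// type gives doubly-subscripted access with a single dereference.
#include <cuda_runtime.h>

const int COLS = 8;                        // width known at compile time

__global__ void fill(float (*data)[COLS], int rows)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < COLS)
        data[r][c] = r * COLS + c;         // doubly-subscripted, 1 dereference
}

int main()
{
    const int rows = 4;
    float (*d)[COLS] = nullptr;
    cudaMalloc((void **)&d, rows * COLS * sizeof(float));
    fill<<<1, dim3(COLS, rows)>>>(d, rows);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```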

Doubly-subscripted host code, singly-subscripted device code:

Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off a flat allocation and a manually-created pointer "tree". However, this would have approximately the same issues as the general dynamically allocated 2D method given above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would probably necessitate an additional cudaMemcpy operation).
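
A hedged sketch of this mixed approach (names illustrative, error checking omitted): one contiguous host allocation, a host-side pointer "tree" for a[r][c] access, and the flat buffer passed to the device.

```cuda
// Sketch: contiguous host allocation + host pointer "tree" for 2D access,
// while the device sees only the flat buffer. Illustrative only.
#include <cuda_runtime.h>

int main()
{
    const int rows = 4, cols = 8;
    float *flat = new float[rows * cols];     // one contiguous allocation
    float **a = new float *[rows];            // host-side pointer "tree"
    for (int r = 0; r < rows; ++r)
        a[r] = flat + r * cols;

    a[2][3] = 5.0f;                           // doubly-subscripted on the host

    float *d = nullptr;                       // device gets the flat buffer
    cudaMalloc(&d, rows * cols * sizeof(float));
    cudaMemcpy(d, flat, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
    // ... device code indexes d[r * cols + c] ...

    cudaFree(d);
    delete[] a;
    delete[] flat;
    return 0;
}
```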

From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
