CUDA, Using 2D and 3D Arrays


Problem description

There are a lot of questions online about allocating, copying, indexing, etc 2d and 3d arrays on CUDA. I'm getting a lot of conflicting answers so I'm attempting to compile past questions to see if I can ask the right ones.

第一个链接: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/

Problem: Allocating a 2d array of pointers

User solution: use mallocPitch

"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)

"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu

Second link:

Problem: Allocating space on host and passing it to device

Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.

Third link: Allocating 2D array on device memory in CUDA

Problem: Allocating and transferring 2d arrays

User solution: use mallocPitch

Other solution: flatten it

Fourth link: How to use 2D arrays in CUDA?

Problem: Allocate and traverse 2d arrays

Submitted solution: Does not show allocation

Other solution: squash it

There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.

Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row yet the functions mallocPitch and memcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?

The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.

Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double bracket operator.

Also, according to this link: Copying an object to the device?

And the sub link answer: cudaMemcpy segmentation fault

This one is a bit murkier.

The classes I want to use CUDA with all have 2/3d arrays and wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?

I know I've asked a lot but in summary should I get used to squashed arrays as a fact of life or can I use the 2d allocate and copy functions without getting bad overhead like in the solution where alloc and cpy are called in a for loop?

Recommended answer

Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.

cudaMallocPitch/cudaMemcpy2D:

First, the cuda runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted, or doubly dereferenced. Here is one of many questions on this, here is a fully worked example usage, and another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these functions is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
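
As a hedged sketch of what pitched usage looks like (the kernel name, dimensions, and values here are illustrative, and error checking is trimmed for brevity), note that the allocation is a single pointer and that the pitch is in bytes:

```cuda
// Sketch: cudaMallocPitch / cudaMemcpy2D with a single-pointer pitched
// allocation. Names and sizes are illustrative; error checks omitted.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *data, size_t pitch, int rows, int cols)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) {
        // pitch is in bytes, so step between rows through a char* cast
        float *row = (float *)((char *)data + r * pitch);
        row[c] += 1.0f;
    }
}

int main()
{
    const int rows = 4, cols = 3;
    float h[rows][cols] = {};              // contiguous host 2D array

    float *d = nullptr;                    // note: single pointer, not float**
    size_t pitch = 0;
    cudaMallocPitch((void **)&d, &pitch, cols * sizeof(float), rows);
    cudaMemcpy2D(d, pitch, h, cols * sizeof(float),
                 cols * sizeof(float), rows, cudaMemcpyHostToDevice);

    addOne<<<1, dim3(cols, rows)>>>(d, pitch, rows, cols);

    cudaMemcpy2D(h, cols * sizeof(float), d, pitch,
                 cols * sizeof(float), rows, cudaMemcpyDeviceToHost);
    printf("h[0][0] = %f\n", h[0][0]);
    cudaFree(d);
    return 0;
}
```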

General dynamically allocated 2D case:

If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:

  • there are additional, non-trivial complexities
  • the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1

(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
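
A hedged sketch of the general double-pointer mechanics those caveats refer to (names are illustrative, error checking omitted): one device allocation per row, plus a device-resident array of row pointers.

```cuda
// Sketch of the general 2D (double-pointer) approach: per-row device
// allocations plus a device array of row pointers. Illustrative only.
#include <cuda_runtime.h>

__global__ void scale(float **data, int rows, int cols, float f)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        data[r][c] *= f;                   // two dereferences per access
}

int main()
{
    const int rows = 4, cols = 8;
    float *h_rows[rows];                   // host copy of device row pointers
    for (int r = 0; r < rows; ++r)
        cudaMalloc(&h_rows[r], cols * sizeof(float));

    float **d_data = nullptr;              // device array of row pointers
    cudaMalloc(&d_data, rows * sizeof(float *));
    cudaMemcpy(d_data, h_rows, rows * sizeof(float *),
               cudaMemcpyHostToDevice);

    scale<<<1, dim3(cols, rows)>>>(d_data, rows, cols, 2.0f);
    cudaDeviceSynchronize();

    for (int r = 0; r < rows; ++r) cudaFree(h_rows[r]);
    cudaFree(d_data);
    return 0;
}
```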

Flattening:

If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!). However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
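
A minimal sketch of the flattened approach (names and sizes are illustrative): one contiguous allocation, indexed with simulated 2D arithmetic.

```cuda
// Sketch of "flattening": one contiguous device allocation indexed
// with simulated 2D arithmetic. Illustrative only; error checks omitted.
#include <cuda_runtime.h>

__global__ void scale(float *data, int rows, int cols, float f)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols)
        data[r * cols + c] *= f;       // simulated 2D access, one dereference
}

int main()
{
    const int rows = 4, cols = 8;
    float *d = nullptr;
    cudaMalloc(&d, rows * cols * sizeof(float));   // single flat allocation
    scale<<<1, dim3(cols, rows)>>>(d, rows, cols, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```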

General dynamically allocated 3D case:

As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so even less efficient. Here is a fully worked example (2nd code example).
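
The flattened alternative extends naturally to 3D; a hedged kernel sketch (names illustrative) with simulated 3D indexing over a single contiguous allocation:

```cuda
// Sketch: simulated 3D indexing over one flat allocation, avoiding the
// triple pointer chase of the general 3D case. Illustrative only.
__global__ void fill3d(float *data, int nz, int ny, int nx)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz)
        data[(z * ny + y) * nx + x] = 1.0f;  // one dereference, no pointer chase
}
```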

Special case: array width known at compile time:

Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile-time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and in this case we can use doubly-subscripted access with considerably less complexity than the general case, and there is no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The already-mentioned answer here (first code example) gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
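
A hedged 2D sketch of this special case (the constant, kernel name, and sizes are illustrative): casting a flat device allocation to a pointer-to-array type lets the compiler compute the indexing, so data[r][c] costs only one dereference.

```cuda
// Sketch of the compile-time-known-width special case: a pointer-to-array
// type gives doubly-subscripted access with a single dereference.
#include <cuda_runtime.h>

const int COLS = 8;                        // width known at compile time

__global__ void fill(float (*data)[COLS], int rows)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < COLS)
        data[r][c] = r * COLS + c;         // doubly-subscripted, 1 dereference
}

int main()
{
    const int rows = 4;
    float (*d)[COLS] = nullptr;
    cudaMalloc((void **)&d, rows * COLS * sizeof(float));
    fill<<<1, dim3(COLS, rows)>>>(d, rows);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```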

Doubly-subscripted host code, singly-subscripted device code:

Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off a flat allocation and a manually-created pointer "tree". However, this would have approximately the same issues as the general dynamically allocated 2D method given above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would probably necessitate an additional cudaMemcpy operation).
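
A hedged sketch of this mixed approach (names illustrative, error checking omitted): one contiguous host allocation, a host-side pointer "tree" for a[r][c] access, and the flat buffer passed to the device.

```cuda
// Sketch: contiguous host allocation + host pointer "tree" for 2D access,
// while the device sees only the flat buffer. Illustrative only.
#include <cuda_runtime.h>

int main()
{
    const int rows = 4, cols = 8;
    float *flat = new float[rows * cols];     // one contiguous allocation
    float **a = new float *[rows];            // host-side pointer "tree"
    for (int r = 0; r < rows; ++r)
        a[r] = flat + r * cols;

    a[2][3] = 5.0f;                           // doubly-subscripted on the host

    float *d = nullptr;                       // device gets the flat buffer
    cudaMalloc(&d, rows * cols * sizeof(float));
    cudaMemcpy(d, flat, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
    // ... device code indexes d[r * cols + c] ...

    cudaFree(d);
    delete[] a;
    delete[] flat;
    return 0;
}
```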

From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
