CUDA, Using 2D and 3D Arrays


Problem Description

There are a lot of questions online about allocating, copying, indexing, etc., 2d and 3d arrays in CUDA. I'm getting a lot of conflicting answers, so I'm attempting to compile past questions to see if I can ask the right ones.

First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/

Problem: Allocating a 2d array of pointers

User solution: use mallocPitch

"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)

"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu

Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/

Problem: Allocating space on host and passing it to device

Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/

Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.

Third link: Allocate 2D Array on Device Memory in CUDA

Problem: Allocating and transferring 2d arrays

User solution: use mallocPitch

Other solution: flatten it

Fourth link: How to use 2D Arrays in CUDA?

Problem: Allocate and traverse 2d arrays

Submitted solution: Does not show allocation

Other solution: squash it

There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.

Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row, yet the functions mallocPitch and memcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?

The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.

Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double bracket operator.
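
(For what it's worth, a double-bracket interface over flat 1D storage is possible if the first operator[] returns a row pointer; a minimal host-side sketch, with hypothetical names, not from any of the linked answers:)

    #include <vector>
    #include <cstddef>

    class Matrix {
        std::vector<float> data;   // flat 1D storage
        std::size_t cols;
    public:
        Matrix(std::size_t r, std::size_t c) : data(r * c), cols(c) {}
        // m[i] yields a pointer to row i, so m[i][j] is plain pointer indexing
        float       *operator[](std::size_t i)       { return data.data() + i * cols; }
        const float *operator[](std::size_t i) const { return data.data() + i * cols; }
        float       *flat()                          { return data.data(); } // e.g. for cudaMemcpy
    };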

Also according to this link: Copy an object to device?

and the sub link answer: cudaMemcpy segmentation fault

This gets a little iffy.

The classes I want to use CUDA with all have 2/3d arrays, and wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?

I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2d allocate and copy functions without getting bad overhead like in the solution where alloc and cpy are called in a for loop?

Solution

Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.

cudaMallocPitch/cudaMemcpy2D:

First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation, and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted, or doubly dereferenced. For additional example usage, here is one of many questions on this. Here is a fully worked example usage. Another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
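
A minimal sketch of the pitched-allocation pattern (my own illustration, not from the linked examples; error checking omitted): the pitch returned by cudaMallocPitch is a row stride in bytes, and both the kernel and the 2D copies use it through a single pointer.

    #include <cuda_runtime.h>

    __global__ void scale(float *d_data, size_t pitch, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // single-pointer access: step y rows of `pitch` bytes, then index the row
            float *row = (float *)((char *)d_data + y * pitch);
            row[x] *= 2.0f;
        }
    }

    int main()
    {
        const int width = 64, height = 64;
        float *h_data = new float[width * height]();   // ordinary flat host buffer
        float *d_data; size_t pitch;
        cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);
        cudaMemcpy2D(d_data, pitch, h_data, width * sizeof(float),
                     width * sizeof(float), height, cudaMemcpyHostToDevice);
        dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
        scale<<<grid, block>>>(d_data, pitch, width, height);
        cudaMemcpy2D(h_data, width * sizeof(float), d_data, pitch,
                     width * sizeof(float), height, cudaMemcpyDeviceToHost);
        cudaFree(d_data); delete[] h_data;
        return 0;
    }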

general, dynamically allocated 2D case:

If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the CUDA tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:

  • there is additional, non-trivial complexity
  • the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.

(note that allocating an array of objects, where each object has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration of that)
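
For reference, the mechanics that answer describes look roughly like this (a condensed sketch of my own, error checking omitted; note the two dereferences in the kernel):

    #include <cuda_runtime.h>

    __global__ void set2d(float **data, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y][x] = 1.0f;   // dereference data[y] (a row pointer), then index [x]
    }

    int main()
    {
        const int width = 64, height = 64;
        // allocate each row on the device, collecting the row pointers on the host
        float **h_rows = new float *[height];
        for (int y = 0; y < height; ++y)
            cudaMalloc(&h_rows[y], width * sizeof(float));
        // copy the array of row pointers to the device
        float **d_data;
        cudaMalloc(&d_data, height * sizeof(float *));
        cudaMemcpy(d_data, h_rows, height * sizeof(float *), cudaMemcpyHostToDevice);
        dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
        set2d<<<grid, block>>>(d_data, width, height);
        cudaDeviceSynchronize();
        for (int y = 0; y < height; ++y) cudaFree(h_rows[y]);
        cudaFree(d_data); delete[] h_rows;
        return 0;
    }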

Also, here is a thrust method for building a general dynamically allocated 2D array.

flattening:

If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
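
A minimal sketch of the flattened pattern (my own illustration): one flat allocation, with the 2D index computed in-line:

    #include <cuda_runtime.h>

    __global__ void set2d_flat(float *data, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y * width + x] = 1.0f;   // "simulated" 2D access: one dereference
    }

    int main()
    {
        const int width = 64, height = 64;
        float *d_data;
        cudaMalloc(&d_data, width * height * sizeof(float));   // one flat allocation
        dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
        set2d_flat<<<grid, block>>>(d_data, width, height);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }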

general, dynamically allocated 3D case:

As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).
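
(If you instead flatten in 3D, the index arithmetic is a direct extension of the 2D case; a minimal helper of my own, assuming x varies fastest:)

    __host__ __device__ inline size_t idx3(size_t x, size_t y, size_t z,
                                           size_t width, size_t height)
    {
        // flat offset into a contiguous width x height x depth volume
        return (z * height + y) * width + x;
    }
    // in a kernel: data[idx3(x, y, z, width, height)] = ...;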

special case: array width known at compile time:

Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and in this case we can use doubly-subscripted access with considerably less complexity than the general case, and there is no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The first code example in the already-mentioned answer here (first code example) gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
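
A minimal sketch of that special case (my own illustration): with the width fixed at compile time, a pointer-to-array parameter lets the compiler fold the index arithmetic into the type, so data[y][x] costs only one dereference:

    #include <cuda_runtime.h>

    const int W = 16;   // width known at compile time

    __global__ void set2d_fixed(float (*data)[W], int height)
    {
        int x = threadIdx.x;
        int y = blockIdx.x;
        if (x < W && y < height)
            data[y][x] = 1.0f;   // the type encodes the row stride: one dereference
    }

    int main()
    {
        const int height = 64;
        float (*d_data)[W];                                // pointer to rows of W floats
        cudaMalloc(&d_data, height * W * sizeof(float));   // still one flat allocation
        set2d_fixed<<<height, W>>>(d_data, height);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }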

doubly-subscripted host code, singly-subscripted device code:

Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based on a flat allocation and a manually-created pointer "tree"; however, this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would probably necessitate an additional cudaMemcpy operation).
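
A condensed sketch of the host-side arrangement just described (my own illustration, names are illustrative; d_data is assumed to be an existing device allocation of width*height floats):

    #include <cuda_runtime.h>

    void example(int width, int height, float *d_data)
    {
        float *flat = new float[height * width];   // one contiguous buffer holds all the data
        float **h = new float *[height];           // the pointer "tree": just row pointers here
        for (int y = 0; y < height; ++y)
            h[y] = flat + y * width;

        h[2][3] = 5.0f;   // doubly-subscripted access in host code

        // the flat buffer still transfers in a single copy; device code then
        // uses singly-subscripted ("simulated 2D") indexing, e.g. d_data[y * width + x]
        cudaMemcpy(d_data, flat, height * width * sizeof(float), cudaMemcpyHostToDevice);

        delete[] h; delete[] flat;
    }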

From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
