CUDA, Using 2D and 3D Arrays


Problem Description

There are a lot of questions online about allocating, copying, indexing, etc., 2d and 3d arrays in CUDA. I'm getting a lot of conflicting answers, so I'm attempting to compile past questions to see if I can ask the right ones.

First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/

Problem: Allocating a 2d array of pointers

User solution: use mallocPitch

"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)

"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu

Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/

Problem: Allocating space on host and passing it to device

Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/

Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.

Third link: Allocate 2D Array on Device Memory in CUDA

Problem: Allocating and transferring 2d arrays

User solution: use mallocPitch

Other solution: flatten it

Fourth link: How to use 2D Arrays in CUDA?

Problem: Allocate and traverse 2d arrays

Submitted solution: Does not show allocation

Other solution: squash it

There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.

Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row, yet the functions mallocPitch and memcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?

The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.

Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double bracket operator.
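
(For what it's worth, a double-bracket interface over flat 1D storage is possible if the first operator[] returns a row pointer; a minimal host-side sketch, with hypothetical names, not from any of the linked answers:)

    #include <vector>
    #include <cstddef>

    class Matrix {
        std::vector<float> data;   // flat 1D storage
        std::size_t cols;
    public:
        Matrix(std::size_t r, std::size_t c) : data(r * c), cols(c) {}
        // m[i] yields a pointer to row i, so m[i][j] is plain pointer indexing
        float       *operator[](std::size_t i)       { return data.data() + i * cols; }
        const float *operator[](std::size_t i) const { return data.data() + i * cols; }
        float       *flat()                          { return data.data(); } // e.g. for cudaMemcpy
    };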

Also according to this link: Copy an object to device?

and the sub link answer: cudaMemcpy segmentation fault

This gets a little iffy.

The classes I want to use CUDA with all have 2/3d arrays, and wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?

I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2d allocate and copy functions without getting bad overhead like in the solution where alloc and cpy are called in a for loop?

Solution

Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.

cudaMallocPitch/cudaMemcpy2D:

First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation, and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted, or doubly dereferenced. For additional example usage, here is one of many questions on this. Here is a fully worked example usage. Another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
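
A minimal sketch of the pitched-allocation pattern (my own illustration, not from the linked examples; error checking omitted): the pitch returned by cudaMallocPitch is a row stride in bytes, and both the kernel and the 2D copies use it through a single pointer.

    #include <cuda_runtime.h>

    __global__ void scale(float *d_data, size_t pitch, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // single-pointer access: step y rows of `pitch` bytes, then index the row
            float *row = (float *)((char *)d_data + y * pitch);
            row[x] *= 2.0f;
        }
    }

    int main()
    {
        const int width = 64, height = 64;
        float *h_data = new float[width * height]();   // ordinary flat host buffer
        float *d_data; size_t pitch;
        cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);
        cudaMemcpy2D(d_data, pitch, h_data, width * sizeof(float),
                     width * sizeof(float), height, cudaMemcpyHostToDevice);
        dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
        scale<<<grid, block>>>(d_data, pitch, width, height);
        cudaMemcpy2D(h_data, width * sizeof(float), d_data, pitch,
                     width * sizeof(float), height, cudaMemcpyDeviceToHost);
        cudaFree(d_data); delete[] h_data;
        return 0;
    }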

general, dynamically allocated 2D case:

If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the CUDA tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:

  • there is additional, non-trivial complexity
  • the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.

(note that allocating an array of objects, where each object has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration of that)
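
For reference, the mechanics that answer describes look roughly like this (a condensed sketch of my own, error checking omitted; note the two dereferences in the kernel):

    #include <cuda_runtime.h>

    __global__ void set2d(float **data, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y][x] = 1.0f;   // dereference data[y] (a row pointer), then index [x]
    }

    int main()
    {
        const int width = 64, height = 64;
        // allocate each row on the device, collecting the row pointers on the host
        float **h_rows = new float *[height];
        for (int y = 0; y < height; ++y)
            cudaMalloc(&h_rows[y], width * sizeof(float));
        // copy the array of row pointers to the device
        float **d_data;
        cudaMalloc(&d_data, height * sizeof(float *));
        cudaMemcpy(d_data, h_rows, height * sizeof(float *), cudaMemcpyHostToDevice);
        dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
        set2d<<<grid, block>>>(d_data, width, height);
        cudaDeviceSynchronize();
        for (int y = 0; y < height; ++y) cudaFree(h_rows[y]);
        cudaFree(d_data); delete[] h_rows;
        return 0;
    }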

Also, here is a thrust method for building a general dynamically allocated 2D array.

flattening:

If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
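
A minimal sketch of the flattened pattern (my own illustration): one flat allocation, with the 2D index computed in-line:

    #include <cuda_runtime.h>

    __global__ void set2d_flat(float *data, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y * width + x] = 1.0f;   // "simulated" 2D access: one dereference
    }

    int main()
    {
        const int width = 64, height = 64;
        float *d_data;
        cudaMalloc(&d_data, width * height * sizeof(float));   // one flat allocation
        dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
        set2d_flat<<<grid, block>>>(d_data, width, height);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }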

general, dynamically allocated 3D case:

As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).
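
(If you instead flatten in 3D, the index arithmetic is a direct extension of the 2D case; a minimal helper of my own, assuming x varies fastest:)

    __host__ __device__ inline size_t idx3(size_t x, size_t y, size_t z,
                                           size_t width, size_t height)
    {
        // flat offset into a contiguous width x height x depth volume
        return (z * height + y) * width + x;
    }
    // in a kernel: data[idx3(x, y, z, width, height)] = ...;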

special case: array width known at compile time:

Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and in this case we can use doubly-subscripted access with considerably less complexity than the general case, and there is no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The first code example in the already-mentioned answer here (first code example) gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
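
A minimal sketch of that special case (my own illustration): with the width fixed at compile time, a pointer-to-array parameter lets the compiler fold the index arithmetic into the type, so data[y][x] costs only one dereference:

    #include <cuda_runtime.h>

    const int W = 16;   // width known at compile time

    __global__ void set2d_fixed(float (*data)[W], int height)
    {
        int x = threadIdx.x;
        int y = blockIdx.x;
        if (x < W && y < height)
            data[y][x] = 1.0f;   // the type encodes the row stride: one dereference
    }

    int main()
    {
        const int height = 64;
        float (*d_data)[W];                                // pointer to rows of W floats
        cudaMalloc(&d_data, height * W * sizeof(float));   // still one flat allocation
        set2d_fixed<<<height, W>>>(d_data, height);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }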

doubly-subscripted host code, singly-subscripted device code:

Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based on a flat allocation and a manually-created pointer "tree"; however, this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would probably necessitate an additional cudaMemcpy operation).
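
A condensed sketch of the host-side arrangement just described (my own illustration, names are illustrative; d_data is assumed to be an existing device allocation of width*height floats):

    #include <cuda_runtime.h>

    void example(int width, int height, float *d_data)
    {
        float *flat = new float[height * width];   // one contiguous buffer holds all the data
        float **h = new float *[height];           // the pointer "tree": just row pointers here
        for (int y = 0; y < height; ++y)
            h[y] = flat + y * width;

        h[2][3] = 5.0f;   // doubly-subscripted access in host code

        // the flat buffer still transfers in a single copy; device code then
        // uses singly-subscripted ("simulated 2D") indexing, e.g. d_data[y * width + x]
        cudaMemcpy(d_data, flat, height * width * sizeof(float), cudaMemcpyHostToDevice);

        delete[] h; delete[] flat;
    }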

From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
