How and when should I use pitched pointers with the CUDA API?


Problem description


I have quite a good understanding of how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially the pitched pointers that are always present when dealing with 2D/3D arrays. The documentation provides a couple of examples of how to use them, but it assumes that I am familiar with the notions of padding and pitch, which I am not.


I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:


What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?

Answer


Here is an explanation of pitched pointers and padding in CUDA.


First, let's start with the reason non-linear memory exists. When allocating memory with cudaMalloc, the result is like an allocation with malloc: we get a contiguous memory chunk of the specified size, and we can put anything we want in it. If we want to allocate a vector of 10000 floats, we simply do:

float* myVector;
cudaMalloc(&myVector,10000*sizeof(float));


and then access the i-th element of myVector by classic indexing:

float element = myVector[i];


and if we want to access the next element, we just do:

float next_element = myVector[i+1];


This works very well because accessing an element right next to the previous one is (for reasons I am not aware of and don't wish to be for now) cheap.


Things become a little different when we use our memory as a 2D array. Let's say our 10000-float vector is in fact a 100x100 array. We can allocate it using the same cudaMalloc function, and if we want to read the i-th row, we do:

float* myArray;
cudaMalloc(&myArray, 10000*sizeof(float));
float row[100]; // number of columns
for (int j=0; j<100; ++j)
    row[j] = myArray[i*100+j]; // conceptually -- device memory is read like this from a kernel



Word alignment

So we have to read memory from myArray+100*i to myArray+100*(i+1)-1. The number of memory access operations this takes depends on the number of memory words the row occupies. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must ensure that the row starts at the beginning of a word; hence we must pad the memory of every row until the start of a new word.


Another reason for padding arrays is the bank mechanism in CUDA, concerning shared memory access. When an array is in shared memory, it is split into several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same bank. Since we usually want to process each row in parallel, we can ensure simultaneous access by padding each row to the start of a new bank.


Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:

size_t pitch;
float* myArray;
cudaMallocPitch(&myArray, &pitch, 100*sizeof(float), 100); // width in bytes, height in rows


Note that the pitch here is the return value of the function: cudaMallocPitch checks what it should be on your system and returns the appropriate value. What cudaMallocPitch does is the following:



  1. Allocate the first row.
  2. Check whether the number of bytes allocated makes it correctly aligned (i.e. it is a multiple of 128).
  3. If not, allocate further bytes to reach the next multiple of 128. The pitch is then the number of bytes allocated for a single row, including the extra (padding) bytes.
  4. Reiterate for each row.


In the end, we have typically allocated more memory than necessary, because each row is now pitch bytes long instead of w*sizeof(float) bytes.


But now, when we want to access the next element in a column, we must do:

float next_column_element = *((float*)((char*)myArray + (j+1)*pitch) + i); // pitch is in bytes, so cast to char* for the row offset


The offset in bytes between two successive rows can no longer be deduced from the width of our array; that is why we keep the pitch returned by cudaMallocPitch. And since the pitch is a multiple of the padding size (typically, the larger of the word size and the bank size), it works great. Yay.


Now that we know how to create and access a single element in an array created by cudaMallocPitch, we might want to copy whole parts of it to and from other memory, linear or not.


Let's say we want to copy our array into a 100x100 array allocated on our host with malloc:

float* host_memory = (float*)malloc(100*100*sizeof(float));


If we use cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padding bytes between rows. What we must do to avoid copying the padding is copy each row one by one. We can do it manually:

for (size_t i=0; i<100; ++i) {
    cudaMemcpy(host_memory + i*100, (char*)myArray + i*pitch,
        100*sizeof(float), cudaMemcpyDeviceToHost);
}


Or we can tell the CUDA API that we want only the useful memory out of the memory it allocated with padding bytes for its own convenience, so if it could deal with its own mess automatically, that would be very nice indeed, thank you. And here enters cudaMemcpy2D:

cudaMemcpy2D(host_memory, 100*sizeof(float) /*destination pitch*/, myArray, pitch,
    100*sizeof(float) /*width*/, 100 /*height*/, cudaMemcpyDeviceToHost);


Now the copy will be done automatically. It will copy the number of bytes specified in width (here: 100*sizeof(float)), height times (here: 100), advancing by pitch bytes each time it jumps to the next row. Note that we must still provide a pitch for the destination memory, because it could be padded too. Here it is not, so its pitch is that of a non-padded array: the size of a row. Note also that the width parameter of the memcpy functions is expressed in bytes, but the height parameter is expressed in number of elements. That is because of the way the copy is done, somewhat like the manual copy I wrote above: the width is the size of each copy along a row (elements that are contiguous in memory), and the height is the number of times this operation must be performed. (These inconsistencies in units, as a physicist, annoy me very much.)


3D arrays are actually no different from 2D arrays; there is no additional padding. A 3D array is just a classical 2D array of padded rows. That is why, when allocating a 3D array, you get only one pitch: the difference in byte count between two successive points along a row. If you want to access successive points along the depth dimension, you can safely multiply the pitch by the number of rows (the height), which gives you the slicePitch.


The CUDA API for accessing 3D memory is slightly different from the one for 2D memory, but the idea is the same:



  • When using cudaMalloc3D, you receive a pitch value that you must carefully keep for subsequent accesses to the memory.
  • When copying a 3D memory chunk, you cannot use cudaMemcpy unless you are copying a single row. You must use one of the copy utilities provided by the CUDA API that take the pitch into account.
  • When you copy your data to/from linear memory, you must provide a pitch for your pointer even though it is irrelevant: this pitch is the size of a row, expressed in bytes.
  • The size parameters are expressed in bytes for the row size, and in number of elements for the column and depth dimensions.

