CUDA - memcpy2d - wrong pitch

Question

I just started CUDA programming and am trying to run the code shown below. The idea is to copy a two-dimensional array to the device, compute the sum of all elements, and retrieve the sum afterwards. (I know this algorithm is not parallelized; in fact it does more work than necessary. It is only intended as practice for memcpy.)

#include<stdio.h>
#include<cuda.h>
#include <iostream>
#include <cutil_inline.h>

#define height 50
#define width 50

using namespace std;

// Device code
__global__ void kernel(float* devPtr, int pitch,int* sum)
{
int tempsum = 0;    
for (int r = 0; r < height; ++r) {
        int* row = (int*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
             int element = row[c];
             tempsum = tempsum + element;
        }
    }
*sum = tempsum;
}

//Host Code
int main()
{

int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};
int* sum =0;
int* sumhost = 0;
sumhost = (int*)malloc(sizeof(int));

cout << *sumhost << endl;

float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);
cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);

cudaMalloc((void**)&sum, sizeof(int));
kernel<<<1, 4>>>(devPtr, pitch, sum);
cutilCheckMsg("kernel launch failure");
cudaMemcpy(sumhost, sum, sizeof(int), cudaMemcpyDeviceToHost);

cout << *sumhost << endl;

return 0;
}

This code compiles just fine (on the 4.0 SDK release candidate). However, as soon as I try to execute it, I get:

0
cpexample.cu(43) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : invalid pitch argument.

Which is unfortunate, since I have no idea how to fix it ;-(. As far as I know, the pitch is an offset in memory that allows faster copying of data. However, such a pitch is only used in device memory, not in host memory, isn't it? Therefore the pitch of my host memory should be 0, shouldn't it?

Moreover I would also like to ask two other questions:


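  • If I declare a variable such as int* sumhost (see above), where does this pointer point? First to host memory, and then, after cudaMalloc, to device memory?

  • cutilCheckMsg was very handy in this case. Are there similar debugging functions I should know about?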

Recommended answer

In this line of code:

cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);

you're saying the source pitch for testarray is 0, but how can that be when the address of an element in pitched memory is computed as T* elem = (T*)((char*)base_address + row * pitch) + column? With a pitch of 0, every row would resolve to the same address as row 0, so lookups at a two-dimensional (row, column) offset would not return the right values. The rule to keep in mind is pitch = width + padding, with all three measured in bytes. On the host the padding is often 0, but the width is not 0 unless the array is empty. On the device there may be extra padding, which is why the pitch may not equal the declared width of the array. Because padding is never negative, pitch >= width always holds. So even on the host side, the source pitch must be at least the size of each row in bytes, which for testarray is 8*sizeof(int). Finally, your host 2D array has only 2 rows, not 4, so the height argument should be 2.
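To make the pitch arithmetic concrete, here is a small host-only C++ sketch. It does not call CUDA: copy_2d is a hypothetical stand-in that mimics the row-by-row behaviour of a 2D copy and rejects any pitch smaller than the row width, which is assumed to mirror the check behind the "invalid pitch argument" error above. The padded row size of 16 ints is also just an illustrative assumption; a real cudaMallocPitch call chooses its own pitch.

```cpp
#include <cstddef>
#include <cstring>

// Address of element (row, col) in a pitched 2D buffer of ints.
// `pitch` is the distance in BYTES between the starts of consecutive rows.
inline int* element_at(void* base, std::size_t pitch,
                       std::size_t row, std::size_t col) {
    return reinterpret_cast<int*>(static_cast<char*>(base) + row * pitch) + col;
}

// Hypothetical host-only stand-in for a 2D copy (not a CUDA call).
// Copies `height` rows of `width_bytes` bytes each and returns false
// when either pitch is smaller than the row width.
inline bool copy_2d(void* dst, std::size_t dpitch,
                    const void* src, std::size_t spitch,
                    std::size_t width_bytes, std::size_t height) {
    if (dpitch < width_bytes || spitch < width_bytes) return false;
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(static_cast<char*>(dst) + r * dpitch,
                    static_cast<const char*>(src) + r * spitch,
                    width_bytes);
    }
    return true;
}

// Copies the 2x8 array from the question into a padded buffer using the
// given source pitch, then sums it through the pitch formula.
// Returns -1 if the copy is rejected.
inline int sum_after_copy(std::size_t spitch) {
    int testarray[2][8] = {{4, 4, 4, 4, 4, 4, 4, 4}, {4, 4, 4, 4, 4, 4, 4, 4}};
    int padded[2][16] = {};  // assumed layout: 8 data ints + 8 padding ints per row
    const std::size_t width_bytes = sizeof(testarray[0]);  // 8 * sizeof(int)
    const std::size_t dpitch = sizeof(padded[0]);          // 16 * sizeof(int)
    if (!copy_2d(padded, dpitch, testarray, spitch, width_bytes, 2)) return -1;
    int sum = 0;
    for (std::size_t r = 0; r < 2; ++r)
        for (std::size_t c = 0; c < 8; ++c)
            sum += *element_at(padded, dpitch, r, c);
    return sum;
}
```

With these helpers, sum_after_copy(0) reports failure, while sum_after_copy(8 * sizeof(int)) returns the sum of all 16 elements. In the real program, the corresponding fix is to pass 8*sizeof(int) as the source pitch and 2 as the height in the cudaMemcpy2D call.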

As for your question about allocated pointers: if you allocate a pointer with malloc(), it holds an address in host memory, so you can dereference it on the host side but not on the device side. A pointer allocated with cudaMalloc(), on the other hand, holds an address in device memory. If you dereference it on the host, it does not point to allocated host memory, and unpredictable results will ensue. It is fine to pass this pointer to a kernel, though, since when it is dereferenced on the device side it points to memory the device can access locally. Overall, the CUDA runtime keeps these two memory spaces separate and provides memory copy functions that copy between device and host, using the address values in these pointers as source and destination depending on the desired direction (host-to-device or device-to-host). If you took the same int* and first allocated it with malloc(), and then (after hopefully calling free() on it) with cudaMalloc(), the pointer would first hold an address in host memory and then one in device memory. You would have to keep track of its state to avoid dereferencing a device address in host code or a host address in device code, either of which gives unpredictable results.
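One possible way to do that bookkeeping is to store the address space next to the pointer and check it before every host-side read. The sketch below is host-only C++ and purely illustrative: MemSpace, TrackedIntPtr and read_on_host are hypothetical names, not part of the CUDA runtime, and the tag has to be set by hand wherever the pointer is allocated.

```cpp
// Which address space a pointer value belongs to (hypothetical bookkeeping type).
enum class MemSpace { Host, Device };

// A raw int pointer paired with the space it was allocated in.
struct TrackedIntPtr {
    int* ptr;
    MemSpace space;
};

// Host-side read that refuses null pointers and pointers not tagged as host memory.
// Returns true and writes the value to `out` only when the read is allowed.
inline bool read_on_host(const TrackedIntPtr& p, int& out) {
    if (p.ptr == nullptr || p.space != MemSpace::Host) return false;
    out = *p.ptr;
    return true;
}
```

In the program from the question, sumhost would carry the Host tag and sum the Device tag, so only sumhost could be read through read_on_host. A simpler convention that serves the same purpose is to use separate, clearly named variables for host and device pointers and never reuse one for both.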
