cudaMemcpy segmentation fault

Problem Description

I've been haunted by this error for quite a while so I decided to post it here.

This segmentation fault happens when a cudaMemcpy is called:

CurrentGrid->cdata[i] = new float[size];
cudaMemcpy(CurrentGrid->cdata[i], Grid_dev->cdata[i], size*sizeof(float),\
                cudaMemcpyDeviceToHost);

CurrentGrid and Grid_dev are pointers to a grid class object on the host and the device respectively, and i=0 in this context. The class member cdata is an array of float pointers. For debugging, right before this cudaMemcpy call I printed out the value of each element of Grid_dev->cdata[i], the addresses of CurrentGrid->cdata[i] and Grid_dev->cdata[i], and the value of size, all of which looked fine. But it still ends up with "Segmentation fault (core dumped)", which is the only error message. cuda-memcheck only gave "process didn't terminate successfully". I'm not able to use cuda-gdb at the moment. Any suggestions about where to go next?

UPDATE: It seems I have now solved this problem by cudaMalloc-ing another float pointer A on the device, cudaMemcpy-ing the value of Grid_dev->cdata[i] to A, and then cudaMemcpy-ing from A to the host. So the segment of code written above becomes:

float * A;
cudaMalloc((void**)&A, sizeof(float));
...
...
cudaMemcpy(&A, &(Grid_dev->cdata[i]), sizeof(float *), cudaMemcpyDeviceToHost);    
CurrentGrid->cdata[i] = new float[size];
cudaMemcpy(CurrentGrid->cdata[i], A, size*sizeof(float), cudaMemcpyDeviceToHost);            

I did this because valgrind reported "invalid read of size 8", which I thought referred to Grid_dev->cdata[i]. I checked it again with gdb, and printing Grid_dev->cdata[i] gave NULL. So I guess I cannot directly dereference the device pointer even inside this cudaMemcpy call. But why? According to the comment at the bottom of this thread, we should be able to dereference a device pointer in a cudaMemcpy call.

Also, I don't know the underlying mechanism of how cudaMalloc and cudaMemcpy work, but I think by cudaMalloc-ing a pointer, say A here, we actually make this pointer point to a certain address on the device. And by cudaMemcpy-ing Grid_dev->cdata[i] to A as in the modified code above, we re-assign the pointer A to point to the array. Then don't we lose track of the previous address that A pointed to when it was cudaMalloc-ed? Could this cause a memory leak or something? If so, how should I work around this situation properly? Thanks!
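
To illustrate the mechanism being asked about, here is a minimal sketch of what the update boils down to, assuming the same Grid_dev layout and size as above. A only needs to receive a device pointer value, so the cudaMalloc of A is not strictly necessary; if it is kept, that small allocation does become unreachable once &A is overwritten, unless it is cudaFree'd first:

float *A = NULL;   // plain host variable; no cudaMalloc needed for A itself

// Copy the device pointer value stored in Grid_dev->cdata[i] into A.
// &(Grid_dev->cdata[i]) is only pointer arithmetic on the host;
// Grid_dev itself is never dereferenced here.
cudaMemcpy(&A, &(Grid_dev->cdata[i]), sizeof(float *), cudaMemcpyDeviceToHost);

CurrentGrid->cdata[i] = new float[size];

// Use that device address as the source of the bulk copy.
cudaMemcpy(CurrentGrid->cdata[i], A, size * sizeof(float), cudaMemcpyDeviceToHost);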

For reference I put the code of the complete function in which this error happened below.

Many thanks!

__global__ void Print(grid *, int);
__global__ void Printcell(grid *, int);
void CopyDataToHost(param_t p, grid * CurrentGrid, grid * Grid_dev){

    cudaMemcpy(CurrentGrid, Grid_dev, sizeof(grid), cudaMemcpyDeviceToHost);
#if DEBUG_DEV
    cudaCheckErrors("cudaMemcpy1 error");
#endif
    printf("\nBefore copy cell data\n");
    Print<<<1,1>>>(Grid_dev, 0);            //Print out some Grid_dev information for 
    cudaDeviceSynchronize();                //debug 
    int NumberOfBaryonFields = CurrentGrid->ReturnNumberOfBaryonFields();
    int size = CurrentGrid->ReturnSize();
    int vsize = CurrentGrid->ReturnVSize();
    CurrentGrid->FieldType = NULL;
    CurrentGrid->FieldType = new int[NumberOfBaryonFields];
    printf("CurrentGrid size is %d\n", size);
    for( int i = 0; i < p.NumberOfFields; i++){
        CurrentGrid->cdata[i] = NULL;
        CurrentGrid->vdata[i] = NULL;
        CurrentGrid->cdata[i] = new float[size];
        CurrentGrid->vdata[i] = new float[vsize];

        Printcell<<<1,1>>>(Grid_dev, i);//Print out element value of Grid_dev->cdata[i]
        cudaDeviceSynchronize();        

        cudaMemcpy(CurrentGrid->cdata[i], Grid_dev->cdata[i], size*sizeof(float),\
                cudaMemcpyDeviceToHost);               //where error occurs
#if DEBUG_DEV
        cudaCheckErrors("cudaMemcpy2 error");
#endif
        printf("\nAfter copy cell data\n");
        Print<<<1,1>>>(Grid_dev, i);
        cudaDeviceSynchronize();
        cudaMemcpy(CurrentGrid->vdata[i], Grid_dev->vdata[i], vsize*sizeof(float),\
                cudaMemcpyDeviceToHost);
#if DEBUG_DEV
        cudaCheckErrors("cudaMemcpy3 error");
#endif
    }
    cudaMemcpy(CurrentGrid->FieldType, Grid_dev->FieldType,\
            NumberOfBaryonFields*sizeof(int), cudaMemcpyDeviceToHost);
#if DEBUG_DEV
    cudaCheckErrors("cudaMemcpy4 error");
#endif
}

EDIT: Here is the valgrind output, from which I'm trying to track down where the memory leak happened.

==19340== Warning: set address range perms: large range [0x800000000, 0xd00000000) (noaccess)
==19340== Warning: set address range perms: large range [0x200000000, 0x400000000) (noaccess)
==19340== Invalid read of size 8
==19340==    at 0x402C79: CopyDataToHost(param_t, grid*, grid*) (CheckDevice.cu:48)
==19340==    by 0x403646: CheckDevice(param_t, grid*, grid*) (CheckDevice.cu:186)
==19340==    by 0x40A6CD: main (Transport.cu:81)
==19340==  Address 0x2003000c0 is not stack'd, malloc'd or (recently) free'd
==19340== 
==19340== 
==19340== Process terminating with default action of signal 11 (SIGSEGV)
==19340==  Bad permissions for mapped region at address 0x2003000C0
==19340==    at 0x402C79: CopyDataToHost(param_t, grid*, grid*) (CheckDevice.cu:48)
==19340==    by 0x403646: CheckDevice(param_t, grid*, grid*) (CheckDevice.cu:186)
==19340==    by 0x40A6CD: main (Transport.cu:81)
==19340== 
==19340== HEAP SUMMARY:
==19340==     in use at exit: 2,611,365 bytes in 5,017 blocks
==19340==   total heap usage: 5,879 allocs, 862 frees, 4,332,278 bytes allocated
==19340== 
==19340== LEAK SUMMARY:
==19340==    definitely lost: 0 bytes in 0 blocks
==19340==    indirectly lost: 0 bytes in 0 blocks
==19340==      possibly lost: 37,416 bytes in 274 blocks
==19340==    still reachable: 2,573,949 bytes in 4,743 blocks
==19340==         suppressed: 0 bytes in 0 blocks
==19340== Rerun with --leak-check=full to see details of leaked memory
==19340== 
==19340== For counts of detected and suppressed errors, rerun with: -v
==19340== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)

Solution

I believe I know what the problem is, but to confirm it, it would be useful to see the code that you are using to set up the Grid_dev classes on the device.

When a class or other data structure is to be used on the device, and that class has pointers in it which refer to other objects or buffers in memory (presumably in device memory, for a class that will be used on the device), then the process of making this top-level class usable on the device becomes more complicated.

Suppose I have a class like this:

class myclass{
  int myval;
  int *myptr;
  }

I could instantiate the above class on the host, then malloc an array of int and assign that pointer to myptr, and everything would be fine. To make this class usable on the device, and the device only, the process could be similar. I could (see the sketch after this list):

  1. cudaMalloc a pointer to device memory that will hold myclass
  2. (optionally) copy an instantiated object of myclass on the host to the device pointer from step 1 using cudaMemcpy
  3. on the device, use malloc or new to allocate device storage for myptr
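
A minimal sketch of this device-only sequence, using new inside a kernel for step 3 (hypothetical sizes, no error checking; device-side new requires compute capability 2.0 or higher):

#include <cuda_runtime.h>

class myclass {
public:
  int  myval;
  int *myptr;
};

__global__ void setup_myptr(myclass *d_obj, int n) {
  d_obj->myptr = new int[n];          // step 3: allocate on the device heap
  for (int j = 0; j < n; j++)
    d_obj->myptr[j] = j;              // fill it so there is something to use
}

int main() {
  myclass h_obj;                      // host instance (used in optional step 2)
  h_obj.myval = 42;
  h_obj.myptr = NULL;

  myclass *d_obj;
  cudaMalloc((void **)&d_obj, sizeof(myclass));               // step 1
  cudaMemcpy(d_obj, &h_obj, sizeof(myclass),
             cudaMemcpyHostToDevice);                         // step 2 (optional)

  setup_myptr<<<1, 1>>>(d_obj, 100);                          // step 3
  cudaDeviceSynchronize();

  cudaFree(d_obj);   // the int[] from device-side new must be freed with
                     // device-side delete[]; cudaFree does not release it
  return 0;
}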

The above sequence is fine if I never want to access the storage allocated for myptr on the host. But if I do want that storage to be visible from the host, I need a different sequence:

  1. cudaMalloc a pointer to device memory that will hold myclass, let's call this mydevobj
  2. (optionally) copy an instantiated object of myclass on the host to the device pointer mydevobj from step 1 using cudaMemcpy
  3. Create a separate int pointer on the host, let's call it myhostptr
  4. cudaMalloc int storage on the device for myhostptr
  5. cudaMemcpy the pointer value of myhostptr from the host to the device pointer &(mydevobj->myptr)

After that, you can cudaMemcpy data to and from the region pointed to by the embedded pointer myptr, i.e. the region allocated (via cudaMalloc) for myhostptr.

Note that in step 5, because I am taking the address of this pointer location, this cudaMemcpy operation only requires the mydevobj pointer on the host, which is valid in a cudaMemcpy operation (only).

The embedded device pointer myptr will then be properly set up to do the operations you are trying to do. If you then want to cudaMemcpy data between that int storage and the host, you use the pointer myhostptr in any cudaMemcpy calls, not mydevobj->myptr. If we tried to use mydevobj->myptr, it would require dereferencing mydevobj, then using the result to retrieve the pointer that is stored in myptr, and then using that pointer as the copy to/from location. This is not acceptable in host code. If you try to do it, you will get a seg fault. (Note that, by way of analogy, my mydevobj is like your Grid_dev and my myptr is like your cdata.)
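
Putting the five steps together with the final data copies, a minimal sketch using the names above (hypothetical element count N, no error checking) might look like this:

#include <cuda_runtime.h>

class myclass {
public:
  int  myval;
  int *myptr;
};

int main() {
  const int N = 100;                  // hypothetical element count

  myclass *mydevobj;
  cudaMalloc((void **)&mydevobj, sizeof(myclass));            // step 1

  myclass h_obj;                      // optional host instance for step 2
  h_obj.myval = 42;
  h_obj.myptr = NULL;
  cudaMemcpy(mydevobj, &h_obj, sizeof(myclass),
             cudaMemcpyHostToDevice);                         // step 2

  int *myhostptr;                                             // step 3
  cudaMalloc((void **)&myhostptr, N * sizeof(int));           // step 4
  // &(mydevobj->myptr) is only pointer arithmetic; mydevobj is never
  // dereferenced on the host, so this is legal in a cudaMemcpy call.
  cudaMemcpy(&(mydevobj->myptr), &myhostptr, sizeof(int *),
             cudaMemcpyHostToDevice);                         // step 5

  // ... launch kernels that read/write mydevobj->myptr on the device ...

  // Copy the data back through myhostptr, never through mydevobj->myptr:
  int *h_data = new int[N];
  cudaMemcpy(h_data, myhostptr, N * sizeof(int), cudaMemcpyDeviceToHost);

  delete[] h_data;
  cudaFree(myhostptr);
  cudaFree(mydevobj);
  return 0;
}

The host keeps myhostptr as its handle to the embedded allocation; mydevobj->myptr is only ever touched from device code, or through the &(mydevobj->myptr) address arithmetic in step 5.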

Overall it is a concept that requires some careful thought the first time you run into it, and so questions like this come up with some frequency on SO. You may want to study some of these questions to see code examples (since you haven't provided your code that sets up Grid_dev):

  1. example 1
  2. example 2
  3. example 3
