driver.Context.synchronize() - what else to take into consideration -- a clean-up operation failed


Problem description

I have this code here (modified due to the answer).

Info

32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]

I don't know what else to take into consideration in order to make it work for different combinations of points "numPointsRs" and "numPointsRp".

When, for example, I run the code with Rs=10000 and Rp=100000 with block=(128,1,1), grid=(200,1), it's fine.

My computations:

46 registers * 128 threads = 5888 registers.

My card has a limit of 32768 registers, so 32768/5888 = 5 + some => 5 blocks/SM
(my card has a limit of 6).

With the occupancy calculator I found that using 128 threads/block gives me 42%, and I am within the limits of my card.

Also, the number of threads per MP is 640 (the limit is 1536).

Now, if I try to use Rs=100000 and Rp=100000 (for the same threads and blocks), it gives me the message in the title, with:

cuEventDestroy failed: launch timeout

cuModuleUnload failed: launch timeout

1) I don't know/understand what else needs to be computed.

2) I can't understand how we use/find the number of blocks. I can see that mostly someone puts (threads-1+points)/threads, but that still doesn't work.
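For reference, a minimal sketch of the ceiling-division formula mentioned above (values taken from this question; the variable names are just illustrative):

# One thread per point, rounded up so the last partial block is still launched.
numPointsRp = 100000
threadsPerBlock = 128
blocksPerGrid = (numPointsRp + threadsPerBlock - 1) // threadsPerBlock
print(blocksPerGrid)   # -> 782

# The kernel would then be launched with
#   block=(threadsPerBlock, 1, 1), grid=(blocksPerGrid, 1)
# The grid-stride loop in the kernel (k += blockDim.x*gridDim.x) also keeps the
# result correct when fewer blocks than that are launched, as with grid=(200,1).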

--------------UPDATED----------------------------------------------

After using driver.Context.synchronize(), the code works for many points (1000000)!

But what impact does this addition have on the code? (For many points the screen freezes for 1 minute or more.) Should I use it or not?
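For reference, a minimal self-contained sketch (with a trivial, purely illustrative kernel) of where the synchronize() call sits relative to a launch:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    while (k < n) {
        x[k] *= 2.0f;
        k += blockDim.x * gridDim.x;
    }
}
""")
scale = mod.get_function("scale")

x_gpu = gpuarray.to_gpu(np.arange(1000, dtype=np.float32))
scale(x_gpu, np.int32(1000), block=(128, 1, 1), grid=(8, 1))

# Kernel launches are asynchronous: synchronize() blocks the host here until the
# kernel has finished, instead of that wait happening implicitly inside the first
# .get() call further down.
drv.Context.synchronize()

print(x_gpu.get()[:5])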

--------------UPDATED2----------------------------------------------

Now the code doesn't work again, without my changing anything!

Snapshot of code:

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t






#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)

J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

#-----------------------------------------------------------------------------------    
mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

typedef  pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__device__ __constant__ float Pi;

extern "C"{  


    __device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,   
         cp3 J[],
         cp3 M[],
         fp3 Rp,
         cmplx kp, 
         cmplx eta,
         cmplx *Evec,
         cmplx *Hvec, cmplx *All)

{

            int c = 0;   // (assumed; the counter declaration was cut from this snapshot)
            while (c<numPointsRs){

        ...                      
                c++;

                }        
        }


__global__  void computeEHfields(float *Rs_mat_, int numPointsRs,     
        float *Rp_mat_, int numPointsRp,     
    cmplx *J_,
    cmplx *M_,
    cmplx  kp, 
    cmplx  eta,
    cmplx E[][3],
    cmplx H[][3], cmplx *All )
    {

        fp3 * Rs_mat=(fp3 *)Rs_mat_;
        fp3 * Rp_mat=(fp3 *)Rp_mat_;
        cp3 * J=(cp3 *)J_;
        cp3 * M=(cp3 *)M_;


    int k=threadIdx.x+blockIdx.x*blockDim.x;

      while (k<numPointsRp)  
     {

        computeEvec( Rs_mat, numPointsRs,  J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
        k+=blockDim.x*gridDim.x;

    }

}
}

""" ,no_extern_c=1,options=['--ptxas-options=-v'])


#call the function(kernel)
func = mod.get_function("computeEHfields")

func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))


#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()
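For reference, the snapshot assumes host-side definitions roughly like the following (purely illustrative, hypothetical values; the (N,3) float32 and complex64 layouts are what the kernel's casts expect):

numPointsRs = 10000
numPointsRp = 100000
Rs = np.random.rand(numPointsRs, 3).astype(np.float32)   # source points, cast to fp3 in the kernel
Rp = np.random.rand(numPointsRp, 3).astype(np.float32)   # observation points
kp  = 1.0 + 0.1j    # wavenumber; example value only
eta = 377.0 + 0j    # impedance; example value only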

My card:

Device 0: "GeForce GTX 560"
  CUDA Driver Version / Runtime Version          4.20 / 4.10
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073283072 bytes)
  ( 0) Multiprocessors x (48) CUDA Cores/MP:     0 CUDA Cores   //CUDA Cores    336 => 7 MP and 48 Cores/MP

Solution

There are quite a few issues that you have to deal with. Answer 1 provided by @njuffa is the best general solution. I'll provide more feedback based upon the limited data you have provided.

  1. PTX output of 46 registers is not the number of registers used by your kernel. PTX is an intermediate representation. The offline or JIT compiler will convert this to device code. Device code may use more or fewer registers. Nsight Visual Studio Edition, the Visual Profiler, and the CUDA command line profiler can all provide you with the correct register count.

  2. The occupancy calculation is not simply RegistersPerSM / RegistersPerThread. Registers are allocated based upon a granularity. For CC 2.1 the granularity is 4 registers per thread per warp (128 registers); a worked sketch of this rounding follows this list. 2.x devices can actually allocate at a 2 register granularity but this can lead to fragmentation later in the kernel.

  3. In your occupancy calculation you state

My card has limit 32768registers,so 32768/5888=5 +some => 5 block/SM (my card has limit 6).

I'm not sure what 6 means. Your device has 7 SMs. The maximum blocks per SM for 2.x devices is 8 blocks per SM.
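Putting points 2 and 3 together, here is a small worked sketch of the register rounding (treat the occupancy calculator spreadsheet as the authoritative source; 46 is the ptxas figure used only as a stand-in for the profiler's per-thread count):

regs_per_thread = 46       # stand-in value; plug in the count reported by a profiler
threads_per_block = 128
warp_size = 32
granularity = 128          # 4 registers/thread * 32 threads/warp, per point 2 above
regs_per_sm = 32768        # CC 2.1

warps_per_block = threads_per_block // warp_size                                 # 4
regs_per_warp = -(-(regs_per_thread * warp_size) // granularity) * granularity   # 1472 rounded up to 1536
regs_per_block = regs_per_warp * warps_per_block                                 # 6144
print(regs_per_sm // regs_per_block)   # 5 blocks/SM limited by registers (hard cap is 8 blocks/SM on 2.x)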

  1. You have provided an insufficient amount of code. If you provide pieces of code, please provide the size of all inputs, the number of times each loop will be executed, and a description of the operations per function. Looking at the code, you may be doing too many loops per thread. Without knowing the order of magnitude of the outer loop we can only guess.

  2. Given that the launch is timing out you should probably approach debugging as follows:

a. Add a line to the beginning of the code

if (blockIdx.x > 0) { return; }

Run the exact code you have in one of the previously mentioned profilers to estimate the duration of a single block. Using the launch information provided by the profiler (registers per thread, shared memory, ...), use the occupancy calculator in the profiler or the xls to determine the maximum number of blocks that you can run concurrently. For example, if the theoretical block occupancy is 3 blocks per SM, and the number of SMs is 7, then you can run 21 blocks at a time, which for your launch is 9 waves. NOTE: this assumes equal work per thread. Change the early exit code to allow 1 wave (21 blocks). If this launch times out then you need to reduce the amount of work per thread. If this passes then calculate how many waves you have and estimate when you will time out (2 sec on Windows, ? on Linux); a quick sketch of this arithmetic follows step b below.

b. If you have too many waves then you have to reduce the launch configuration. Given that you index by gridDim.x and blockDim.x you can do this by passing these dimensions as parameters to your kernel. This will require you to minimally change your indexing code. You will also have to pass a blockIdx.x offset. Change your host code to launch multiple kernels back to back. Since there should be no conflict you can launch these in multiple streams to benefit from overlap at the end of each wave.
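For step a, a quick arithmetic sketch of the wave estimate (the per-block time is a hypothetical figure you would measure in the profiler run):

blocks_launched = 200        # grid=(200,1) from the question
blocks_per_sm = 3            # theoretical block occupancy (example value from step a)
num_sms = 7                  # GTX 560

concurrent_blocks = blocks_per_sm * num_sms                        # 21 blocks in flight at once
full_waves, remainder = divmod(blocks_launched, concurrent_blocks) # 9 full waves + a partial one
total_waves = full_waves + (1 if remainder else 0)                 # 10

single_block_time = 0.25     # seconds; hypothetical figure measured in step a
print(total_waves * single_block_time)   # compare against the ~2 s display watchdog on Windows

And for step b, a rough sketch of the split launch (trivial kernel purely for illustration; blockOffset and totalBlocks are the two extra kernel parameters described above, and the stream round-robin is optional):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, int n, int blockOffset, int totalBlocks)
{
    // Index as if this launch were blocks [blockOffset, blockOffset + gridDim.x)
    // of one big virtual grid with totalBlocks blocks.
    int k = threadIdx.x + (blockIdx.x + blockOffset) * blockDim.x;
    while (k < n) {
        x[k] *= 2.0f;
        k += blockDim.x * totalBlocks;
    }
}
""")
scale = mod.get_function("scale")

n = 1000000
x_gpu = gpuarray.to_gpu(np.ones(n, dtype=np.float32))

threads = 128
total_blocks = 200            # the original, too-long launch
blocks_per_launch = 21        # e.g. one wave per launch
streams = [drv.Stream() for _ in range(2)]

for i, offset in enumerate(range(0, total_blocks, blocks_per_launch)):
    this_launch = min(blocks_per_launch, total_blocks - offset)
    scale(x_gpu, np.int32(n), np.int32(offset), np.int32(total_blocks),
          block=(threads, 1, 1), grid=(this_launch, 1),
          stream=streams[i % len(streams)])

# Wait for all the partial launches; each one stays well under the watchdog limit.
drv.Context.synchronize()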
