cuda -- out of memory (threads and blocks issue) -- Address is out of bounds
Problem description
I am using 63 registers per thread, so (with a maximum of 32768 registers) I can use about 520 threads. I am using 512 threads in this example.
(The parallelism is in the function "computeEvec", which is called inside the global function computeEHfields.) The problems are:
1) The memcheck error below.
2) When I use numPointsRp>2000 it shows me "out of memory", but (if I am not mistaken) I computed the global memory usage and it is OK.
-------------------------------UPDATED---------------------------
I ran the program with cuda-memcheck and it gives me (only when numPointsRs>numPointsRp):
========= Invalid global read of size 4
=========     at 0x00000428 in computeEHfields
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x4001076e0 is out of bounds
=========
========= Invalid global read of size 4
=========     at 0x00000428 in computeEHfields
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x4001076e0 is out of bounds
=========
========= Invalid global read of size 4
=========     at 0x00000428 in computeEHfields
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x4001076e0 is out of bounds
ERROR SUMMARY: 160 errors
-----------EDIT----------------------------
Also, sometimes (if I use only threads and not blocks; I haven't tested it with blocks), if for example I have numPointsRs=1000 and numPointsRp=100, then change numPointsRp to 200, and then change it back to numPointsRp=100, I do not get the first results back!
    import pycuda.gpuarray as gpuarray
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import numpy as np
    import cmath
    import pycuda.driver as drv

    Rs=np.zeros((numPointsRs,3)).astype(np.float32)
    for k in range (numPointsRs):
        Rs[k]=[0,k,0]

    Rp=np.zeros((numPointsRp,3)).astype(np.float32)
    for k in range (numPointsRp):
        Rp[k]=[1+k,0,0]

    #---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
    Rs_gpu=gpuarray.to_gpu(Rs)
    Rp_gpu=gpuarray.to_gpu(Rp)

    J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
    M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

    Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
    Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
    All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

    mod =SourceModule("""
    #include <pycuda-complex.hpp>
    #include <cmath>
    #include <vector>
    #define RowRsSize %(numrs)d
    #define RowRpSize %(numrp)d

    typedef pycuda::complex<float> cmplx;

    extern "C"{

    __device__ void computeEvec(float Rs_mat[][3], int numPointsRs,
                                cmplx J[][3],
                                cmplx M[][3],
                                float *Rp,
                                cmplx kp,
                                cmplx eta,
                                cmplx *Evec,
                                cmplx *Hvec, cmplx *All)
    {
        while (c<numPointsRs){
            ...
            c++;
        }
    }

    __global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                    float *Rp_mat_, int numPointsRp,
                                    cmplx *J_,
                                    cmplx *M_,
                                    cmplx kp,
                                    cmplx eta,
                                    cmplx E[][3],
                                    cmplx H[][3], cmplx *All )
    {
        float Rs_mat[RowRsSize][3];
        float Rp_mat[RowRpSize][3];

        cmplx J[RowRsSize][3];
        cmplx M[RowRsSize][3];

        int k=threadIdx.x+blockIdx.x*blockDim.x;

        while (k<numPointsRp)
        {
            computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
            k+=blockDim.x*gridDim.x;
        }
    }
    }
    """% { "numrs":numPointsRs, "numrp":numPointsRp},no_extern_c=1)

    func = mod.get_function("computeEHfields")

    func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))

    print(" \n")

    #----- get data back from GPU-----
    Rs=Rs_gpu.get()
    Rp=Rp_gpu.get()
    J=J_gpu.get()
    M=M_gpu.get()
    Evec=Evec_gpu.get()
    Hvec=Hvec_gpu.get()
    All=All_gpu.get()
--------------------GPU MODEL------------------------------------------------
    Device 0: "GeForce GTX 560"
      CUDA Driver Version / Runtime Version          4.20 / 4.10
      CUDA Capability Major/Minor version number:    2.1
      Total amount of global memory:                 1024 MBytes (1073283072 bytes)
      ( 0) Multiprocessors x (48) CUDA Cores/MP:     0 CUDA Cores   //CUDA Cores 336 => 7 MP and 48 Cores/MP
Solution

> When I use numPointsRp>2000 it shows me "out of memory"
Now that we have some real code to work with, let's compile it and see what happens. Using RowRsSize=2000 and RowRpSize=200 and compiling with the CUDA 4.2 toolchain, I get:

    nvcc -arch=sm_21 -Xcompiler="-D RowRsSize=2000 -D RowRpSize=200" -Xptxas="-v" -c -I./ kivekset.cu
    ptxas info    : Compiling entry function '_Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_' for 'sm_21'
    ptxas info    : Function properties for _Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_
        122432 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 57 registers, 84 bytes cmem[0], 168 bytes cmem[2], 76 bytes cmem[16]
The key numbers are 57 registers and a 122432 byte stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have a maximum of 1 block per SM, and your GPU has 7 SMs. This gives a total of 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) to run your kernel, before you have allocated a single byte of memory for input and output using PyCUDA. On a GPU with 1 GB of memory, it isn't hard to imagine running out of memory. Your kernel has an enormous local memory footprint. Start thinking about ways to reduce it.
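The figures above can be checked with a few lines of plain Python. This is only a back-of-the-envelope sketch: it assumes sizeof(float) is 4 and that pycuda::complex<float> occupies 8 bytes, it attributes the bytes not covered by the four local arrays to other per-thread stack use, it takes the 32768 register limit from the question, and it ignores register allocation granularity. The constant names are made up for illustration.

```python
# Element sizes in bytes (assumed: 4-byte float, complex<float> = two floats).
SIZEOF_FLOAT = 4
SIZEOF_CMPLX = 8

ROW_RS_SIZE = 2000  # RowRsSize used for the test compile
ROW_RP_SIZE = 200   # RowRpSize used for the test compile

# The four per-thread arrays declared inside computeEHfields.
local_arrays = {
    "Rs_mat": ROW_RS_SIZE * 3 * SIZEOF_FLOAT,
    "Rp_mat": ROW_RP_SIZE * 3 * SIZEOF_FLOAT,
    "J": ROW_RS_SIZE * 3 * SIZEOF_CMPLX,
    "M": ROW_RS_SIZE * 3 * SIZEOF_CMPLX,
}
arrays_bytes = sum(local_arrays.values())
print(arrays_bytes)  # 122400, i.e. almost all of the 122432 byte stack frame

# Numbers reported by ptxas and by the device query.
STACK_FRAME_BYTES = 122432
REGS_PER_THREAD = 57
REGS_PER_SM = 32768          # limit quoted in the question
THREADS_PER_BLOCK = 512
NUM_SM = 7
GLOBAL_MEM_BYTES = 1073283072

# Register pressure alone limits a 512-thread block to one block per SM.
blocks_per_sm = REGS_PER_SM // (REGS_PER_THREAD * THREADS_PER_BLOCK)
print(blocks_per_sm)  # 1

# Local memory needed for all resident threads, following the estimate above.
total_local = STACK_FRAME_BYTES * THREADS_PER_BLOCK * blocks_per_sm * NUM_SM
print(total_local)  # 438796288
print(f"{total_local / GLOBAL_MEM_BYTES:.1%} of global memory")
```

Because the array sizes scale linearly with RowRsSize and RowRpSize, raising either one grows the per-thread stack frame, and with it the total, in direct proportion.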
As I indicated in comments, it is absolutely unclear why every thread needs a complete copy of the input data in this kernel code. It results in a gigantic local memory footprint and there seems to be absolutely no reason why the code should be written in this way. You could, I suspect, modify the kernel to something like this:
    typedef pycuda::complex<float> cmplx;
    typedef float fp3[3];
    typedef cmplx cp3[3];

    __global__
    void computeEHfields2(
            float *Rs_mat_, int numPointsRs,
            float *Rp_mat_, int numPointsRp,
            cmplx *J_,
            cmplx *M_,
            cmplx kp,
            cmplx eta,
            cmplx E[][3],
            cmplx H[][3],
            cmplx *All )
    {
        // Reinterpret the flat global-memory buffers as arrays of 3-element rows
        fp3 * Rs_mat = (fp3 *)Rs_mat_;
        cp3 * J = (cp3 *)J_;
        cp3 * M = (cp3 *)M_;

        int k=threadIdx.x+blockIdx.x*blockDim.x;
        while (k<numPointsRp)
        {
            // Cast before adding k so the offset advances by whole rows (3 floats each)
            fp3 * Rp_mat = (fp3 *)Rp_mat_ + k;
            computeEvec2( Rs_mat, numPointsRs, J, M, *Rp_mat, kp, eta, E[k], H[k], All );
            k+=blockDim.x*gridDim.x;
        }
    }
and the main __device__ function it calls to something like this:
    __device__ void computeEvec2(
            fp3 Rs_mat[], int numPointsRs,
            cp3 J[],
            cp3 M[],
            fp3 Rp,
            cmplx kp,
            cmplx eta,
            cmplx *Evec,
            cmplx *Hvec,
            cmplx *All)
    {
        ....
    }
and eliminate every byte of thread local memory without changing the functionality of the computational code at all.
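The pointer casts work because each input is a flat, row-major buffer of 3-element rows, so row k starts 3*k elements into the buffer and can be addressed in place instead of being copied. A small host-side sketch of that indexing in plain Python, using a memoryview as a stand-in for the pointer (the helper row_view and the tiny data set are hypothetical, chosen to mirror how Rp is filled in the question):

```python
from array import array

def row_view(flat, k, width=3):
    """Return a zero-copy view of row k in a row-major (n x width) buffer."""
    return memoryview(flat)[width * k : width * (k + 1)]

# Hypothetical data laid out like Rp in the question: Rp[k] = [1 + k, 0, 0]
num_points_rp = 4
rp_flat = array("f", [0.0] * (3 * num_points_rp))
for k in range(num_points_rp):
    rp_flat[3 * k] = 1.0 + k

row2 = row_view(rp_flat, 2)
print(list(row2))  # [3.0, 0.0, 0.0]

# Writing through the view changes the underlying buffer: nothing was copied.
row2[1] = 7.0
print(rp_flat[3 * 2 + 1])  # 7.0
```

The same reasoning applies to Rs_mat, J and M: every thread reads the shared global buffers through a typed view, so no per-thread storage is needed for them.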