CUDA kernel's vectors' length based on threadIdx


Question



This is part of the pseudo code I am implementing in CUDA as part of an image reconstruction algorithm:

for each xbin(0->detectorXDim/2-1):
 for each ybin(0->detectorYDim-1):
      rayInit=(xbin*xBinSize+0.5,ybin*xBinSize+0.5,-detectordistance)
      rayEnd=beamFocusCoord
      slopeVector=rayEnd-rayInit
      //knowing that r=rayInit+t*slopeVector;
      //x=rayInit[0]+t*slopeVector[0]
      //y=rayInit[1]+t*slopeVector[1]
      //z=rayInit[2]+t*slopeVector[2]
      //to find ray xx intersections:
      for each xinteger(xbin+1->detectorXDim/2):
                solve t for x=xinteger*xBinSize;
                find corresponding y and z
                add to intersections array
      //find ray yy intersections(analogous to xx intersections)
      //find ray zz intersections(analogous to xx intersections)

So far, this is what I have come up with:

__global__ void sysmat(int xfocus,int yfocus, int zfocus, int xbin,int xbinsize,int ybin,int ybinsize, int zbin, int projecoes){

    int tx=threadIdx.x, ty=threadIdx.y,tz=threadIdx.z, bx=blockIdx.x, by=blockIdx.y,i,x,y,z;
    int idx=ty+by*blocksize;
    int idy=tx+bx*blocksize;

    int slopeVectorx=xfocus-idx*xbinsize+0.5;
    int slopeVectory=yfocus-idy*ybinsize+0.5;
    int slopeVectorz=zfocus-zdetector;
    __syncthreads();


    //points where the ray intersects x axis
    int xint=idx+1;
    int yint=idy+1;
    int*intersectionsx[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsy[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsz[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];

    for(xint=xint; xint<detectorXDim/2;xint++){
            x=xint*xbinsize;
            t=(x-idx)/slopeVectorx;
            y=idy+t*slopeVectory;
            z=z+t*slopeVectorz;

            intersectionsx[xint-1]=x;
            intersectionsy[xint-1]=y;
            intersectionsz[xint-1]=z;
            __syncthreads();
    }
    ...
    }

This is just a piece of the code. I know there might be some errors (you can point them out if they are blatantly wrong), but what I am more concerned about is this:

Each thread (which corresponds to a detector bin) needs three arrays so it can save the points where the ray (which passes through this thread/bin) intersects multiples of the x, y and z axes. Each array's length depends on the position of the thread/bin (its index) in the detector and on the beamFocusCoord (which are fixed). In order to do this I wrote this piece of code, which I am certain cannot be done (I confirmed it with a small test kernel and it returns the error: "expression must have constant value"):

    int*intersectionsx[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
    int*intersectionsy[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
    int*intersectionsz[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];

So in the end, I want to know if there is an alternative to this piece of code, where a vector's length depends on the index of the thread allocating that vector.

Thank you in advance ;)

EDIT: Given that each thread will have to save an array with the coordinates of the intersections between the ray (which goes from the beam source to the detector) and the xx, yy and zz axes, and that the spatial dimensions are around (I don't have the exact numbers at the moment, but they are very close to the real values) 1400x3600x60, is this problem feasible with CUDA?

For example, thread (0,0) will have 1400 intersections with the x axis, 3600 with the y axis and 60 with the z axis, meaning that I would have to create an array of size (1400+3600+60)*sizeof(float), which is around 20 KB per thread.

So, given that each thread would surpass the 16 KB of local memory, that is out of the question. The other alternative was to allocate those arrays in global memory but, with some more math, we get (1400+3600+60)*4*numberofthreads (i.e. 1400*3600), which also surpasses the amount of global memory available :(

So I am running out of ideas to deal with this problem and any help is appreciated.

Solution

No.

Every piece of memory in CUDA must be known at kernel-launch time. You can't allocate/deallocate/change anything while the kernel is running. This is true for global memory, shared memory and registers.

The common workaround is to allocate the maximum amount of memory needed beforehand. This can be as simple as allocating the maximum size needed by any one thread, once per thread, or as complex as adding up every thread's required size into one total and calculating appropriate per-thread offsets into that array. It's a tradeoff between memory consumption and offset-computation time.

Go for the simple solution if you can, and for the complex one if memory limitations force you to.
