复杂的循环要移植到CUDA内核 [英] Complicated for loop to be ported to a CUDA kernel
本文介绍了复杂的循环要移植到CUDA内核的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我有下一个嵌套循环,我想将其移植到CUDA以便在GPU上运行
I have the next for nested loop and I would like to port it to CUDA to be run on a GPU
int current=0;
int ptr=0;
for (int i=0; i < Nbeans; i++){
for(int j=0;j< NbeamletsPerbeam[i];j++){
current = j + ptr;
for(int k=0;k<Nmax;k++){
......
}
ptr+=NbeamletsPerbeam[i];
}
}
如果有人有想法,我会很高兴如何做或如何做
我们正在谈论Nbeams = 5,NbeamletsPerBeam每个大约200。
I would be very happy if any body has an idea of how to do it or how can be done. We are talking about Nbeams=5, NbeamletsPerBeam around 200 each.
这是我目前拥有的,但我不确定这是正确的... / p>
This is what I currently have but I am not sure it is right...
for (int i= blockIdx.x; i < d_params->Nbeams; i += gridDim.x){
for (int j= threadIdx.y; j < d_beamletsPerBeam[i]; j+= blockDim.y){
currentBeamlet= j+k;
for (int ivoxel= threadIdx.x; ivoxel < totalVoxels; ivoxel += blockDim.x){
推荐答案
我会建议这个想法。但是,您可能需要根据代码进行一些小的修改。
I would suggest this idea. But you might need to do some minor modifications based on your code.
dim3 blocks(NoOfThreads, 1);
dim3 grid(Nbeans, 1);
kernel<<grid, blocks, 1>>()
__global__ kernel()
{
int noOfBlocks = ( NbeamletsPerbeam[blockIdx.x] + blockDim.x -1)/blockDim.x;
for(int j=0; j< noOfBlocks;j++){
// use threads and compute....
if( (threadIdx.x * j) < NbeamletsPerbeam[blockIdx.x]) {
current = (threadIdx.x * j) + ptr;
for(int k=0;k<Nmax;k++){
......
}
ptr+=NbeamletsPerbeam[blockIdx.x];
}
}
}
为您提供更好的并行化。
This should do the trick and gives you better parallelization.
这篇关于复杂的循环要移植到CUDA内核的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文