Nested for loops with CUDA

Problem description

I'm having trouble converting some nested for loops from C/C++ into CUDA. Basically, I have four nested for loops that share the same array and perform bit-shift operations.

#define N 65536

// ----------------------------------------------------------------------------------

int a1,a2,a3,a4, i1,i2,i3,i4;

int Bit4CBitmapLookUp[16] = {0, 1, 3, 3, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15};

int _cBitmapLookupTable[N];

int s = 0;  // index into the cBitmapLookupTable

for (i1 = 0; i1 < 16; i1++)
{
    // first customer
    a1 = Bit4CBitmapLookUp[i1] << 12;

    for (i2 = 0; i2 < 16; i2++)
    {
        // second customer
        a2 = Bit4CBitmapLookUp[i2] << 8;

        for (i3 = 0; i3 < 16; i3++)
        {
            // third customer
            a3 = Bit4CBitmapLookUp[i3] << 4;

            for (i4 = 0;i4 < 16;i4++)
            {
                // fourth customer
                a4 = Bit4CBitmapLookUp[i4];

                // now actually set the _cBitmapLookupTable value
                _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;

                s++;

            } // for i4
        } // for i3
    } // for i2
} // for i1

This is the code that I need to convert into CUDA. I have tried different approaches, but every time I get the wrong output. Here is my attempt at the CUDA conversion (an excerpt from the kernel):

#define N 16

//----------------------------------------------------------------------------------

// index for the GPU
int i1 = blockDim.x * blockIdx.x + threadIdx.x;
int i2 = blockDim.y * blockIdx.y + threadIdx.y;
int i3 = i1;
int i4 = i2;

__syncthreads();
for(i1 = i2 = 0; i1 < N, i2 < N; i1++, i2++)
{
    // first customer
    a1 = Bit4CBitmapLookUp_device[i1] << 12;

    // second customer
    a2 = Bit4CBitmapLookUp_device[i2] << 8;

    for(i3 = i4 = 0; i3 < N, i4 < N; i3++, i4++){
        // third customer
        a3 = Bit4CBitmapLookUp_device[i3] << 4;

        // fourth customer
        a4 = Bit4CBitmapLookUp_device[i4];

        // now actually set the sBitmapLookupTable value
        _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
        s++;
    }
} 

I'm brand new to CUDA and still learning, but I really can't find a solution for these nested for loops. Thank you in advance.

Recommended answer

As leftaroundabout already indicated, there's a problem with the initialization. I would recommend rewriting the program as follows:

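// Grid-stride loops: each thread starts at its own (i1, i2) pair and steps by the
// total number of threads in that dimension, so any 2D launch covers all pairs.
for (int i1 = blockDim.x * blockIdx.x + threadIdx.x; i1 < N; i1 += blockDim.x * gridDim.x) {
  a1 = ..;  // first customer, as in the question
  for (int i2 = blockDim.y * blockIdx.y + threadIdx.y; i2 < N; i2 += blockDim.y * gridDim.y) {
    a2 = ..;  // second customer, as in the question

    // Each (i1, i2) pair owns N*N consecutive entries, so s is computed per
    // thread instead of being a shared running counter.
    int s = (i1 * N + i2) * N * N;

    // i3 and i4 must be separate nested loops; advancing them in lock step
    // would visit only N of the N*N combinations.
    for (int i3 = 0; i3 < N; i3++) {
      // third customer
      a3 = Bit4CBitmapLookUp_device[i3] << 4;

      for (int i4 = 0; i4 < N; i4++) {
        // fourth customer
        a4 = Bit4CBitmapLookUp_device[i4];

        // now actually set the _cBitmapLookupTable value
        _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
        s++;
      }
    }
  }
}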

I haven't tested it, so I can't guarantee that the indices are correct. I'll leave that to you.

A bit more explanation: in the code above, only the loops over i1 and i2 are parallelized. This assumes that N**2 is large compared to the number of cores on your GPU. If it is not (with N = 16, as in the question's kernel, N**2 is only 256), all four loops need to be parallelized to obtain an efficient program, and the approach would then be a bit different.
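
One possible way to parallelize all four loops is to give each thread a single table entry and recover i1 to i4 from the flat index s, since the serial code fills entry s = i1*4096 + i2*256 + i3*16 + i4. The following is only a minimal sketch under assumptions that are not part of the original post: the names `fillTable`, `buildTableOnGpu`, `buildTableOnCpu` and `TABLE_SIZE` are hypothetical, the small lookup table is assumed to live in `__constant__` memory, and the launch configuration of 256 blocks of 256 threads is an arbitrary choice.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TABLE_SIZE 65536  // 16^4 entries, same as N in the serial C/C++ code

// 4-bit lookup table from the question (assumption: stored in constant memory).
__constant__ int Bit4CBitmapLookUp_device[16] =
    {0, 1, 3, 3, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15};

// Hypothetical kernel: one table entry per loop iteration, indexed by s.
__global__ void fillTable(int *table)
{
    // Grid-stride loop, so any 1D launch configuration covers every entry.
    for (int s = blockIdx.x * blockDim.x + threadIdx.x;
         s < TABLE_SIZE;
         s += blockDim.x * gridDim.x)
    {
        // Undo s = i1*4096 + i2*256 + i3*16 + i4 with shifts and masks.
        int i1 = (s >> 12) & 15;  // first customer
        int i2 = (s >> 8) & 15;   // second customer
        int i3 = (s >> 4) & 15;   // third customer
        int i4 = s & 15;          // fourth customer

        table[s] = (Bit4CBitmapLookUp_device[i1] << 12)
                 | (Bit4CBitmapLookUp_device[i2] << 8)
                 | (Bit4CBitmapLookUp_device[i3] << 4)
                 |  Bit4CBitmapLookUp_device[i4];
    }
}

// Hypothetical host helper: returns the table, or an empty vector on a CUDA error.
std::vector<int> buildTableOnGpu()
{
    std::vector<int> host(TABLE_SIZE);
    int *dev = nullptr;
    if (cudaMalloc(&dev, TABLE_SIZE * sizeof(int)) != cudaSuccess)
        return {};

    fillTable<<<TABLE_SIZE / 256, 256>>>(dev);

    // cudaMemcpy waits for the kernel and also reports errors from the launch.
    cudaError_t err = cudaMemcpy(host.data(), dev, TABLE_SIZE * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return {};
    return host;
}

// CPU reference: the four nested loops from the question, unchanged in logic.
std::vector<int> buildTableOnCpu()
{
    const int lut[16] = {0, 1, 3, 3, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15};
    std::vector<int> table;
    table.reserve(TABLE_SIZE);
    for (int i1 = 0; i1 < 16; i1++)
        for (int i2 = 0; i2 < 16; i2++)
            for (int i3 = 0; i3 < 16; i3++)
                for (int i4 = 0; i4 < 16; i4++)
                    table.push_back((lut[i1] << 12) | (lut[i2] << 8) |
                                    (lut[i3] << 4) | lut[i4]);
    return table;
}
```

Because every thread writes a different entry and s is derived from the thread index, no shared counter or `__syncthreads()` is needed. Comparing the result of `buildTableOnGpu()` with `buildTableOnCpu()` from a small `main` is a quick way to check the indices.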
