Parallelize four and more nested loops with CUDA


Problem description


I am working on a compiler generating parallel C++ code. I am new to CUDA programming but I am trying to parallelize the C++ code with CUDA.

Currently if I have the following sequential C++ code:

for(int i = 0; i < a; i++) {
    for(int j = 0; j < b; j++) {
        for(int k = 0; k < c; k++) {
            A[i*y*z + j*z + k] = 1;
        }
    }
}

and this results in the following CUDA code:

__global__ void kernelExample() {
    int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);

    A[_cu_x*y*z + _cu_y*z + _cu_z] = 1;
}

so each loop nest is mapped to one dimension, but what would be the correct way to parallelize four and more nested loops:

for(int i = 0; i < a; i++) {
    for(int j = 0; j < b; j++) {
        for(int k = 0; k < c; k++) {
            for(int l = 0; l < d; l++) {
                A[i*x*y*z + j*y*z + k*z + l] = 1;
            }
        }
    }
}

Is there any similar way? Noteworthy: all loop dimensions are parallel and there are no dependencies between iterations.

Thanks in advance!

EDIT: the goal is to map all iterations to CUDA threads, since all iterations are independent and could be executed concurrently.

Solution

You could keep the outer loop unchanged. Also, it is better to use .x for the innermost loop, so that adjacent threads access adjacent global memory locations (coalesced access).

__global__ void kernelExample() {
    int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
    for(int i = 0; i < a; i++) {
        A[i*x*y*z + _cu_z*y*z + _cu_y*z + _cu_x] = 1;
    }
}

However, if a, b, c and d are all very small, you may not be able to get enough parallelism this way. In that case you could launch one thread per iteration and convert the linear thread index into n-D indices.

__global__ void kernelExample() {
    int tid = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int i = tid / (b*c*d);
    int j = tid / (c*d) % b;
    int k = tid / d % c;
    int l = tid % d;

    A[i*x*y*z + j*y*z + k*z + l] = 1;
}

But be careful: computing i, j, k, l this way may introduce significant overhead, since integer division and modulo are slow on the GPU. As an alternative, you could map i and j to .z and .y, and compute only k and l (and any further dimensions) from .x in the same way.
