What does #pragma unroll do exactly? Does it affect the number of threads?

Problem Description

I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique:

__global__ void kernel(float *b, int size)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    #pragma unroll
    for (int i = 0; i < size; i++)
        b[i] = i;
}

Above is my kernel function. In main, I call it like below:

#include <cstdlib>
#include <iostream>
#include <conio.h>
#include <cuda_runtime.h>

using namespace std;

int main()
{
    float *a; // host array
    float *b; // device array
    int size = 100;

    a = (float*)malloc(size * sizeof(float));
    cudaMalloc((void**)&b, size * sizeof(float)); // size in bytes, not elements
    cudaMemcpy(b, a, size * sizeof(float), cudaMemcpyHostToDevice);

    kernel<<<1, size>>>(b, size); // size=100

    cudaMemcpy(a, b, size * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < size; i++)
        cout << a[i] << "\t";

    _getch(); // Windows-only pause, from <conio.h>

    return 0;
}

Does it mean I have size*size = 10000 threads running to execute the program? Are 100 of them created when the loop is unrolled?

Recommended Answer

No. It means you have called a CUDA kernel with one block, and that one block has 100 active threads. You're passing size as the second function parameter to your kernel. In your kernel, each of those 100 threads executes the for loop 100 times.

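As a side note, the tid computed in your kernel is never used, so all 100 threads redundantly write the same 100 elements. If the intent were one element per thread, a minimal sketch (the kernel name and bounds guard here are my own, not from the question) would look like:

// Sketch: each thread writes exactly one element, indexed by its global
// thread ID; there is no loop, so there is nothing to unroll.
__global__ void kernel_one_per_thread(float *b, int size)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < size)  // guard against launching more threads than elements
        b[tid] = (float)tid;
}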
#pragma unroll is a compiler optimization that can, for example, replace a piece of code like

for ( int i = 0; i < 5; i++ )
    b[i] = i;

with

b[0] = 0;
b[1] = 1;
b[2] = 2;
b[3] = 3;
b[4] = 4;

by putting the #pragma unroll directive right before the loop. The good thing about the unrolled version is that it involves less processing load for the processor. In the for loop version, besides assigning each i to b[i], the processing involves initializing i, evaluating i<5 six times, and incrementing i five times. In the second case, it only involves filling up the b array content (perhaps plus int i=5; if i is used later). Another benefit of loop unrolling is the enhancement of instruction-level parallelism (ILP). In the unrolled version, the processor has more operations to push into the processing pipeline without worrying about the for loop condition on every iteration.

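Note that nvcc also accepts an optional unroll factor after the pragma; the unrolling decision is still made at compile time, but the trip count may stay a runtime value. A small sketch (the kernel itself is made up for illustration):

// Sketch: partial unrolling by a factor of 4; the compiler emits four
// copies of the loop body per unrolled iteration plus a remainder loop.
__global__ void doubleAll(float *b, int size)
{
    #pragma unroll 4
    for (int i = 0; i < size; i++)
        b[i] *= 2.0f;
}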
Posts like this explain that runtime loop unrolling cannot happen in CUDA. In your case, the CUDA compiler has no clue that size is going to be 100, so compile-time loop unrolling will not occur; if you force unrolling, you may end up hurting performance.

If you are sure that size is 100 for all executions, you can unroll your loop like below:

#pragma unroll
for (int i = 0; i < SIZE; i++)  // or simply for (int i = 0; i < 100; i++)
    b[i] = i;

where SIZE is known at compile time, for example as #define SIZE 100.

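Another way to make the trip count a compile-time constant (a sketch of my own, not from the original answer) is a template parameter, which also lets the same kernel be compiled for several sizes:

// Sketch: SIZE is a template parameter, hence a compile-time constant,
// so the compiler is free to fully unroll the loop.
template <int SIZE>
__global__ void kernelFixed(float *b)
{
    #pragma unroll
    for (int i = 0; i < SIZE; i++)
        b[i] = (float)i;
}

// launched as: kernelFixed<100><<<1, 100>>>(b);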
I also suggest adding proper CUDA error checking to your code (explained here).

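The link above is lost in this copy, but the usual pattern is a small macro wrapped around every runtime call. A minimal sketch (the macro name is my own), using the CUDA runtime's cudaError_t and cudaGetErrorString:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Sketch: abort with file/line information whenever a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**)&b, size * sizeof(float)));
// CUDA_CHECK(cudaMemcpy(b, a, size * sizeof(float), cudaMemcpyHostToDevice));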