CUDA thread execution order


Problem Description


I have the following code for a CUDA program:

#include <stdio.h>

#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 4

__global__ void hello()
{  

   printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);

}


int main(int argc,char **argv)
{
    // launch the kernel
    hello<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();

    // force the printf()s to flush
    cudaDeviceSynchronize();

    return 0;
}

in which every thread will print its threadIdx.x and blockIdx.x. One possible output of this program is this:

Hello. I'm a thread 0 in block 0
Hello. I'm a thread 1 in block 0
Hello. I'm a thread 2 in block 0
Hello. I'm a thread 3 in block 0
Hello. I'm a thread 0 in block 2
Hello. I'm a thread 1 in block 2
Hello. I'm a thread 2 in block 2
Hello. I'm a thread 3 in block 2
Hello. I'm a thread 0 in block 3
Hello. I'm a thread 1 in block 3
Hello. I'm a thread 2 in block 3
Hello. I'm a thread 3 in block 3
Hello. I'm a thread 0 in block 1
Hello. I'm a thread 1 in block 1
Hello. I'm a thread 2 in block 1
Hello. I'm a thread 3 in block 1

Running the program several times I get similar results, except that the block order is random. For example, in the above output we have the block order 0, 2, 3, 1; running the program again I get 1, 2, 3, 0. This is expected. However, the thread order within every block is always 0, 1, 2, 3. Why is this happening? I thought it would be random too.

I tried to change my code to force thread 0 in every block to take longer to execute. I did it like this:

__global__ void hello()
{  

    if (threadIdx.x == 0)
    {
        int k = 0;
        for ( int i = 0; i < 1000000; i++ )
        {
            k = k + 1;
        }
    }

   printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);

}

I would expect the thread order to be 1, 2, 3, 0. However, I got a similar result to the one shown above, where the thread order was always 0, 1, 2, 3. Why is this happening?

Solution

However, the thread order in every block is always 0, 1, 2, 3. Why is this happening? I thought it would be random too.

With 4 threads per block you are only launching one warp per block. A warp is the unit of execution (and scheduling, and resource assignment) in CUDA, not a thread. Currently, a warp consists of 32 threads.
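
As an aside (this is an illustrative addition, not part of the original answer), the 32-thread figure need not be hard-coded: the warp size can be read from the device properties at runtime. The minimal sketch below assumes device 0 and simply prints it:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query properties of device 0

    // prop.warpSize is 32 on current hardware, but querying it avoids
    // baking that assumption into the code.
    printf("warpSize = %d\n", prop.warpSize);

    return 0;
}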

This means that all 4 of your threads per block (since there is no conditional behavior in this case) are executing in lockstep. When they reach the printf function call, they all execute the call to that function in the same line of code, in lockstep.

So the question becomes, in this situation, how does the CUDA runtime dispatch these "simultaneous" function calls? The answer to that question is unspecified, but it is not "random". Therefore it's reasonable that the order of dispatch for operations within a warp does not change from run to run.

If you launch enough threads to create multiple warps per block, and probably also include some other code to disperse and/or "randomize" the behavior between warps, you should be able to see printf operations emanating from separate warps occurring in "random" order.
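
As a rough sketch of that suggestion (illustrative only, not code from the original post; the kernel name hello_warps, the launch configuration, and the loop count are arbitrary choices), the kernel below uses 64 threads per block so that each block contains two warps, and delays the whole first warp of each block. Because warps are scheduled independently, the delayed warp's printf calls can now land after those of the other warp, unlike the per-thread delay attempted inside a single warp:

#include <stdio.h>

#define NUM_BLOCKS 2
#define THREADS_PER_BLOCK 64   // two 32-thread warps per block

__global__ void hello_warps()
{
    int warp_id = threadIdx.x / warpSize;   // warpSize is a built-in device variable

    // Delay the whole first warp of each block. Since the delay applies to an
    // entire warp rather than a single thread within a warp, it can actually
    // change the order in which the warps reach printf.
    if (warp_id == 0)
    {
        volatile int k = 0;   // volatile keeps the busy loop from being optimized away
        for (int i = 0; i < 1000000; i++)
        {
            k = k + 1;
        }
    }

    printf("Hello. I'm thread %d (warp %d) in block %d\n",
           threadIdx.x, warp_id, blockIdx.x);
}

int main(void)
{
    hello_warps<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();
    cudaDeviceSynchronize();   // force the device-side printf buffer to flush
    return 0;
}

Running this a few times should show the two warps of each block swapping or interleaving their output from run to run, while the order of the 32 threads within each warp still stays fixed.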
