Empirically determining how many threads are in a warp


Question

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp-related CUDA device functions and without using benchmarking? If so, how?

Recommended answer

Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.

In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct (which has some warp-synchronous implications), but struggled with that and came up with this approach instead:

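#include <stdio.h>
#define LOOPS 100000

// Counter that every thread increments; volatile so each read goes to memory.
__device__ volatile int test2 = 0;
// Smallest counter jump seen by any thread; starts well above any expected warp size.
__device__ int test3 = 32767;

__global__ void kernel(){

  for (int i = 0; i < LOOPS; i++){
    long long time = clock64();  // only used by the optional delay line below
//    while (clock64() < (time + (threadIdx.x * 1000)));
    int start = test2;            // counter value before this thread's increment
    atomicAdd((int *)&test2, 1);
    int end = test2;              // counter value after the increment
    int diff = end - start;       // number of increments that landed in between
    atomicMin(&test3, diff);      // keep the smallest jump seen so far
  }
}

int main() {

  kernel<<<1, 1024>>>();
  int result;
  // The copy from the device symbol also waits for the kernel to finish.
  cudaError_t err = cudaMemcpyFromSymbol(&result, test3, sizeof(int));
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("result = %d threads\n", result);
  return 0;
}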

I compile with:

nvcc -O3 -arch=sm_20 -o t331 t331.cu



I call it "statistical" because it requires a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
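To watch the estimate change with the iteration count, here is a minimal sketch that passes the loop count as a kernel argument instead of a `#define`. The names `probe`, `counter`, `minDiff`, and `estimate` are mine and do not appear in the answer. My reading of the statistical effect is that, with few iterations, another warp's increments are more likely to land between a thread's two reads, which inflates the smallest observed jump. The sketch relies on the same implicit warp-synchronous execution as the kernel above, so on GPUs with independent thread scheduling (Volta and later) it may not settle on 32; no expected output is shown for that reason.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ volatile int counter = 0;  // plays the role of test2
__device__ int minDiff = 32767;       // plays the role of test3

// Same measurement as the original kernel, with the iteration count as a parameter.
__global__ void probe(int loops) {
  for (int i = 0; i < loops; i++) {
    int start = counter;
    atomicAdd((int *)&counter, 1);
    int end = counter;
    atomicMin(&minDiff, end - start);
  }
}

// Reset the minimum, launch one block of 1024 threads, and return the
// estimate. Returns -1 if any CUDA call fails. The counter itself is not
// reset because only differences between its values are used.
int estimate(int loops) {
  int big = 32767;
  int result = -1;
  if (cudaMemcpyToSymbol(minDiff, &big, sizeof(int)) != cudaSuccess) return -1;
  probe<<<1, 1024>>>(loops);
  // This copy waits for the kernel to finish and reports any launch error.
  if (cudaMemcpyFromSymbol(&result, minDiff, sizeof(int)) != cudaSuccess) return -1;
  return result;
}

int main() {
  const int counts[] = {1, 10, 100, 1000, 10000, 100000};
  for (int loops : counts)
    printf("LOOPS = %6d -> estimate = %d threads\n", loops, estimate(loops));
  return 0;
}
```

The loop counts in `main` are arbitrary; they are kept small enough that the running total in `counter` stays far below the range of a 32-bit `int`.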

We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1.

*My test case is CUDA 5, Quadro5000, RHEL 5.5.
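For reference, a sketch of the variant with the delay line enabled and `LOOPS` set to 1. The names `kernelStaggered` and `runStaggered` are hypothetical. The explanation in the comments is my own reading rather than something the answer states: under warp-synchronous execution a warp leaves the spin loop only when its slowest thread is done, so different warps reach the atomics at different times and rarely interleave. On hardware that schedules threads independently, this assumption may not hold and the result may differ from 32.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define LOOPS 1

__device__ volatile int test2 = 0;
__device__ int test3 = 32767;

__global__ void kernelStaggered() {
  for (int i = 0; i < LOOPS; i++) {
    long long time = clock64();
    // Spin for threadIdx.x * 1000 clock ticks before measuring.
    // Assumption: threads of one warp leave this loop together, so each
    // warp starts its atomics at a different time from the other warps.
    while (clock64() < time + threadIdx.x * 1000LL)
      ;
    int start = test2;
    atomicAdd((int *)&test2, 1);
    int end = test2;
    atomicMin(&test3, end - start);
  }
}

// Launch one block of 1024 threads and return the estimate (-1 on error).
int runStaggered() {
  int result = -1;
  kernelStaggered<<<1, 1024>>>();
  // This copy waits for the kernel to finish and reports any launch error.
  cudaError_t err = cudaMemcpyFromSymbol(&result, test3, sizeof(int));
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return -1;
  }
  return result;
}

int main() {
  printf("result = %d threads\n", runStaggered());
  return 0;
}
```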
