A question about the details about the distribution from blocks to SMs in CUDA


Problem Description

Let me take hardware with compute capability 1.3 as an example.

30 SMs are available, so at most 240 blocks (8 per SM on compute capability 1.3) can be resident at the same time. Given the limits on registers and shared memory, the actual number of resident blocks may be much lower. Blocks beyond the first 240 have to wait for hardware resources to become available.

My question is when the blocks beyond the first 240 are assigned to SMs: as soon as some of the first 240 blocks are completed, or only once all of the first 240 blocks have finished?

I wrote the following piece of code.

#include <stdio.h>
#include <cuda_runtime.h>

const int BLOCKNUM = 1024;
const int N = 240;

// Block 0 spins until block N has raised its flag; every block
// (via thread 0) raises its own flag when it finishes.
__global__ void kernel ( volatile int* mark ) {
    if ( blockIdx.x == 0 ) while ( mark[N] == 0 );
    if ( threadIdx.x == 0 ) mark[blockIdx.x] = 1;
}

int main() {
    int* mark;
    cudaMalloc ( ( void** ) &mark, sizeof ( int ) * BLOCKNUM );
    cudaMemset ( mark, 0, sizeof ( int ) * BLOCKNUM );
    kernel <<< BLOCKNUM, 1 >>> ( mark );
    cudaDeviceSynchronize(); // wait for the kernel; hangs here if it deadlocks
    cudaFree ( mark );
    return 0;
}

This code causes a deadlock and fails to terminate. But if I change N from 240 to 239, the code is able to terminate. (Block 239 belongs to the first wave of 240 resident blocks, so its flag gets set without any new block having to be scheduled; block 240 can only run once resources free up.) So I want to know some details about the scheduling of blocks.

Recommended Answer

On the GT200, it has been demonstrated through micro-benchmarking that new blocks are scheduled whenever an SM has retired all of the blocks it was currently running. So the answer is: when some blocks are finished, and the scheduling granularity is per-SM. There seems to be a consensus that Fermi GPUs have a finer scheduling granularity than previous generations of hardware.
