[CUDA] Determining threads per block, number of blocks, and grid size


Question

How do I determine the number of threads per block and the number of blocks on a CUDA device?

For example: I need to multiply two one-dimensional arrays A and B element by element and copy the result into array C.

int N = 10;  // arrays hold at most 10 elements
size_t size = N * sizeof(float);
...
cudaMalloc((void **)&a_d, size);
cudaMalloc((void **)&b_d, size);
cudaMalloc((void **)&c_d, size);
...
...
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

// How to determine the number of threads here???
int threadsPerBlock = ???
int noOfBlocks = ??

// Launch configuration order is <<<number of blocks, threads per block>>>
fmultiply<<<noOfBlocks, threadsPerBlock>>>(a_d, b_d, c_d);

// cudaMemcpy takes the destination first: copy the result from device (c_d) to host (c_h)
cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);
...
...

Recommended answer

It depends on the size of the data set you want to process and on the number of threads you want to assign to each block.

In your example there are only 10 output elements (20 input values in total, 10 each from A and B). A common starting point for good performance is about 256 threads per block. Here that is 246 more threads than the calculation needs, because each kernel thread processes one element from A and one from B and writes one result to C. The surplus threads must do nothing, which in practice means the kernel has to know N and skip any index at or beyond it.

Generally, the quick and easy way to work out how many blocks you need is to divide the total number of elements by the number of threads you want to assign to each block, rounding up so that a partial block is still launched. For the 10-element example this gives 1 block.

number of blocks = (TOTAL_ELEMENTS / NUMBER_OF_THREADS), rounded up
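To make the surplus-thread arithmetic concrete, here is a minimal host-side sketch in plain C++ that emulates what a one-dimensional launch does. It is not CUDA device code and not the asker's fmultiply kernel: the function name emulateMultiply and its parameters are hypothetical, and the two loops stand in for the blocks and threads that the GPU would run in parallel. It only illustrates how a global index is formed from the block and thread indices and why an index check is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of an element-wise multiply launch (hypothetical helper,
// not part of the CUDA API). Returns how many emulated threads had no element
// to process.
std::size_t emulateMultiply(const std::vector<float>& a,
                            const std::vector<float>& b,
                            std::vector<float>& c,
                            std::size_t numBlocks,
                            std::size_t threadsPerBlock) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    std::size_t idleThreads = 0;
    for (std::size_t blockIdx = 0; blockIdx < numBlocks; ++blockIdx) {
        for (std::size_t threadIdx = 0; threadIdx < threadsPerBlock; ++threadIdx) {
            // Same formula a 1-D kernel uses: blockIdx.x * blockDim.x + threadIdx.x
            const std::size_t i = blockIdx * threadsPerBlock + threadIdx;
            if (i < n) {
                c[i] = a[i] * b[i];  // one thread handles one element
            } else {
                ++idleThreads;       // surplus thread: the guard makes it do nothing
            }
        }
    }
    return idleThreads;
}
```

Called with three 10-element vectors, 1 block and 256 threads per block, the function fills all 10 results and reports 246 idle threads, matching the count in the answer. In a real kernel the same guard would be an if on the computed index, with N passed in as an extra argument.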


If the array is large, you should be able to set threadsPerBlock to 512.

int noOfBlocks = N / threadsPerBlock;

if (N % threadsPerBlock) noOfBlocks++;

Here N is the number of elements, not the byte count that size holds in the question. The second line handles the case where N is not a multiple of threadsPerBlock.

You can query the CUDA device for the maximum number of threads per block it supports, for example with cudaGetDeviceProperties, which reports it in the maxThreadsPerBlock field. Devices of compute capability 1.x are limited to 512; compute capability 2.0 and later allow 1024.
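The block-count calculation above can be sketched as a small helper. This is an illustrative sketch in plain C++ (usable unchanged as CUDA host code); the names blocksFor and blocksForCompact are hypothetical, and n is assumed to be an element count rather than a size in bytes. The second function shows an equivalent one-line form of the same rounding-up division.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= n
// (hypothetical helper; n is the element count, not a byte size).
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    int blocks = n / threadsPerBlock;        // full blocks
    if (n % threadsPerBlock != 0) ++blocks;  // one more block for the remainder
    return blocks;
}

// Equivalent single-expression form of the same ceiling division.
int blocksForCompact(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For the 10-element example, blocksFor(10, 256) gives 1; for 1000 elements with 512 threads per block it gives 2. The result would be passed as the first launch parameter and threadsPerBlock as the second.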

