Practice computing grid size for CUDA
Question
dim3 block(4, 2);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows.
Can you explain how the grid size is computed? Why is block.x added to nx and then 1 subtracted?
There is a preview (https://books.google.com/books?id=_Z7rnAEACAAJ&printsec=frontcover#v=onepage&q&f=false) but page 53 is missing.
Answer
This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. Mathematically it is ceil(nx / block.x) evaluated in real (not integer) arithmetic: figure out how many blocks are needed to cover the desired size, then round up.
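To make "minimum number of blocks that cover" concrete, here is a small host-side sketch in plain C (so it also compiles unchanged in a .cu file). The function names are hypothetical, chosen for illustration. One function shows what plain integer division gives, the other spells out the definition literally as the smallest g with g * block_size >= n.

```c
#include <assert.h>

/* What plain C integer division gives for unsigned operands: it rounds
   down, so it under-counts blocks whenever n is not a multiple of
   block_size. */
unsigned int blocks_rounded_down(unsigned int n, unsigned int block_size) {
    assert(block_size > 0);
    return n / block_size;
}

/* The grid size by definition: the smallest g with g * block_size >= n.
   The loop mirrors the definition for small illustrative values; it is
   not meant to be fast. */
unsigned int blocks_by_definition(unsigned int n, unsigned int block_size) {
    assert(block_size > 0);
    unsigned int g = 0;
    while (g * block_size < n)
        ++g;
    return g;
}
```

With nx = 10 and block.x = 4, blocks_rounded_down gives 2 blocks, which cover only 8 columns, while blocks_by_definition gives 3 blocks covering 12 columns.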
But a full floating-point division followed by ceil is more expensive than necessary. Instead, since C integer division truncates toward zero (which is a "floor" for the non-negative sizes used here), you can add divisor - 1 to the dividend before dividing to get the effect of a "ceiling" operation.
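A minimal sketch of the idiom as a reusable helper, again in plain host-side C. The names div_up, Grid2D and grid_for are hypothetical; Grid2D is only a stand-in for CUDA's dim3 so the example compiles without the CUDA toolkit, and in a .cu file you would use dim3 directly.

```c
#include <assert.h>

/* Ceiling division using only integer arithmetic: for n >= 0 and d >= 1,
   (n + d - 1) / d equals ceil(n / d) computed over the reals.
   Caveat: n + d - 1 wraps around if n is close to UINT_MAX. */
unsigned int div_up(unsigned int n, unsigned int d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

/* Hypothetical stand-in for dim3 (only the x and y fields). */
typedef struct {
    unsigned int x, y;
} Grid2D;

/* Same computation as the question's
   dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y); */
Grid2D grid_for(unsigned int nx, unsigned int ny,
                unsigned int block_x, unsigned int block_y) {
    Grid2D g = { div_up(nx, block_x), div_up(ny, block_y) };
    return g;
}
```

For example, grid_for(10, 5, 4, 2) yields a 3 x 3 grid for the question's block(4, 2). If the wrap-around caveat matters for your sizes, n / d + (n % d != 0) gives the same result for unsigned operands without forming n + d - 1.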
Try a few examples: if nx = 10, then nx + block.x - 1 is 13, and by integer division 13 / 4 = 3, so you need 3 blocks of size 4.
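Because the grid rounds up, 3 blocks of 4 threads launch 12 threads for only 10 columns, so a kernel using this grid typically guards its work with something like `if (ix < nx)`, where ix = blockIdx.x * blockDim.x + threadIdx.x. The sketch below (plain C, hypothetical function name) replays that index computation on the host to count how many threads land in range.

```c
/* Counts how many launched thread indices fall inside [0, nx), replaying
   ix = blockIdx.x * blockDim.x + threadIdx.x on the host.
   Threads with ix >= nx are the ones a guard like `if (ix < nx)` skips. */
unsigned int threads_in_range(unsigned int nx, unsigned int grid_x,
                              unsigned int block_x) {
    unsigned int in_range = 0;
    for (unsigned int b = 0; b < grid_x; ++b) {        /* plays blockIdx.x  */
        for (unsigned int t = 0; t < block_x; ++t) {   /* plays threadIdx.x */
            unsigned int ix = b * block_x + t;
            if (ix < nx)
                ++in_range;
        }
    }
    return in_range;
}
```

With nx = 10, grid.x = 3 and block.x = 4, all 10 columns are covered and 2 threads are idle; with only 2 blocks, columns 8 and 9 would never be touched.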
As you noted in the comment, the + block.x pushes the floor up to the ceiling, and the - 1 is for sizes that are exact multiples of the divisor. For example, with nx = 12, (12 + 4) / 4 would be 4, when we actually want (12 + 4 - 1) / 4, which is 3.
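Both cases (exact multiple or not) can be checked in one short derivation. Here n stands for nx and b for block.x, assuming integers n >= 0 and b >= 1, and writing n = qb + r with 0 <= r < b:

```latex
\left\lfloor \frac{n + b - 1}{b} \right\rfloor
  = \left\lfloor \frac{qb + r + b - 1}{b} \right\rfloor
  = q + \left\lfloor \frac{r + b - 1}{b} \right\rfloor

\text{If } r = 0:\quad \left\lfloor \frac{b - 1}{b} \right\rfloor = 0,
  \text{ so the result is } q = \frac{n}{b} = \left\lceil \frac{n}{b} \right\rceil.

\text{If } 1 \le r \le b - 1:\quad b \le r + b - 1 \le 2b - 2 < 2b,
  \text{ so the floor is } 1 \text{ and the result is } q + 1 = \left\lceil \frac{n}{b} \right\rceil.

\text{Adding } b \text{ instead of } b - 1 \text{ when } r = 0:\quad
  \left\lfloor \frac{qb + b}{b} \right\rfloor = q + 1, \text{ one block too many.}
```

The last line is exactly the (12 + 4) / 4 = 4 case: q = 3, r = 0, and the extra 1 adds a block that no column needs.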