How to optimize 2 identical kernels with 50% occupancy that could run concurrently in CUDA?

Problem Description

I have 2 identical kernels in CUDA that report 50% theoretical occupancy and could be run concurrently. However, calling them in different streams shows sequential execution.

Each kernel call has the grid and block dimensions as follows:

Grid(3, 568, 620)
Block(256, 1, 1 )
With 50 registers per thread.

This results in too many threads per SM and too many registers per block.

Should I focus my next optimization efforts on reducing the number of registers used by the kernel?

Or does it make sense to split the grid into many smaller grids, potentially allowing the 2 kernels to be issued and run concurrently? Would the number of registers per block still pose an issue here?

Note - deviceQuery reports:

MAX_REGISTERS_PER_BLOCK 65K
MAX_THREADS_PER_MULTIPROCESSOR 1024
NUMBER_OF_MULTIPROCESSORS 68
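
For reference, the attempted concurrent launch would look something like the following minimal sketch. The kernel name, its arguments, and the placeholder body are hypothetical stand-ins, since the actual kernel is not shown in the question:

```cuda
#include <cuda_runtime.h>

// Placeholder for the real kernel (50 registers/thread in the question).
__global__ void myKernel(const float *in, float *out) { /* ... */ }

void launchBoth(const float *in1, float *out1,
                const float *in2, float *out2) {
    dim3 grid(3, 568, 620);   // Grid(3, 568, 620) from the question
    dim3 block(256, 1, 1);    // Block(256, 1, 1)

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same kernel, different non-default streams: *eligible* for concurrent
    // execution, but only if free SM resources remain once the first launch
    // has filled the device.
    myKernel<<<grid, block, 0, s1>>>(in1, out1);
    myKernel<<<grid, block, 0, s2>>>(in2, out2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```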

Answer

I have 2 identical kernels in CUDA that report 50% theoretical occupancy ...

... and could be run concurrently

That isn't what occupancy implies and is not correct.

50% occupancy doesn't mean you have 50% unused resources which a different kernel could use concurrently. It means your code exhausted a resource when running 50% of the maximum theoretical number of concurrent warps. If you have exhausted a resource, you can't run any more warps, be they from that kernel or any other.
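
One way to confirm which resource is exhausted is the runtime occupancy API, which reports how many blocks of a given kernel can actually be resident per SM. A sketch (the kernel here is a hypothetical stand-in for the poster's):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(/* ... */) { /* placeholder */ }

int main() {
    int numBlocks = 0;
    const int blockSize = 256;   // Block(256, 1, 1) from the question

    // Asks the runtime how many blocks of myKernel fit on one SM,
    // given its actual register/shared-memory footprint.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, blockSize, 0 /* dynamic shared mem */);

    // At 50% occupancy on a 1024-thread/SM device, this should report
    // 2 resident blocks of 256 threads.
    printf("Resident blocks per SM: %d (%d of 1024 threads)\n",
           numBlocks, numBlocks * blockSize);
    return 0;
}
```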

However, calling them in different streams shows sequential execution.

That is exactly what should be expected, for the reasons above.

Each kernel call has the grid and block dimensions as follows:

Grid(3, 568, 620)
Block(256, 1, 1 )
With 50 registers per thread.

You gave a kernel which launches 1,056,480 blocks (3 × 568 × 620). That is several orders of magnitude more than even the largest GPUs can run concurrently, meaning that the scope for concurrent kernel execution for such an enormous grid is basically zero.

This results in too many threads per SM and too many registers per block.

The register pressure is probably what is limiting the occupancy.

Should I focus my next optimization efforts on reducing the number of registers used by the kernel?

Given that the goal of concurrent kernel execution is impossible, I would think the objective should be to make this kernel run as fast as possible. How you do that is code specific. In some cases register optimization can increase occupancy and performance, but sometimes all that happens is you get spills to local memory, which hurts performance.
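
If you do experiment with register reduction, the two standard knobs are sketched below. The kernel and its signature are illustrative stand-ins, and whether capping registers helps or just causes local-memory spills is code specific, as noted above:

```cuda
// 1. Per-kernel: __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
//    asks the compiler to meet a register budget for this kernel only.
//    With 256 threads/block and 4 resident blocks/SM requested, on a
//    device with 64K registers per SM the compiler targets at most
//    65536 / (256 * 4) = 64 registers per thread.
__global__ void __launch_bounds__(256, 4)
myKernel(const float *in, float *out) { /* ... */ }

// 2. Whole-compilation: cap registers for every kernel in the file, e.g.
//      nvcc -maxrregcount=40 kernel.cu
//    and inspect the resulting register (and spill) counts with:
//      nvcc -Xptxas -v kernel.cu
```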

Or does it make sense to split the grid into many smaller grids, potentially allowing the 2 kernels to be issued and run concurrently?

When you say "many" you would be implying thousands of grids, and that would imply so much launch and scheduling latency that I could not imagine any benefit in doing so, if you could manage to get to the point where concurrent kernel execution was possible.
