Profile concurrent CUDA kernels
Question
I am interested in getting memory performance counters for concurrent CUDA kernels. I tried several nvprof options, such as --metrics all and --print-gpu-trace. The output seems to indicate that the kernels are no longer concurrent: the metrics reported for each kernel look almost exactly the same as when each kernel runs alone, so I think the kernels ran in sequence. How can I get memory performance metrics (for example, L2 cache counters) for kernels that actually run concurrently?
Recommended answer
You cannot do per-kernel profiling while having the kernels execute concurrently. You can however try the following workarounds:
- Do only tracing. If you don't specify --metrics or --events, nvprof will only do a tracing run. In this case, nvprof will run the kernels concurrently, but you will only get kernel timings, not metric/event data.
- If you own an NVIDIA Tesla GPU (as opposed to GeForce or Quadro), you can use the CUPTI library's cuptiSetEventCollectionMode(CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS) API to sample the metrics you want while the kernels are running concurrently. However, this only gives you the aggregate metric/event data for each sampling interval, which means that you will not be able to correlate this data to individual kernels. CUPTI ships with a code sample called event_sampling that demonstrates how to use this API.
- Profile the metrics/events you want, and let the kernels serialize. For some metrics/events, you may be able to simply sum up the values to estimate behavior during concurrent execution.
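
To try the tracing-only workaround, it helps to have a small program whose kernels are able to overlap in the first place. The following is a minimal sketch, not code from the original question: the kernel scale, the helper launch_two_streams, and the sizes and iteration count are all hypothetical choices for illustration. It launches the same small kernel on two non-default streams so that a trace can show whether the two launches overlap.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by `factor` many times so the
// kernel runs long enough to be visible in a GPU trace.
__global__ void scale(float *data, int n, float factor, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v *= factor;
        data[i] = v;
    }
}

// Hypothetical helper: launches `scale` on two non-default streams and
// returns true if every CUDA call succeeded and both buffers still hold 1.0f
// (factor is 1.0f, so the values must not change).
bool launch_two_streams(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> host(n, 1.0f);
    float *dev[2] = {nullptr, nullptr};
    cudaStream_t stream[2] = {nullptr, nullptr};
    bool ok = true;

    for (int j = 0; j < 2 && ok; ++j) {
        ok = cudaStreamCreate(&stream[j]) == cudaSuccess &&
             cudaMalloc(&dev[j], bytes) == cudaSuccess &&
             cudaMemcpy(dev[j], host.data(), bytes,
                        cudaMemcpyHostToDevice) == cudaSuccess;
    }

    if (ok) {
        // Small grids leave resources free, which gives the two kernels a
        // chance to run at the same time on devices that support it.
        const int threads = 128;
        const int blocks = (n + threads - 1) / threads;
        for (int j = 0; j < 2; ++j)
            scale<<<blocks, threads, 0, stream[j]>>>(dev[j], n, 1.0f, 100000);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }

    for (int j = 0; j < 2 && ok; ++j) {
        ok = cudaMemcpy(host.data(), dev[j], bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; i < n && ok; ++i) ok = (host[i] == 1.0f);
    }

    for (int j = 0; j < 2; ++j) {
        cudaFree(dev[j]);  // freeing a null pointer is a no-op
        if (stream[j]) cudaStreamDestroy(stream[j]);
    }
    return ok;
}

int main() {
    bool ok = launch_two_streams(1 << 10);
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Assuming the file is saved as concurrent.cu (a hypothetical name), it can be built with nvcc -o concurrent concurrent.cu and traced with nvprof --print-gpu-trace ./concurrent. If the start times and durations of the two scale launches overlap in the trace, the kernels ran concurrently. Whether they actually overlap depends on the device and on how many resources each kernel uses. Adding --metrics or --events to the same command should, per the answer above, serialize the kernels again.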