How to calculate the achieved bandwidth of a CUDA kernel

Problem Description

I want a measure of how much of the peak memory bandwidth my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:

    ...
    float result = 0.0f;
    for (int k = 0; k < 4000; k++) {
        // in_data[index] and loc_mem[k]: 2 global reads per iteration (as counted below)
        result = (in_data[index] - loc_mem[k]) * (in_data[index] - loc_mem[k]);
        ....
    }
    out_data[index]  = result;        // 1 global write
    out_data2[index] = sqrt(result);  // 1 global write
    ...

I count 4000*2+2 accesses to global memory for each thread. With 1,000,000 threads, and all accesses being floats, that gives ~32 GB of global memory traffic (reads and writes added together). As my kernel only takes 0.1 s, I would achieve ~320 GB/s, which is higher than the maximum bandwidth, so there must be an error in my calculation / assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions (my back-of-envelope arithmetic is spelled out right after them):

  • What is my error?
  • Which accesses to global memory are cached and which are not?
  • Is it correct that I don't count accesses to registers, local, shared, and constant memory?
  • Can I use the CUDA profiler for easier and more accurate results? Which counters would I need to use? How would I need to interpret them?
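To make that arithmetic explicit, here it is as a small host-side calculation (assuming, as above, that every single access actually reaches global memory, with no caching or coalescing):

    /* Back-of-envelope check: assumes every one of the 4000*2+2 float accesses
       per thread actually reaches global memory (no caching, no coalescing). */
    #include <stdio.h>

    int main(void)
    {
        double accesses_per_thread = 4000.0 * 2.0 + 2.0;  /* reads in the loop + 2 writes */
        double num_threads         = 1000000.0;
        double bytes_per_access    = 4.0;                 /* sizeof(float)        */
        double kernel_time_s       = 0.1;                 /* measured kernel time */

        double total_bytes = accesses_per_thread * num_threads * bytes_per_access;
        printf("total traffic:      %.1f GB\n", total_bytes / 1e9);                     /* ~32 GB    */
        printf("apparent bandwidth: %.0f GB/s\n", (total_bytes / 1e9) / kernel_time_s); /* ~320 GB/s */
        return 0;
    }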

Profiler output (gputime and cputime are in microseconds, memtransfer is in bytes):

method              gputime    cputime  occupancy instruction warp_serial memtransfer
memcpyHtoD           10.944         17                                          16384
fill                  64.32         93          1       14556           0
fill                 64.224         83          1       14556           0
memcpyHtoD           10.656         11                                          16384
fill                 64.064         82          1       14556           0
memcpyHtoD          1172.96       1309                                        4194304
memcpyHtoD           10.688         12                                          16384
cu_more_regT      93223.906      93241          1    40716656           0
memcpyDtoH         1276.672       1974                                        4194304
memcpyDtoH         1291.072       2019                                        4194304
memcpyDtoH          1278.72       2003                                        4194304
memcpyDtoH             1840       3172                                        4194304

New question: when 4194304 bytes = 4 bytes * 1024*1024 data points = 4 MB and gpu_time ~= 0.1 s, I achieve a bandwidth of 10 * 40 MB/s = 400 MB/s. That seems very low. Where is the error?

p.s. Tell me if you need other counters for your answer.

sister question: How to calculate Gflops of a kernel

Solution

  • You do not really have 1,000,000 threads running at once. You do ~32 GB of global memory accesses, and the achieved bandwidth is determined by the threads currently running (reading) on the SMs and by the size of the data they read.
  • All global memory accesses are cached in L1 and L2 unless you tell the compiler to use uncached accesses.
  • I think so. The achieved bandwidth is related to global memory.
  • I recommend using the Visual Profiler to see the read/write/global memory bandwidth. It would be interesting if you posted your results :).

The default counters in the Visual Profiler give you enough information to get an idea of what your kernel is doing (memory bandwidth, shared memory bank conflicts, instructions executed, ...).
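
If you want to cross-check the profiler numbers by hand, a minimal sketch using CUDA events could look like the following; the copy kernel is only a placeholder, so substitute your own kernel and your own count of bytes moved:

    // Minimal sketch of a manual cross-check (not the profiler method):
    // time the kernel with CUDA events and divide the bytes you believe it
    // moves by the measured time. "copy_kernel" is only a placeholder that
    // reads and writes one float per thread.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void copy_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                 // 1 read + 1 write per thread
    }

    int main()
    {
        const int n = 1 << 20;                     // 1M floats
        float *d_in, *d_out;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

        double bytes = 2.0 * n * sizeof(float);    // bytes read + bytes written
        printf("achieved bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }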

Regarding your question, to calculate the achieved global memory throughput:

See the Compute Visual Profiler User Guide, DU-05162-001_v02, October 2010, page 56, Table 7, "Supported Derived Statistics":

Global memory read throughput in gigabytes per second. For compute capability < 2.0 this is calculated as

    (((gld_32 * 32) + (gld_64 * 64) + (gld_128 * 128)) * TPC) / gputime

and for compute capability >= 2.0 it is calculated as

    ((DRAM reads) * 32) / gputime
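
As an illustration only, this is how the pre-2.0 version of that formula turns counter values into GB/s. The counter values, the TPC count and gputime below are placeholders, so read the real ones from your profiler output (gputime is in microseconds); note the explicit unit conversion:

    /* Illustration of the pre-2.0 derived statistic above. The counter values,
       the TPC count and gputime are placeholders -- take the real ones from the
       profiler output for your kernel (gputime is reported in microseconds). */
    #include <stdio.h>

    int main(void)
    {
        double gld_32     = 0.0;        /* 32-byte global load transactions   */
        double gld_64     = 0.0;        /* 64-byte global load transactions   */
        double gld_128    = 1000000.0;  /* 128-byte global load transactions  */
        double tpc        = 10.0;       /* TPCs on the device (10 on a C1060) */
        double gputime_us = 93223.906;  /* kernel time from the profiler      */

        double bytes = ((gld_32 * 32.0) + (gld_64 * 64.0) + (gld_128 * 128.0)) * tpc;
        double gbps  = (bytes / 1e9) / (gputime_us / 1e6);   /* explicit unit conversion */
        printf("global memory read throughput: %.2f GB/s\n", gbps);
        return 0;
    }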

Hope this helps.
