GPU Memory bandwidth theoretical vs practical


Question

While profiling an algorithm running on the GPU, I feel that I'm hitting the memory bandwidth limit.

I have several complex kernels performing complicated operations (sparse matrix multiplication, reductions, etc.) and some very simple ones. When I calculate the total data read/written for each of them, it seems that all the significant kernels hit a ~79 GB/s bandwidth wall regardless of their complexity, while the theoretical bandwidth of the GPU is 112 GB/s (NVIDIA GTX 960).

The data set is very large, operating on vectors of ~10,000,000 float entries, so I get good measurements/statistics from clGetEventProfilingInfo between COMMAND_START and COMMAND_END. All the data remains in GPU memory while the algorithm runs, so there is virtually no host/device memory transfer (and none is measured by the profiling counters either).
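
For reference, a measurement along these lines would derive the per-kernel bandwidth from the profiling timestamps mentioned above (a minimal sketch; effective_gbps is an illustrative helper name, error checks are omitted, and the command queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>

/* Effective bandwidth in GB/s for a profiled command, given the bytes it moved. */
static double effective_gbps(cl_event evt, size_t bytes_moved)
{
    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    double seconds = (double)(t_end - t_start) * 1e-9;  /* timestamps are in nanoseconds */
    return (double)bytes_moved / seconds / 1e9;
}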

Even for a very simple kernel (see below) that computes x = x + alpha*b, where x and b are huge vectors of ~10,000,000 entries, I don't get close to the theoretical bandwidth (112 GB/s) but rather run at ~70% of the maximum (~79 GB/s).

__kernel void add_vectors(int N, __global float *x, __global float const *b, float factor)
{
    int gid = get_global_id(0);
    if (gid < N)
        x[gid] += b[gid] * factor;
}

I calculate the data transferred per run for this particular kernel as N * (2 + 1) * 4 bytes:

  • N - size of the vectors = ~10,000,000
  • 2 loads and 1 store per vector entry
  • sizeof(float) = 4
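
(For illustration: with these numbers each launch moves roughly 10,000,000 * 12 B ≈ 120 MB, which at the measured ~79 GB/s corresponds to about 1.5 ms per launch.)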

I expected that for such a simple kernel I would get much closer to the bandwidth limit. What am I missing?

P.S.: I get similar numbers from a CUDA implementation of the same algorithm.

Answer

I think a more realistic way to evaluate whether you have reached the peak bandwidth is to compare what you get with a simple D2D (device-to-device) copy.
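
Such a baseline could be measured along the following lines (a sketch, not code from the answer; buf_b and buf_x are assumed to be existing cl_mem buffers of `bytes` bytes, queue a profiling-enabled command queue, and effective_gbps the illustrative helper shown earlier):

cl_event copy_evt;
clEnqueueCopyBuffer(queue, buf_b, buf_x, 0, 0, bytes, 0, NULL, &copy_evt);
clWaitForEvents(1, &copy_evt);
/* The copy reads `bytes` and writes `bytes`, so it moves 2 * bytes in total. */
double copy_gbps = effective_gbps(copy_evt, 2 * bytes);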

For example, your kernel reads x and b once and writes x once, so the upper limit on its execution time should be 1.5x the time of copying from b to x once. If you find the time is much higher than 1.5x, you probably have room to improve. In this kernel the work is so simple that per-thread overhead (starting and ending the function, computing the index, etc.) may limit performance. If this is an issue, you may find that increasing the work per thread with a grid-stride loop helps:

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
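
A grid-stride version of the question's kernel could look something like this (a sketch; the kernel name is illustrative, and it would be launched with a global size smaller than N so that each work-item handles several elements):

__kernel void add_vectors_strided(int N, __global float *x, __global float const *b, float factor)
{
    int stride = (int)get_global_size(0);
    for (int i = (int)get_global_id(0); i < N; i += stride)
        x[i] += b[i] * factor;
}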

As for the theoretical bandwidth, you should at least take the overhead of ECC into account if it is enabled.
