GPU Memory bandwidth theoretical vs practical


Question

While profiling an algorithm running on the GPU, I feel that I'm hitting the memory bandwidth limit.

I have several complex kernels performing complicated operations (sparse matrix multiplication, reductions, etc.) and some very simple ones. When I calculate the total data read/written for each of them, it seems that all the significant kernels hit a ~79 GB/s bandwidth wall regardless of their complexity, while the theoretical bandwidth of the GPU is 112 GB/s (NVIDIA GTX 960).

The data set is very large, operating on vectors of ~10,000,000 float entries, so I get good measurements/statistics from clGetEventProfilingInfo between COMMAND_START and COMMAND_END. All the data remains in GPU memory while the algorithm runs, so there is virtually no host/device memory transfer (and none is measured by the profiling counters either).
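
For reference, a measurement along these lines would derive the per-kernel bandwidth from the profiling timestamps mentioned above (a minimal sketch; effective_gbps is an illustrative helper name, error checks are omitted, and the command queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>

/* Effective bandwidth in GB/s for a profiled command, given the bytes it moved. */
static double effective_gbps(cl_event evt, size_t bytes_moved)
{
    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    double seconds = (double)(t_end - t_start) * 1e-9;  /* timestamps are in nanoseconds */
    return (double)bytes_moved / seconds / 1e9;
}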

Even for a very simple kernel (see below) that computes x = x + alpha*b, where x and b are huge vectors of ~10,000,000 entries, I don't get close to the theoretical bandwidth (112 GB/s) but rather run at ~70% of the maximum (~79 GB/s).

__kernel void add_vectors(int N, __global float *x, __global float const *b, float factor)
{
    int gid = get_global_id(0);
    if (gid < N)
        x[gid] += b[gid] * factor;
}

I calculate the data transferred per run for this particular kernel as N * (2 + 1) * 4 bytes:

  • N - size of the vectors = ~10,000,000
  • 2 loads and 1 store per vector entry
  • sizeof(float) = 4
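
(For illustration: with these numbers each launch moves roughly 10,000,000 * 12 B ≈ 120 MB, which at the measured ~79 GB/s corresponds to about 1.5 ms per launch.)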

I expected that for such a simple kernel I would get much closer to the bandwidth limit. What am I missing?

P.S.: I get similar numbers from a CUDA implementation of the same algorithm.

Answer

I think a more realistic way to evaluate whether you have reached the peak bandwidth is to compare what you get with a simple D2D (device-to-device) copy.
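
Such a baseline could be measured along the following lines (a sketch, not code from the answer; buf_b and buf_x are assumed to be existing cl_mem buffers of `bytes` bytes, queue a profiling-enabled command queue, and effective_gbps the illustrative helper shown earlier):

cl_event copy_evt;
clEnqueueCopyBuffer(queue, buf_b, buf_x, 0, 0, bytes, 0, NULL, &copy_evt);
clWaitForEvents(1, &copy_evt);
/* The copy reads `bytes` and writes `bytes`, so it moves 2 * bytes in total. */
double copy_gbps = effective_gbps(copy_evt, 2 * bytes);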

For example, your kernel reads x and b once and writes x once, so the upper limit on its execution time should be 1.5x the time of copying from b to x once. If you find the time is much higher than 1.5x, you probably have room to improve. In this kernel the work is so simple that per-thread overhead (starting and ending the function, computing the index, etc.) may limit performance. If this is an issue, you may find that increasing the work per thread with a grid-stride loop helps:

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
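
A grid-stride version of the question's kernel could look something like this (a sketch; the kernel name is illustrative, and it would be launched with a global size smaller than N so that each work-item handles several elements):

__kernel void add_vectors_strided(int N, __global float *x, __global float const *b, float factor)
{
    int stride = (int)get_global_size(0);
    for (int i = (int)get_global_id(0); i < N; i += stride)
        x[i] += b[i] * factor;
}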

As for the theoretical bandwidth, you should at least take the overhead of ECC into account if it is enabled.
