Why is my CPU doing matrix operations faster than the GPU?


Problem description


When I tried to verify that the GPU outperforms the CPU at matrix operations, I got unexpected results. The CPU performs better than the GPU in my experiment, which confuses me.

I used the CPU and the GPU to do matrix multiplication separately. The programming environment is MXNet with cuda-10.1.

With GPU:

import mxnet as mx
from mxnet import nd
x = nd.random.normal(shape=(100000, 100000), ctx=mx.gpu())
y = nd.random.normal(shape=(100000, 100000), ctx=mx.gpu())
%timeit nd.dot(x, y)

50.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

With CPU:

x1 = nd.random.normal(shape=(100000, 100000), ctx=mx.cpu())
y1 = nd.random.normal(shape=(100000, 100000), ctx=mx.cpu())
%timeit nd.dot(x1, y1)

33.4 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Why is the CPU faster? My CPU is an Intel i5-6300HQ and my GPU is an Nvidia GTX 950M.

Solution

TLDR: Your matrix multiplication is actually not running :)

MXNet is an asynchronous framework that piles work requests into a queue, which its execution engine processes asynchronously on a need-to-run basis. So what you're measuring is only the time it took to send the request, not the time to execute it. That's why it is so small (microseconds on a 100k*100k matrix would be surprisingly fast) and roughly equal for CPU and GPU. To force execution, you need to add a call that forces production of a result, for example a print or nd.dot(x, y).wait_to_read(). See code very similar to your benchmark here: https://github.com/ThomasDelteil/MXNetParisWorkshop/blob/master/FromNDArrayToTrainedModel.ipynb
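As a minimal sketch of a synchronized benchmark (the 4096 x 4096 shape is a hypothetical size chosen so the matrices actually fit in GPU memory, since the 100000 x 100000 ones do not):

import mxnet as mx
from mxnet import nd

# Hypothetical size, small enough to fit in GPU memory
x = nd.random.normal(shape=(4096, 4096), ctx=mx.gpu())
y = nd.random.normal(shape=(4096, 4096), ctx=mx.gpu())

# wait_to_read() blocks until the dot product has actually been
# computed, so %timeit measures execution time, not enqueue time
%timeit nd.dot(x, y).wait_to_read()

Alternatively, mx.nd.waitall() blocks until every pending asynchronous operation has finished, which is useful when a benchmark issues several operations at once.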

Extra comments:

  1. The gain from using a GPU over a CPU scales with the amount of parallelism available. On simple tasks, that gain can be small to nonexistent. CPU core frequencies are actually 2 to 3 times higher than GPU frequencies (your i5-6300HQ runs at 2.3 GHz with a 3.2 GHz boost, while your GTX 950M runs at 0.9 GHz with a 1.1 GHz boost).

  2. MXNet ndarray is very fast at matrix algebra on the CPU, because (1) its asynchronous paradigm optimizes the order of computation, (2) its C++ backend runs things in parallel, and (3) I believe the default MXNet build ships with Intel MKL, which significantly boosts the linear-algebra performance of Intel CPUs (https://medium.com/apache-mxnet/mxnet-boosts-cpu-performance-with-mkl-dnn-b4b7c8400f98). Its ability to run computation on the GPU within the same API is also a big advantage over NumPy, for example.

  3. I don't think your test even runs on the GPU: instantiating such a big matrix on an NVIDIA Tesla V100 (16 GB of memory, 4x more than a GTX 950M) fails with a "large tensor size" error, as the memory calculation below shows.
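For scale, a quick back-of-the-envelope check (a sketch; it assumes MXNet's default dtype of float32, i.e. 4 bytes per element):

# One 100000 x 100000 float32 matrix:
elements = 100000 * 100000
bytes_total = elements * 4        # float32 = 4 bytes per element
print(bytes_total / 1024**3)      # ~37.3 GiB per matrix

At roughly 37 GiB per matrix, neither a GTX 950M (4 GB) nor a Tesla V100 (16 GB) can hold even one input, let alone two inputs plus the output.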
