numpy.dot 100 times slower than native C++11

Problem description

I have a Matlab background, and when I bought a laptop a year ago I carefully selected one with a lot of compute power: the machine has 4 cores and offers me 8 hardware threads at 2.4GHz. The machine proved itself to be very powerful, and using simple parfor loops I could utilize all the processor threads, which got me a speedup near 8 on many problems and experiments.

This nice Sunday I was experimenting with numpy. People often tell me that the core business of numpy is implemented efficiently using libblas, possibly even using multiple cores and libraries like OpenMP (with OpenMP you can create parfor-like loops using C-style pragmas).

This is the general approach for many numerical and machine learning algorithms: you express them using expensive high-level operations like matrix multiplication, but in an expressive, high-level language like Matlab or Python for comfort. Moreover, C(++) allows us to bypass the GIL.

So the cool part is that linear-algebra stuff should run really fast in Python whenever you use numpy. You just have the overhead of some function calls, but if the calculation behind them is large, that's negligible.

So, without even touching the topic that not everything can be expressed in linear algebra or other numpy operations, I gave it a spin:

t = time.time(); numpy.dot(range(100000000), range(100000000)); print(time.time() - t)
40.37656021118164

So, during these 40 seconds I saw ONE of the 8 threads on my machine working at 100%, while the others were near 0%. I didn't like this, but even with one thread working I'd expect this to run in approximately 0.something seconds. The dot product does 100M +'s and *'s, so at 2.4GHz a one-second run would leave 2400M / 100M = 24 clock ticks per element for one +, one * and whatever overhead.

Nevertheless, at 40 seconds the algorithm is spending 40 × 24 ≈ 1000 ticks (!!!!!) per element on the +, the * and the overhead. Let's do this in C++:

#include <iostream>

int main() {
  unsigned long long result = 0;
  for (unsigned long long i = 0; i < 100000000; i++)
    result += i * i;  // dot product of 0..100M-1 with itself
  std::cout << result << '\n';
}

BLITZ:

herbert@machine:~$ g++ -std=c++11 dot100M.cc 
herbert@machine:~$ time ./a.out
662921401752298880

real    0m0.254s
user    0m0.254s
sys 0m0.000s

0.254 seconds, almost 100 times faster than numpy.dot.

I thought that maybe the python3 range generator was the slow part, so I handicapped my C++11 implementation by storing all 100M numbers in a std::vector first (using iterative push_back's), and then iterating over it. This was a lot slower: it took a little under 4 seconds, which is still 10 times faster.

I installed numpy with 'pip3 install numpy' on Ubuntu; it compiled for some time, using both gcc and gfortran, and I saw mentions of blas header files scrolling through the compiler output.

For what reason is numpy.dot so extremely slow?

Answer

So your comparison is unfair. In your Python example you first generate two range objects, convert them to numpy arrays, and only then do the scalar product; the calculation itself takes the smallest share of the time. Here are the numbers for my computer:

>>> t=time.time();x=numpy.arange(100000000);numpy.dot(x,x);print time.time()-t
1.28280997276

And without the array generation:

>>> t=time.time();numpy.dot(x,x);print time.time()-t
0.124325990677

For completeness, the C++ version takes roughly the same time:

real    0m0.108s
user    0m0.100s
sys 0m0.007s
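
The same comparison can be repeated in Python 3 (the snippets above use Python 2's print statement). A minimal sketch, with a smaller N than the question's 100M so it finishes quickly, and time.perf_counter for timing:

```python
import time
import numpy as np

N = 1_000_000  # smaller than the question's 100M so the demo runs fast

# "Unfair" timing: includes building two Python ranges and converting
# them to numpy arrays before the dot product even starts.
t = time.perf_counter()
unfair = np.dot(np.array(range(N)), np.array(range(N)))
t_unfair = time.perf_counter() - t

# "Fair" timing: build the array once, then time only the dot product.
x = np.arange(N)
t = time.perf_counter()
fair = np.dot(x, x)
t_fair = time.perf_counter() - t

print(f"with conversion: {t_unfair:.4f}s  dot only: {t_fair:.4f}s")

# Both compute the sum of squares 0^2 + 1^2 + ... + (N-1)^2.
assert unfair == fair == (N - 1) * N * (2 * N - 1) // 6
```

On most machines the conversion should dominate by an order of magnitude or more, consistent with the measurements above.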
