numpy performance differences between Linux and Windows


Question

I am trying to run sklearn.decomposition.TruncatedSVD() on 2 different computers and understand the performance differences.

computer 1 (Windows 7, physical computer)

OS Name Microsoft Windows 7 Professional
System Type x64-based PC
Processor   Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 3401 Mhz, 4 Core(s), 
8 Logical Installed Physical Memory (RAM)   8.00 GB
Total Physical Memory   7.89 GB

computer 2 (Debian, on amazon cloud)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8

width: 64 bits
capabilities: ldt16 vsyscall32
*-core
   description: Motherboard
   physical id: 0
*-memory
   description: System memory
   physical id: 0
   size: 29GiB
*-cpu
   product: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
   vendor: Intel Corp.
   physical id: 1
   bus info: cpu@0
   width: 64 bits

computer 3 (Windows 2008R2, on amazon cloud)

OS Name Microsoft Windows Server 2008 R2 Datacenter
Version 6.1.7601 Service Pack 1 Build 7601
System Type x64-based PC
Processor   Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2500 Mhz, 
4 Core(s), 8 Logical Processor(s)
Installed Physical Memory (RAM) 30.0 GB

All of the computers are running Python 3.2 with identical sklearn, numpy, and scipy versions.

I ran cProfile as follows:

import cProfile
from sklearn.decomposition import TruncatedSVD

print(vectors.shape)
>>> (7500, 2042)

_decomp = TruncatedSVD(n_components=680, random_state=1)
global _o
_o = _decomp
cProfile.runctx('_o.fit_transform(vectors)', globals(), locals(), sort=1)

computer 1 output

>>>    833 function calls in 1.710 seconds
Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.767    0.767    0.782    0.782 decomp_svd.py:15(svd)
    1    0.249    0.249    0.249    0.249 {method 'enable' of '_lsprof.Profiler' objects}
    1    0.183    0.183    0.183    0.183 {method 'normal' of 'mtrand.RandomState' objects}
    6    0.174    0.029    0.174    0.029 {built-in method csr_matvecs}
    6    0.123    0.021    0.123    0.021 {built-in method csc_matvecs}
    2    0.110    0.055    0.110    0.055 decomp_qr.py:14(safecall)
    1    0.035    0.035    0.035    0.035 {built-in method dot}
    1    0.020    0.020    0.589    0.589 extmath.py:185(randomized_range_finder)
    2    0.018    0.009    0.019    0.010 function_base.py:532(asarray_chkfinite)
   24    0.014    0.001    0.014    0.001 {method 'ravel' of 'numpy.ndarray' objects}
    1    0.007    0.007    0.009    0.009 twodim_base.py:427(triu)
    1    0.004    0.004    1.710    1.710 extmath.py:232(randomized_svd)

computer 2 output

>>>    858 function calls in 40.145 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    2   32.116   16.058   32.116   16.058 {built-in method dot}
    1    6.148    6.148    6.156    6.156 decomp_svd.py:15(svd)
    2    0.561    0.281    0.561    0.281 decomp_qr.py:14(safecall)
    6    0.561    0.093    0.561    0.093 {built-in method csr_matvecs}
    1    0.337    0.337    0.337    0.337 {method 'normal' of 'mtrand.RandomState' objects}
    6    0.202    0.034    0.202    0.034 {built-in method csc_matvecs}
    1    0.052    0.052    1.633    1.633 extmath.py:183(randomized_range_finder)
    1    0.045    0.045    0.054    0.054 _methods.py:73(_var)
    1    0.023    0.023    0.023    0.023 {method 'argmax' of 'numpy.ndarray' objects}
    1    0.023    0.023    0.046    0.046 extmath.py:531(svd_flip)
    1    0.016    0.016   40.145   40.145 <string>:1(<module>)
   24    0.011    0.000    0.011    0.000 {method 'ravel' of 'numpy.ndarray' objects}
    6    0.009    0.002    0.009    0.002 {method 'reduce' of 'numpy.ufunc' objects}
    2    0.008    0.004    0.009    0.004 function_base.py:532(asarray_chkfinite)

computer 3 output

>>>         858 function calls in 2.223 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.956    0.956    0.972    0.972 decomp_svd.py:15(svd)
    2    0.306    0.153    0.306    0.153 {built-in method dot}
    1    0.274    0.274    0.274    0.274 {method 'normal' of 'mtrand.RandomState' objects}
    6    0.205    0.034    0.205    0.034 {built-in method csr_matvecs}
    6    0.151    0.025    0.151    0.025 {built-in method csc_matvecs}
    2    0.133    0.067    0.133    0.067 decomp_qr.py:14(safecall)
    1    0.032    0.032    0.043    0.043 _methods.py:73(_var)
    1    0.030    0.030    0.030    0.030 {method 'argmax' of 'numpy.ndarray' objects}
   24    0.026    0.001    0.026    0.001 {method 'ravel' of 'numpy.ndarray' objects}
    2    0.019    0.010    0.020    0.010 function_base.py:532(asarray_chkfinite)
    1    0.019    0.019    0.773    0.773 extmath.py:183(randomized_range_finder)
    1    0.019    0.019    0.049    0.049 extmath.py:531(svd_flip)

Notice the {built-in method dot} difference, from 0.035 s/call to 16.058 s/call: 450 times slower!

------+---------+---------+---------+---------+---------------------------------------
ncalls| tottime | percall | cumtime | percall | filename:lineno(function)  HARDWARE
------+---------+---------+---------+---------+---------------------------------------
1     |  0.035  |  0.035  |  0.035  |  0.035  | {built-in method dot}      Computer 1
2     | 32.116  | 16.058  | 32.116  | 16.058  | {built-in method dot}      Computer 2
2     |  0.306  |  0.153  |  0.306  |  0.153  | {built-in method dot}      Computer 3

I understand that there should be performance differences, but should they be that high?

Is there a way I can further debug this performance issue?

Edit

I tested a new computer, computer 3, whose hardware is similar to computer 2's but with a different OS.

The result for {built-in method dot} is 0.153 s per call, still 100 times faster than the Linux machine!

Edit 2

computer 1 numpy config

>>> np.__config__.show()
lapack_opt_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
    library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
blas_opt_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
    library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
openblas_info:
  NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
    library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
blas_mkl_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
    library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']
mkl_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
    library_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['C:/Program Files (x86)/Intel/Composer XE/mkl/include']

computer 2 numpy config

>>> np.__config__.show()
lapack_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Answer

{built-in method dot} is the np.dot function, which is a NumPy wrapper around the CBLAS routines for matrix-matrix, matrix-vector and vector-vector multiplication. Your Windows machines use the heavily tuned Intel MKL version of CBLAS. The Linux machine is using the slow old reference implementation.
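To isolate the BLAS performance from scikit-learn, you can time a dense np.dot directly on each machine. This is a minimal sketch; the matrix shapes are assumptions roughly matching the question's (7500, 2042) input and 680 components:

```python
import time
import numpy as np

# Hypothetical sizes, chosen to resemble the randomized-SVD step above.
rng = np.random.RandomState(1)
a = rng.rand(7500, 2042)
b = rng.rand(2042, 680)

t0 = time.time()
c = a.dot(b)          # dispatched to whatever BLAS NumPy was built with
elapsed = time.time() - t0
print("np.dot took %.3f s, result shape %s" % (elapsed, c.shape))
```

A tuned BLAS (MKL, OpenBLAS, ATLAS) should finish this well under a second on the hardware above, while the reference implementation can take orders of magnitude longer.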

If you install ATLAS or OpenBLAS (both available through Linux package managers) or, in fact, Intel MKL, you're likely to see massive speedups. Try sudo apt-get install libatlas-dev, check the NumPy config again to see whether it picked up ATLAS, and measure again.
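After installing a tuned BLAS, verify that NumPy actually linked against it, using the same config call shown in the question:

```python
import numpy as np

# Prints the BLAS/LAPACK build information. On the fast machines this
# lists MKL libraries; on the slow Debian box only the reference 'blas'
# in /usr/lib appears, which explains the slow np.dot.
np.__config__.show()
```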

Once you've decided on the right CBLAS library, you may want to recompile scikit-learn. Most of it just uses NumPy for its linear algebra needs, but some algorithms (notably k-means) use CBLAS directly.

The OS has nothing to do with it.
