Python Numpy: np.int32 "slower" than np.float64


Question

I would like to understand a strange behavior of Python. Let us consider a matrix M with shape 6000 x 2000. This matrix is filled with signed integers. I want to compute np.transpose(M)*M. Two options:

  • When I do it "naturally" (i.e. without specifying any dtype), numpy selects the type np.int32 and the operation takes around 150 s.
  • When I force the type to be np.float64 (using dtype=...), the same operation takes around 2 s.
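The two options can be sketched as follows (a minimal reproduction, not from the original question, using smaller matrices so it runs quickly; absolute timings will differ from the 150 s / 2 s above):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
M_int = rng.integers(0, 100, size=(600, 200), dtype=np.int32)  # "natural" int32
M_flt = M_int.astype(np.float64)                               # forced float64

for M in (M_int, M_flt):
    t0 = time.perf_counter()
    P = M.T @ M  # same operation as np.dot(np.transpose(M), M)
    dt = time.perf_counter() - t0
    print(f"dtype={M.dtype}: {dt:.4f} s, result dtype={P.dtype}")
```

The result dtype follows the input dtype, which is why the "natural" version ends up on the int32 code path.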

How can we explain this behavior? I was naively thinking that an int multiplication was cheaper than a float multiplication.

Thanks a lot for your help.

Answer

No, integer multiplies aren't cheaper. But more on that later. Most likely (I am 99% sure) numpy calls a BLAS routine under the hood, which can run at up to 90% of peak CPU performance. There are no special provisions for int matrix multiplies, so most likely it is done in Python rather than in a machine-compiled version - I am actually wrong on this, see below.

关于intfloat的速度:在大多数体系结构(Intel)上,它们大致相同,每条指令大约3-5个周期左右,均具有串行(X87)和矢量(XMM)版本.在桑迪桥上,PMUL***(整数向量乘法)为5个周期,MULP*(浮点乘法)也为5个周期.使用Sandy Bridge,您还具有256位SIMD向量操作(YMM)-每个指令可获得8个float操作-我不确定是否有int对应对象.

With regards to int vs float speed: on most architectures (Intel) they are roughly the same, around 3-5 cycles or so per instruction, both have serial (X87) and vector (XMM) version. On Sandy bridge, PMUL*** (integer vector multiply) is 5 cycles and so are the MULP* (floating multiplies). With Sandy Bridge you also have 256-bit SIMD vectors ops (YMM) - you get 8 float ops per instructions - I am not sure if there is an int counterpart.

This is a great reference: http://www.agner.org/optimize/instruction_tables.pdf

That said, instruction latencies don't explain a 75x speed difference. It is probably a combination of an optimized BLAS (probably threaded) and int32 being handled in Python rather than C/Fortran.
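Which BLAS your numpy was built against can be checked directly; np.show_config() is part of the public numpy API, though the exact output depends on the installation:

```python
import numpy as np

# Print the build configuration, including which BLAS/LAPACK
# libraries numpy was linked against (OpenBLAS, MKL, Netlib, ...).
np.show_config()
```

An unoptimized Netlib BLAS here would explain a large part of the gap compared to a threaded OpenBLAS or MKL build.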

I profiled the following snippet:

>>> F = (np.random.random((6000,2000))+4)
>>> I = F.astype(np.int32)
>>> np.dot(F, F.transpose()); np.dot(I, I.transpose())

and this is what oprofile says:

CPU_CLK_UNHALT...|
  samples|      %|
------------------
  2076880 51.5705 multiarray.so
  1928787 47.8933 libblas.so.3.0

However, this libblas is the unoptimized serial Netlib BLAS. With a good BLAS implementation that 47% would be much lower, especially if it is threaded.

It seems numpy does provide a compiled version of integer matrix multiply after all.
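A common workaround (not from the original answer) is to route the integer product through float64 so the multiply hits the fast BLAS path, then cast back. This is exact only while every product and partial sum stays below 2**53 (float64's integer-exact range), which holds comfortably for small entries like these:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.integers(0, 100, size=(600, 200), dtype=np.int32)

# Direct integer product (may take a slow non-BLAS path).
exact = M.T @ M

# Round-trip through float64 so the multiply uses BLAS.
# Exact only while every product/partial sum stays below 2**53;
# with entries < 100 and 600-term sums, that is comfortably true.
fast = (M.astype(np.float64).T @ M.astype(np.float64)).astype(np.int64)

assert np.array_equal(exact, fast)
```

For larger entries or longer inner dimensions the float64 round-trip can silently lose precision, so the 2**53 bound is worth checking before using this trick.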

