Python Numpy: np.int32 "slower" than np.float64
Question
I would like to understand a strange behavior of python. Let us consider a matrix M with shape 6000 x 2000. This matrix is filled with signed integers. I want to compute np.transpose(M)*M. Two options:
- When I do it "naturally" (i.e. without specifying any typing), numpy selects the type np.int32 and the operation takes around 150s.
- When I force the type to be np.float64 (using dtype=...), the same operation takes around 2s.
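The two timings above can be reproduced with a small sketch (sizes reduced here from the question's 6000 x 2000 so it runs quickly; the exact ratio depends heavily on your numpy/BLAS build):

```python
import time
import numpy as np

# Smaller shape than the question's 6000 x 2000 so this runs quickly.
M = np.random.randint(0, 10, size=(600, 200)).astype(np.int32)

t0 = time.perf_counter()
np.dot(M.T, M)                     # int32 product
t_int = time.perf_counter() - t0

t0 = time.perf_counter()
F = M.astype(np.float64)
np.dot(F.T, F)                     # float64 product
t_float = time.perf_counter() - t0

print(f"int32:   {t_int:.4f}s")
print(f"float64: {t_float:.4f}s")
```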
How can we explain this behavior? I was naively thinking that an int multiplication was cheaper than a float multiplication.
Thanks a lot for your help.
Answer
No, integer multiplies aren't cheaper. But more on that later.

Most likely (I am 99% sure) numpy calls a BLAS routine under the blankets, which can be as efficient as 90% of peak CPU performance. There are no special provisions for int matrix multiplies; most likely it is done in Python rather than as a machine-compiled version - I am actually wrong on this, see below.
With regards to int vs float speed: on most architectures (Intel) they are roughly the same, around 3-5 cycles per instruction; both have serial (x87) and vector (XMM) versions. On Sandy Bridge, PMUL*** (integer vector multiply) is 5 cycles, and so are the MULP* (floating-point multiplies). With Sandy Bridge you also have 256-bit SIMD vector ops (YMM) - you get 8 float ops per instruction - I am not sure if there is an int counterpart.
This here is a great reference: http://www.agner.org/optimize/instruction_tables.pdf
That said, instruction latencies don't explain the 75x speed difference. It is probably a combination of an optimized BLAS (probably threaded) and int32 being handled in Python rather than C/Fortran.
I profiled the following snippet:
>>> F = (np.random.random((6000,2000))+4)
>>> I = F.astype(np.int32)
>>> np.dot(F, F.transpose()); np.dot(I, I.transpose())
and this is what oprofile says:
CPU_CLK_UNHALT...|
samples| %|
------------------
2076880 51.5705 multiarray.so
1928787 47.8933 libblas.so.3.0
However the libblas here is the unoptimized serial Netlib BLAS. With a good BLAS implementation that 47% would be much lower, especially if it is threaded.
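To see which BLAS your own numpy build links against, numpy ships a build-configuration dump (the output format varies between numpy versions, so look for the blas/lapack entries):

```python
import numpy as np

# Print the build/link configuration, including which BLAS/LAPACK
# libraries this numpy installation was compiled against.
np.show_config()
```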
It seems numpy does provide a compiled version of integer matrix multiply.