Python Numpy: np.int32 "slower" than np.float64
Question
I would like to understand a strange behavior of python. Let us consider a matrix M with shape 6000 x 2000. This matrix is filled with signed integers. I want to compute np.transpose(M)*M. Two options:
- When I do it "naturally" (i.e. without specifying any typing), numpy selects the type np.int32 and the operation takes around 150s.
- When I force the type to be np.float64 (using dtype=...), the same operation takes around 2s.
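The two timings above can be reproduced with a small sketch (sizes reduced here from the question's 6000 x 2000 so it runs quickly; the exact ratio depends heavily on your numpy/BLAS build):

```python
import time
import numpy as np

# Smaller shape than the question's 6000 x 2000 so this runs quickly.
M = np.random.randint(0, 10, size=(600, 200)).astype(np.int32)

t0 = time.perf_counter()
np.dot(M.T, M)                     # int32 product
t_int = time.perf_counter() - t0

t0 = time.perf_counter()
F = M.astype(np.float64)
np.dot(F.T, F)                     # float64 product
t_float = time.perf_counter() - t0

print(f"int32:   {t_int:.4f}s")
print(f"float64: {t_float:.4f}s")
```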
How can we explain this behavior? I was naively thinking that an int multiplication was cheaper than a float multiplication.
Thanks a lot for your help.
Answer
No, integer multiplies aren't cheaper. But more on that later.

Most likely (I am 99% sure) numpy calls a BLAS routine under the blankets, which can be as efficient as 90% of peak CPU performance. There are no special provisions for int matrix multiplies; most likely it is done in Python rather than as a machine-compiled version - I am actually wrong on this, see below.
With regards to int vs float speed: on most architectures (Intel) they are roughly the same, around 3-5 cycles per instruction; both have serial (x87) and vector (XMM) versions. On Sandy Bridge, PMUL*** (integer vector multiply) is 5 cycles, and so are the MULP* (floating-point multiplies). With Sandy Bridge you also have 256-bit SIMD vector ops (YMM) - you get 8 float ops per instruction - I am not sure if there is an int counterpart.
This here is a great reference: http://www.agner.org/optimize/instruction_tables.pdf
That said, instruction latencies don't explain the 75x speed difference. It is probably a combination of an optimized BLAS (probably threaded) and int32 being handled in Python rather than C/Fortran.
I profiled the following snippet:
>>> F = (np.random.random((6000,2000))+4)
>>> I = F.astype(np.int32)
>>> np.dot(F, F.transpose()); np.dot(I, I.transpose())
and this is what oprofile says:
CPU_CLK_UNHALT...|
samples| %|
------------------
2076880 51.5705 multiarray.so
1928787 47.8933 libblas.so.3.0
However the libblas here is the unoptimized serial Netlib BLAS. With a good BLAS implementation that 47% would be much lower, especially if it is threaded.
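To see which BLAS your own numpy build links against, numpy ships a build-configuration dump (the output format varies between numpy versions, so look for the blas/lapack entries):

```python
import numpy as np

# Print the build/link configuration, including which BLAS/LAPACK
# libraries this numpy installation was compiled against.
np.show_config()
```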
It seems numpy does provide a compiled version of integer matrix multiply.