Most efficient way to multiply a small matrix with a scalar in numpy


Problem description

I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:

import numpy as np

large_dimension = 1000

a = np.random.random((1,))
b = np.random.random((1, large_dimension))

c = np.matmul(a, b)

In other words, multiplying matrix b with the scalar a[0].

I am looking for the most efficient way to compute this, since this operation is repeated millions of times.

I tested the performance of the two trivial ways to do this, and they are practically equivalent:

%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Is there a more efficient way to compute this?

  • Note: I cannot move these computations to a GPU, since the program uses multiprocessing and many such computations are done in parallel.
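One more variant worth benchmarking before reaching for a compiler: since the operation runs millions of times, the per-call output allocation of `a[0] * b` can be avoided by preallocating a buffer once and writing into it with NumPy's `out=` argument. A minimal sketch, assuming the output buffer can be reused across calls:

```python
import numpy as np

large_dimension = 1000
a = np.random.random((1,))
b = np.random.random((1, large_dimension))

# Allocate the result once, then reuse it on every call.
c = np.empty_like(b)

# np.multiply with out= scales b by the scalar a[0] in place,
# skipping the fresh array allocation that a[0] * b performs.
np.multiply(a[0], b, out=c)
```

This only helps when the same output shape is reused across iterations; the computation itself is identical to `a[0] * b`.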

Answer

In this case, it is probably faster to work with an element-wise multiplication, but the time you see is mostly Numpy overhead (calling C functions from the CPython interpreter, wrapping/unwrapping types, performing checks, doing the operation, allocating arrays, etc.).

since this operation is repeated millions of times

This is the problem. Indeed, the CPython interpreter is very bad at doing things with low latency. This is especially true when you work on Numpy types, as calling C code and performing checks for a trivial operation is much slower than doing it in pure Python, which is itself much slower than compiled native C/C++ code. If you really need this, and you cannot vectorize your code using Numpy (because you have a loop iterating over timesteps), then you should move away from CPython, or at least away from pure Python code. Instead, you can use Numba or Cython to mitigate the cost of the C calls, type wrapping, etc. If this is not enough, then you will need to write native C/C++ code (or any similar language), unless you find a dedicated Python package that does exactly that for you. Note that Numba is fast only when it works on native types or Numpy arrays (containing native types). If you work with a lot of pure Python types and you do not want to rewrite your code, then you can try the PyPy JIT.

Here is a simple example in Numba, written specifically for your case, that avoids the (costly) creation/allocation of a new array (as well as many Numpy internal checks and calls):

import numpy as np
import numba as nb

@nb.njit('void(float64[::1],float64[:,::1],float64[:,::1])')
def fastMul(a, b, out):
    # Scale row b by the scalar a[0], writing into a preallocated buffer.
    val = a[0]
    for i in range(b.shape[1]):
        out[0, i] = b[0, i] * val

res = np.empty(b.shape, dtype=b.dtype)
%timeit fastMul(a, b, res)
# 397 ns ± 0.587 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

At the time of writing, this solution is faster than all the others. As most of the time is spent calling into Numba and performing some internal checks, using Numba directly for the function containing the iteration loop should result in even faster code.
