对滞后列求和的最快方法 [英] Fastest way to sum columns with lag
问题描述
给定一个方阵,我想按行号移动每一行并对列求和.例如:
array([[0, 1, 2], array([[0, 1, 2],[3, 4, 5], ->[3, 4, 5], ->数组([0, 1+3, 2+4+6, 5+7, 8]) = 数组([0, 4, 12, 12, 8])[6, 7, 8]]) [6, 7, 8]])
我有 4 个解决方案 -
对于100x100
的数组,slow
比任何解决方案(不使用 numba
)都要快一点.sum_antidiagonals
仍然是 1000x1000
数组的最佳选择.
这里有一种有时比您的fast()
"更快的方法;版本,但仅在 n
的有限范围内(大约在 30 到 1000 之间),用于 n x n
数组.循环 (fast()
) 在大型数组上非常难以击败,即使使用 numba
,并且实际上渐近收敛到简单的时间a.sum(axis=0)
,这表明这与处理大型数组的效率差不多(感谢您的循环!)
我将称之为 sum_antidiagonals()
的方法在 a
的条纹版本上使用 np.add.reduce()
并在来自相对较小的一维阵列的制作蒙版上进行条纹化以创建二维阵列的错觉(不消耗更多内存).
此外,它不仅限于方形数组(但 fast()
也可以很容易地适应这种泛化,请参阅本文底部的 fast_g()
发帖).
def sum_antidiagonals(a):断言 a.flags.c_contiguousr, c = a.shapes0, s1 = a.stridesz = np.lib.stride_tricks.as_strided(a, shape=(r, c+r-1), strides=(s0 - s1, s1), writeable=False)# 面具kern = np.r_[np.repeat(False, r-1), np.repeat(True, c), np.repeat(False, r-1)]掩码 = np.fliplr(np.lib.stride_tricks.as_strided(字距,形状=(r,c+r-1),步幅=(1, 1),可写=假))返回 np.add.reduce(z, where=mask)
请注意,它不仅限于方形数组:
>>>sum_antidiagonals(np.arange(15).reshape(5,3))数组([ 0, 4, 12, 21, 30, 24, 14])
说明
要了解其工作原理,让我们通过一个示例来检查这些条带数组.
给定一个初始数组a
,即(3, 2)
:
a = np.arange(6).reshape(3, 2)>>>一种数组([[0, 1],[2, 3],[4, 5]])# 在函数中计算z后>>>z数组([[0, 1, 2, 3],[1, 2, 3, 4],[2, 3, 4, 5]])
你可以看到几乎就是我们想要的sum(axis=0)
,只是下三角和上三角是不需要的.我们真正想要总结的是:
array([[0, 1, -, -],[-, 2, 3, -],[-, -, 4, 5]])
输入掩码,我们可以从一维内核开始构建:
kern = np.r_[np.repeat(False, r-1), np.repeat(True, c), np.repeat(False, r-1)]>>>字距数组([假,假,真,真,假,假])
我们使用了一个有趣的切片:(1, 1)
,这意味着我们重复同一行,但每次滑动一个元素:
>>>np.lib.stride_tricks.as_strided(...字距,形状=(r,c+r-1),步幅=(1, 1),可写=假)数组([[假,假,真,真],[假,真,真,假],[对,对,错,错]])
然后我们只需将其向左/向右翻转,并将其用作 np.add.reduce()
的 where
参数.
速度
b = np.random.normal(size=(1000, 1000))# 检查与 OP 的 fast() 函数的等效性:>>>np.allclose(fast(b), sum_antidiagonals(b))真的%timeit sum_anti对角线(b)# 每个循环 1.83 ms ± 840 ns(平均值 ± 标准偏差,7 次运行,每次 1000 次循环)%timeit 快(b)# 每个循环 2.07 ms ± 15.2 µs(平均值 ± 标准偏差,7 次运行,每次 100 次循环)
在这种情况下,它会快一点,但只有大约 10%.
在 300x300 数组上,sum_antidiagonals()
比 fast()
快 2.27 倍.
但是!
即使将 z
和 mask
放在一起非常快(np.add.reduce()
之前的整个设置只需要 46 µs在上面的 1000x1000 示例中),求和本身是 O[r (r+c)]
,即使只有 O[rc]
实际加法(其中 mask== True
) 是必需的.因此,对于一个方形数组,需要完成大约 2 倍的操作.
在 10K x 10K 阵列上,这赶上了我们:
fast
需要 95 毫秒,而sum_antidiagonals
需要 208 毫秒.
尺寸范围比较
我们将使用可爱的
观察
- 如您所见,
sum_antidiagonals()
相对于fast()
的速度优势仅限于n
的范围,大约在 30 到 1000 之间. - 它永远胜过
numba
版本. just_sum_0()
,它只是沿axis=0
的总和(因此是一个很好的底线基准,几乎不可能被击败),只是稍微快一点对于大型阵列.这一事实表明fast()
与处理大型数组的速度差不多.- 出人意料的是,
numba
在达到一定大小后(也就是在前几次运行以烧入"LLVM 编译之后)会减损.我不知道为什么会这样,但它似乎对大型数组很重要.
其他功能的完整代码
(包括将fast
简单推广到非方阵)
from numba import njit@njit定义 nb_fast_ij(a):# numba 喜欢循环...r, c = a.shapez = np.zeros(c + r - 1, dtype=a.dtype)对于范围内的 i(r):对于范围(c)中的 j:z[i+j] += a[i, j]返回 z@njit定义 nb_fast_i(a):r, c = a.shapez = np.zeros(c + r - 1, dtype=a.dtype)对于范围内的 i(r):z[i:i+c] += a[i, :]返回 zdef fast_g(a):# 将 fast() 推广到非方形数组,也返回相同的 dtyper, c = a.shapez = np.zeros(c + r - 1, dtype=a.dtype)对于范围内的 i(r):z[i:i+c] += a[i]返回 z定义快速(A):# OP的代码n = A.shape[0]retval = np.zeros(2*n-1)对于范围(n)中的我:retval[i:(i+n)] += A[i, :]返回值def just_sum_0(a):# 用于基准比较返回 a.sum(axis=0)
Given a square matrix, I want to shift each row by its row-number and sum the columns. For example:
array([[0, 1, 2], array([[0, 1, 2],
[3, 4, 5], -> [3, 4, 5], -> array([0, 1+3, 2+4+6, 5+7, 8]) = array([0, 4, 12, 12, 8])
[6, 7, 8]]) [6, 7, 8]])
I have 4 solutions - fast
, slow
, slower
and slowest
, which do exactly the same thing and are ranked by speed:
def fast(A):
n = A.shape[0]
retval = np.zeros(2*n-1)
for i in range(n):
retval[i:(i+n)] += A[i, :]
return retval
def slow(A):
n = A.shape[0]
indices = np.arange(n)
indices = indices + indices[:,None]
return np.bincount(indices.ravel(), A.ravel())
def slower(A):
r, _ = A.shape
retval = np.c_[A, np.zeros((r, r), dtype=A.dtype)].ravel()[:-r].reshape(r, -1)
return retval.sum(0)
def slowest(A):
n = A.shape[0]
retval = np.zeros(2*n-1)
indices = np.arange(n)
indices = indices + indices[:,None]
np.add.at(retval, indices, A)
return retval
Surprisingly, the non-vectorized solution is the fastest. Here is my benchmark:
A = np.random.randn(1000,1000)
%timeit fast(A)
# 1.85 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit slow(A)
# 3.28 ms ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit slower(A)
# 4.07 ms ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit slowest(A)
# 58.4 ms ± 993 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Does there exist a faster solution? And if not, can someone explain why in fact fast
is the fastest?
EDIT
A slight speedup for slow
:
def slow(A):
n = A.shape[0]
indices = np.arange(2*n-1)
indices = np.lib.stride_tricks.as_strided(indices, A.shape, (8,8))
return np.bincount(indices.ravel(), A.ravel())
Plotting the runtime in the same way as Pierre did (with 2**15 as upper bound - for some reason slow
does not handle this size)
slow
is marginally faster than any solution (not using numba
) for arrays that are 100x100
. sum_antidiagonals
is still the best for 1000x1000
arrays.
Here is a way that is sometimes faster than your "fast()
" version, but only in a limited range of n
(roughly between 30 and 1000) for a n x n
array. The loop (fast()
) is very hard to beat on large arrays, even using numba
, and actually converges asymptotically to the time of simply a.sum(axis=0)
, which indicates that this is about as efficient as it gets for large arrays (kudos to your loop!)
The method, which I'll call sum_antidiagonals()
, uses np.add.reduce()
on a striped version of a
and on a made-up mask from a relatively small 1D array that is striped to create the illusion of a 2D array (without consuming more memory).
Additionally, it is not limited to square arrays (but fast()
can be easily adapted for this generalization as well, see fast_g()
at the bottom of this post).
def sum_antidiagonals(a):
assert a.flags.c_contiguous
r, c = a.shape
s0, s1 = a.strides
z = np.lib.stride_tricks.as_strided(
a, shape=(r, c+r-1), strides=(s0 - s1, s1), writeable=False)
# mask
kern = np.r_[np.repeat(False, r-1), np.repeat(True, c), np.repeat(False, r-1)]
mask = np.fliplr(np.lib.stride_tricks.as_strided(
kern, shape=(r, c+r-1), strides=(1, 1), writeable=False))
return np.add.reduce(z, where=mask)
Notice that it isn't limited to a square array:
>>> sum_antidiagonals(np.arange(15).reshape(5,3))
array([ 0, 4, 12, 21, 30, 24, 14])
Explanation
To understand how that works, let's examine these striped arrays with an example.
Given an initial array a
that is (3, 2)
:
a = np.arange(6).reshape(3, 2)
>>> a
array([[0, 1],
[2, 3],
[4, 5]])
# after calculating z in the function
>>> z
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5]])
You can see that it is almost what we want to sum(axis=0)
, except that the lower and upper triangles are unwanted. What we would really like to sum looks like:
array([[0, 1, -, -],
[-, 2, 3, -],
[-, -, 4, 5]])
Enter the mask, which we can build starting from a 1D kernel:
kern = np.r_[np.repeat(False, r-1), np.repeat(True, c), np.repeat(False, r-1)]
>>> kern
array([False, False, True, True, False, False])
We use a funny slice: (1, 1)
, which means that we repeat the same row, but sliding by one element each time:
>>> np.lib.stride_tricks.as_strided(
... kern, shape=(r, c+r-1), strides=(1, 1), writeable=False)
array([[False, False, True, True],
[False, True, True, False],
[ True, True, False, False]])
We then just flip this left/right, and use it as the where
argument for np.add.reduce()
.
Speed
b = np.random.normal(size=(1000, 1000))
# check equivalence with the OP's fast() function:
>>> np.allclose(fast(b), sum_antidiagonals(b))
True
%timeit sum_antidiagonals(b)
# 1.83 ms ± 840 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit fast(b)
# 2.07 ms ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In this case, it is a little bit faster, but only by about 10%.
On a 300x300 array, sum_antidiagonals()
is 2.27x faster than fast()
.
However!
Even though putting together z
and mask
is very fast (the whole setup before np.add.reduce()
takes only 46 µs in the 1000x1000 example above), the summation itself is O[r (r+c)]
, even though only O[r c]
actual additions (where mask == True
) are needed. There is therefore about 2x more operations done for a square array.
On a 10K x 10K array, this catches up with us:
fast
takes 95ms, whereassum_antidiagonals
takes 208 ms.
Comparison through range of sizes
We'll use the lovely perfplot
package to compare speeds of a number of approaches through ranges of n
:
perfplot.show(
setup=lambda n: np.random.normal(size=(n, n)),
kernels=[just_sum_0, fast, fast_g, nb_fast_i, nb_fast_ij, sum_antidiagonals],
n_range=[2 ** k for k in range(3, 16)],
equality_check=None, # because of just_sum_0
xlabel='n',
relative_to=1,
)
Observations
- As you can see,
sum_antidiagonals()
speed advantage overfast()
is limited to a range ofn
roughly between 30 and 1000. - It never beats the
numba
versions. just_sum_0()
, which is simply the summation alongaxis=0
(and thus a good bottom-line benchmark, almost impossible to beat), is only marginally faster for large arrays. That fact indicates thatfast()
is about as fast as it will get for large arrays.- Surprisingly,
numba
detracts after a certain size (and that is after a first few runs to "burn in" the LLVM compilation). I am not sure why that is the case, but it seems to become significant for large arrays.
Full code for the other functions
(Including a simple generalization of fast
to non-square arrays)
from numba import njit
@njit
def nb_fast_ij(a):
# numba loves loops...
r, c = a.shape
z = np.zeros(c + r - 1, dtype=a.dtype)
for i in range(r):
for j in range(c):
z[i+j] += a[i, j]
return z
@njit
def nb_fast_i(a):
r, c = a.shape
z = np.zeros(c + r - 1, dtype=a.dtype)
for i in range(r):
z[i:i+c] += a[i, :]
return z
def fast_g(a):
# generalizes fast() to non-square arrays, also returns the same dtype
r, c = a.shape
z = np.zeros(c + r - 1, dtype=a.dtype)
for i in range(r):
z[i:i+c] += a[i]
return z
def fast(A):
# the OP's code
n = A.shape[0]
retval = np.zeros(2*n-1)
for i in range(n):
retval[i:(i+n)] += A[i, :]
return retval
def just_sum_0(a):
# for benchmarking comparison
return a.sum(axis=0)
这篇关于对滞后列求和的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!