numpy, get maximum of subsets

Problem description

I have an array of values, say v (e.g. v=[1,2,3,4,5,6,7,8,9,10]), and an array of group indexes, say g (e.g. g=[0,0,0,0,1,1,1,1,2,2]).

I know, for instance, how to take the first element of each group in a very numpythonic way:

import numpy as np
v=np.array([1,2,3,4,74,73,72,71,9,10])
g=np.array([0,0,0,0,1,1,1,1,2,2])
mask=np.concatenate(([True],np.diff(g)!=0))
v[mask]

which returns:

array([1, 74, 9])

Is there a numpythonic way (avoiding explicit loops) to get the maximum of each subset?

Since I received two good answers, one using Python's map and one using a numpy routine, and I was looking for the best-performing one, here are some timing tests:

import numpy as np
import time
N=10000000
v=np.arange(N)
Nelemes_per_group=10
Ngroups=N//Nelemes_per_group  # integer division, so the group labels stay integers
s=np.arange(Ngroups)
g=np.repeat(s,Nelemes_per_group)

start1=time.time()
r=np.maximum.reduceat(v, np.unique(g, return_index=True)[1])
end1=time.time()
print('END first method, T=',(end1-start1),'s')

start3=time.time()
np.array(list(map(np.max,np.split(v,np.where(np.diff(g)!=0)[0]+1))))
end3=time.time()
print('END second method,  (map returns an iterable) T=',(end3-start3),'s')

The results are:

END first method, T= 1.6057236194610596 s
END second method,  (map returns an iterable) T= 8.346540689468384 s

Interestingly, most of the slowdown of the map method shows up in the list() call. I cannot drop that call, though, because in Python 3.x map returns an iterator: https://docs.python.org/3/library/functions.html#map
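
To illustrate the point about map, here is a small sketch on the example arrays from the top of the question. In Python 3, map only builds a lazy iterator; the np.max calls run when list() consumes it, which is likely why the time appears to be spent in list(). The variable names chunks, lazy and result are chosen for this illustration only.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 74, 73, 72, 71, 9, 10])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

# Split v at the positions where the group label changes.
chunks = np.split(v, np.where(np.diff(g) != 0)[0] + 1)

lazy = map(np.max, chunks)      # builds an iterator; no maximum is computed yet
print(type(lazy).__name__)      # the object is a map iterator, not a list

result = np.array(list(lazy))   # the np.max calls actually run here
print(result)
```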

Recommended answer

You can use np.maximum.reduceat:

>>> _, idx = np.unique(g, return_index=True)
>>> np.maximum.reduceat(v, idx)
array([ 4, 74, 10])

More about the workings of the ufunc reduceat method can be found in the NumPy documentation.
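
As a minimal sketch of what reduceat computes in this case: for increasing start indices idx, np.maximum.reduceat(v, idx) takes the maximum of each slice v[idx[i]:idx[i+1]], with the last slice running to the end of v. The helper group_max_loop below is a hypothetical name introduced only to show the equivalence with an explicit loop; it is not meant as a replacement.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 74, 73, 72, 71, 9, 10])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

# Start index of each group: np.unique reports the first occurrence of each value.
_, idx = np.unique(g, return_index=True)

def group_max_loop(values, starts):
    """Hypothetical reference: max of each slice between consecutive start indices."""
    ends = list(starts[1:]) + [len(values)]
    return np.array([values[s:e].max() for s, e in zip(starts, ends)])

fast = np.maximum.reduceat(v, idx)   # vectorized version
slow = group_max_loop(v, idx)        # explicit loop, for comparison only
print(fast)
print(slow)
```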

A remark on performance

np.maximum.reduceat is very fast. Generating the indices idx is what takes most of the time here.

While _, idx = np.unique(g, return_index=True) is an elegant way to get the indices, it is not particularly quick.

The reason is that np.unique needs to sort the array first, which is O(n log n) in complexity. For large arrays, this is much more expensive than generating idx with a few O(n) operations.

Therefore, for large arrays in which equal values of g are contiguous (as in the example), it is much faster to use the following instead:

idx = np.concatenate([[0], 1+np.diff(g).nonzero()[0]])
np.maximum.reduceat(v, idx)
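
A quick check, on the small example from the question, that the diff-based indices match the ones from np.unique. This is only a sketch and assumes that g is sorted, so that each group occupies one contiguous run; for an unsorted g the two approaches are no longer interchangeable. The names idx_diff and idx_unique are introduced here for the comparison.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 74, 73, 72, 71, 9, 10])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

# A group starts at position 0 and wherever g changes from one element to the next.
idx_diff = np.concatenate([[0], 1 + np.diff(g).nonzero()[0]])

# Reference: first occurrence of each distinct value (np.unique sorts internally).
_, idx_unique = np.unique(g, return_index=True)

print(idx_diff)                               # group start positions
print(np.array_equal(idx_diff, idx_unique))   # both approaches agree on this input
print(np.maximum.reduceat(v, idx_diff))       # per-group maxima
```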
