Group by max or min in a numpy array
Problem description
I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min. In SQL, this would be a typical aggregation query such as SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id. Is there a way to avoid Python loops and do this in a vectorized manner, or do I have to drop down to C?
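For reference, the desired result on the sample data can be pinned down with a plain Python loop. This is only a sketch of the slow baseline the question hopes to avoid, not a proposed solution; the variable names (ids, loop_max, loop_min) are illustrative.

```python
import numpy as np

# Sample data from the table above
ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# One boolean mask per distinct id: simple, but the loop runs in Python
unique_ids = np.unique(ids)
loop_max = np.array([data[ids == i].max() for i in unique_ids])
loop_min = np.array([data[ids == i].min() for i in unique_ids])

print(unique_ids.tolist())  # [1, 2, 3]
print(loop_max.tolist())    # [7, 10, 1]
print(loop_min.tolist())    # [2, 8, -10]
```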
Recommended answer
I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it will most likely be faster than anything you can do in a Python loop. The idea is to sort by group first and by value second, so that the first element of each group is its minimum and the last element is its maximum.
import numpy as np

def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # mark the last element of each group instead of the first
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
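When id is already sorted, as in the question, the sort can probably be skipped. The sketch below is not part of the answer above: it swaps in a different technique, ufunc.reduceat, which reduces each slice between consecutive start indices. The helper name group_reduce_sorted is hypothetical, and the function assumes that equal ids are contiguous.

```python
import numpy as np

def group_reduce_sorted(groups, data, ufunc=np.maximum):
    """Reduce data within each run of equal ids.

    Assumption: groups is already sorted (equal ids are contiguous).
    """
    groups = np.asarray(groups)
    data = np.asarray(data)
    if groups.size == 0:
        return data[:0]  # nothing to reduce
    # True at the first element of each run of equal ids
    is_start = np.concatenate(([True], groups[1:] != groups[:-1]))
    starts = np.flatnonzero(is_start)
    # reduceat applies the ufunc to data[starts[i]:starts[i + 1]]
    return ufunc.reduceat(data, starts)

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

max_per_id = group_reduce_sorted(ids, data, np.maximum)
min_per_id = group_reduce_sorted(ids, data, np.minimum)
print(max_per_id.tolist())  # [7, 10, 1]
print(min_per_id.tolist())  # [2, 8, -10]
```

Unlike the lexsort version, this variant gives wrong groupings if the ids are not contiguous, so it only fits inputs shaped like the one in the question.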