NumPy: sum one array based on values in another array for each matching element in a third array
Question
I have two NumPy arrays, one containing values and one containing each value's category.
import numpy as np
values=np.array([1,2,3,4,5,6,7,8,9,10])
valcats=np.array([101,301,201,201,102,302,302,202,102,301])
I have another array containing the unique categories I'd like to sum across.
categories=np.array([101,102,201,202,301,302])
My issue is that I will be running this same summing process a few billion times and every microsecond matters.
My current implementation is as follows.
catsums=[]
for x in categories:
    catsums.append(np.sum(values[np.where(valcats==x)]))
The resulting sums should be:
[1, 14, 7, 8, 12, 13]
My current run time is about 5 µs. I am still somewhat new to Python and was hoping to find a fast solution, potentially by combining the first two arrays, using a lambda, or something cool I don't even know about.
Thanks for reading!
Recommended answer
@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't have the unique values already defined, I'd use mine.
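The other answer is not reproduced here, so the following is only a sketch of one common way to use a predefined category array, not necessarily what @Divakar posted. It assumes `categories` is sorted and that every entry of `valcats` appears in it; `np.searchsorted` then maps each label to its position, and `np.bincount` sums the values per position.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
# Assumption: categories is sorted and contains every label in valcats.
categories = np.array([101, 102, 201, 202, 301, 302])

# Position of each label within the sorted categories array.
idx = np.searchsorted(categories, valcats)

# Weighted count per position; minlength keeps categories with no values as 0.
catsums = np.bincount(idx, weights=values, minlength=len(categories)).astype(values.dtype)
print(catsums)  # [ 1 14  7  8 12 13]
```

The output is already in the order of `categories`, which matches the sums expected in the question.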
I'd use pd.factorize to factorize the categories. Then use np.bincount with the weights parameter set to the values array.
import pandas as pd
f, u = pd.factorize(valcats)
np.bincount(f, values).astype(values.dtype)
array([ 1, 12, 7, 14, 13, 8])
pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.
np.column_stack([u, np.bincount(f, values).astype(values.dtype)])
array([[101,   1],
       [301,  12],
       [201,   7],
       [102,  14],
       [302,  13],
       [202,   8]])
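Note that these sums come back in order of first appearance in valcats, not in the sorted order of the categories array from the question. A minimal sketch of one way to line them up, assuming sorted order is what you want (the names `sums` and `order` are only illustrative):

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# f holds an integer code per element, u the unique labels in first-seen order.
f, u = pd.factorize(valcats)
sums = np.bincount(f, weights=values).astype(values.dtype)

# Sort only the (small) array of unique labels, then reorder the sums to match.
order = np.argsort(u)
print(u[order])     # [101 102 201 202 301 302]
print(sums[order])  # [ 1 14  7  8 12 13]
```

The sort here touches only the unique labels, so its cost depends on the number of categories rather than on the length of values.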
You can also use a pd.Series to label the results:
f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)
101     1
301    12
201     7
102    14
302    13
202     8
dtype: int64
Why pd.factorize and not np.unique?
We could have used
u, f = np.unique(valcats, return_inverse=True)
But np.unique sorts the values, and that runs in O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger data sets, pd.factorize should come out ahead on performance.
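To check this on your own machine, here is a rough timing sketch. The array size, the number of categories, and the seed are arbitrary assumptions, and the helper names `with_unique` and `with_factorize` are made up for illustration. Actual timings depend on hardware and on the NumPy and pandas versions, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # arbitrary seed
n = 200_000                               # hypothetical data size
valcats = rng.integers(0, 1000, size=n)   # hypothetical category labels
values = rng.integers(1, 11, size=n)

def with_unique():
    # Sort-based: u is sorted, f maps each element to its index in u.
    u, f = np.unique(valcats, return_inverse=True)
    return u, np.bincount(f, weights=values).astype(values.dtype)

def with_factorize():
    # Hash-based: u is in order of first appearance.
    f, u = pd.factorize(valcats)
    return u, np.bincount(f, weights=values).astype(values.dtype)

# Both approaches should agree once the factorize result is put in sorted order.
u1, s1 = with_unique()
u2, s2 = with_factorize()
order = np.argsort(u2)
same = np.array_equal(u1, u2[order]) and np.array_equal(s1, s2[order])
print("results agree:", same)

print("np.unique   :", timeit.timeit(with_unique, number=5))
print("pd.factorize:", timeit.timeit(with_factorize, number=5))
```

For arrays as small as the ten-element example in the question, fixed call overhead is likely to matter more than the asymptotic difference, so it is worth timing at the size you actually run.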