Python group by array a, and summarize array b - Performance


Problem Description


Given two unordered arrays of same lengths a and b:

a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]

I'd like to group by the elements in a:

aResult = [7,3,5]

summing over the elements in b (Example used to summarize a probability density function):

bResult = [0.2 + 0.1 + 0.2, 0.1, 0.3 + 0.1] = [0.5, 0.1, 0.4]

Alternatively, random a and b in python:

import numpy as np
a = np.random.randint(1,10,10000)
b = np.array([1./len(a)]*len(a))

I have two approaches, both of which are surely far from the lower performance bound. Approach 1 (at least nice and short). Time: 0.769315958023

def approach_1(a, b):
    aResult = np.unique(a)
    # For each distinct value n, sum the entries of b where a equals n
    # (b[n == a] selects them with a boolean mask).
    bResult = [sum(b[n == a]) for n in aResult]
    return aResult, bResult

Approach 2 (itertools.groupby over a sorted structured array, horribly slow). Time: 4.65299129486

from itertools import groupby

def approach_2(a, b):
    # Pair the arrays, sort by a, then group consecutive equal keys.
    tmp = [(a[i], b[i]) for i in range(len(a))]
    tmp2 = np.array(tmp, dtype=[('a', float), ('b', float)])
    tmp2 = np.sort(tmp2, order='a')

    bResult = []
    aResult = []
    for key, group in groupby(tmp2, lambda x: x[0]):
        aResult.append(key)
        bResult.append(sum(i[1] for i in group))
    return aResult, bResult

Update: Approach 3, by Pablo. Time: 1.0265750885

from collections import defaultdict

def approach_Pablo(a, b):
    # Accumulate the sums in a dict keyed by the values of a.
    pdf = defaultdict(float)
    for x, y in zip(a, b):
        pdf[x] += y
    return pdf

Update: Approach 4, by Unutbu. Time: 0.184849023819 [WINNER SO FAR, but only for integer a]

def unique_Unutbu(a, b):
    # bincount sums the weights in b at each integer value of a.
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult
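The timings quoted above can be reproduced with a harness along these lines (a sketch; the absolute numbers depend on the machine, only the relative ordering is meaningful):

```python
import timeit

import numpy as np

# Same random input as in the question.
a = np.random.randint(1, 10, 10000)
b = np.array([1.0 / len(a)] * len(a))

def unique_Unutbu(a, b):
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    return aResult, x[aResult]

# Time 100 calls; divide by the repeat count for a per-call figure.
elapsed = timeit.timeit(lambda: unique_Unutbu(a, b), number=100)
print(elapsed / 100)
```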

Maybe someone finds a smarter solution to this problem than me :)

Solution

If a is composed of ints < 2**31-1 (that is, if a has values that can fit in dtype int32), then you could use np.bincount with weights:

import numpy as np
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]

x = np.bincount(a, weights=b)
print(x)
# [ 0.   0.   0.   0.1  0.   0.4  0.   0.5]

print(x[[7,3,5]])
# [ 0.5  0.1  0.4]

np.unique(a) returns [3 5 7], so the result appears in a different order:

print(x[np.unique(a)])
# [ 0.1  0.4  0.5]
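If the first-occurrence order [7, 3, 5] from the question is wanted instead of sorted order, one option (an illustration, not part of the original answer) is np.unique's return_index:

```python
import numpy as np

a = np.array([7, 3, 5, 7, 5, 7])
b = np.array([0.2, 0.1, 0.3, 0.1, 0.1, 0.2])

x = np.bincount(a, weights=b)
# return_index gives the position of each distinct value's first
# occurrence; argsort over those positions restores input order.
vals, first = np.unique(a, return_index=True)
aResult = vals[np.argsort(first)]
print(aResult)       # [7 3 5]
print(x[aResult])    # [0.5 0.1 0.4]
```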

One potential problem with using np.bincount is that it returns an array whose length is equal to the maximum value in a. If a contains even one element with value near 2**31-1, then bincount would have to allocate an array of size 8*(2**31-1) bytes (or 16GiB).

So np.bincount is likely the fastest solution for arrays a that are long but whose values are not large. For arrays a that are short (with large or small values), using a collections.defaultdict would probably be faster.

Edit: See J.F. Sebastian's solution for a way around the integer-values-only restriction and big-values problem.
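For reference, one common way around the big-values problem (a sketch of the general technique, not necessarily Sebastian's exact code) is to bincount over the compact inverse indices returned by np.unique, which also works for non-integer a:

```python
import numpy as np

def group_sum(a, b):
    # Map each element of a to a compact index 0..k-1, so bincount's
    # output length is the number of distinct values, not max(a);
    # this handles large and non-integer values alike.
    aResult, inverse = np.unique(a, return_inverse=True)
    bResult = np.bincount(inverse, weights=b)
    return aResult, bResult

a = [7, 3, 5, 7, 5, 7]
b = [0.2, 0.1, 0.3, 0.1, 0.1, 0.2]
print(group_sum(a, b))   # (array([3, 5, 7]), array([0.1, 0.4, 0.5]))
```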
