按数组a排序Python,并汇总数组b - 性能 [英] Python group by array a, and summarize array b - Performance
问题描述
a = [7,3,5,7,5 ,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
我想按照以下元素进行分组:
aResult = [7,3,5]
总结b中的元素(用于概括概率密度函数的示例):
bResult = [0.2 + 0.1 + 0.2,0.1,0.3 + 0.1] = [0.5,0.1,0.4]
或者,在python中随机选择a和b:
导入numpy为np
a = np.random.randint(1,10,10000)
b = np.array([1./len(a)] * len(a))
我有两种方法,当然这两种方法都远离性能较低的边界。
方法1(至少好和短):时间:0.769315958023
def approach_2(a,b):
bResult = [sum(b [i == a])for n in np.unique(a)]
aResult = np.unique(a)
方法2(numpy.groupby,非常慢)时间:4.65299129486
def approach_2(a,b):
tmp = [(a [i],b [i])in range(len(a))]
tmp2 = np.array (tmp,dtype = [('a',float),('b',float)])
tmp2 = np.sort(tmp2,order ='a')
bResult = []
aResult = []
for groupby(tmp2,lambda x:x [0])中的键:
aResult.append(key)
bResult.append (sum([i [1] for i in group]))
Update:Approach3,by Pablo 。时间:1.0265750885
def approach_Pablo(a,b):
pdf = defaultdict(int);
for x,y in zip(a,b):
pdf [x] + = y
更新:由Unutbu提供的方法4。时间:0.184849023819 [WINNER SO FAR,但只有一个整数]
def unique_Unutbu(a,b):
pre>
x = np.bincount(a,权重= b)
aResult = np.unique(a)
bResult = x [aResult]
也许有人找到比我更聪明的解决方案来解决这个问题: >
如果
a
由ints< 2 ** 31-1(也就是说,如果a
的值可以适用于dtypeint32
),那么你可以使用np.bincount
加权:
import numpy as np
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
x = np.bincount(a,重量= b)
print(x)
#[0.0 0. 0.1 0. 0.4 0. 0.5]
print(x [[7,3,5 ]])
#[0.5 0.1 0.4]
np。 unique(a)
返回[3 5 7]
,所以结果以不同的顺序出现:print(x [np.unique(a)])
#[0.1 0.4 0.5]
使用
np.bincount
的一个潜在问题是它返回的数组长度等于最大值在a
中。如果a
甚至包含一个值接近2 ** 31-1的元素,则bincount
必须分配一个数组大小8 *(2 ** 31-1)
字节(或16GiB)。 np.bincount 可能是数组a
的最快解决方案,它们有很大的长度,但不是很大的值。对于长度较小(大或小的值)的数组a
,使用collections.defaultdict
可能会更快。
编辑:请参阅 JF塞巴斯蒂安的解决方案,以解决整数值限制和大值问题。
Given two unordered arrays of same lengths a and b:
a = [7,3,5,7,5,7] b = [0.2,0.1,0.3,0.1,0.1,0.2]
I'd like to group by the elements in a:
aResult = [7,3,5]
summing over the elements in b (Example used to summarize a probability density function):
bResult = [0.2 + 0.1 + 0.2, 0.1, 0.3 + 0.1] = [0.5, 0.1, 0.4]
Alternatively, random a and b in python:
import numpy as np a = np.random.randint(1,10,10000) b = np.array([1./len(a)]*len(a))
I have two approaches, which for sure are far from the lower performance boundary. Approach 1 (at least nice and short): Time: 0.769315958023
def approach_2(a,b): bResult = [sum(b[i == a]) for i in np.unique(a)] aResult = np.unique(a)
Approach 2 (numpy.groupby, horribly slow) Time: 4.65299129486
def approach_2(a,b): tmp = [(a[i],b[i]) for i in range(len(a))] tmp2 = np.array(tmp, dtype = [('a', float),('b', float)]) tmp2 = np.sort(tmp2, order='a') bResult = [] aResult = [] for key, group in groupby(tmp2, lambda x: x[0]): aResult.append(key) bResult.append(sum([i[1] for i in group]))
Update: Approach3, by Pablo. Time: 1.0265750885
def approach_Pablo(a,b): pdf = defaultdict(int); for x,y in zip(a,b): pdf[x] += y
Update: Approach 4, by Unutbu. Time: 0.184849023819 [WINNER SO FAR, but a as integer only]
def unique_Unutbu(a,b): x=np.bincount(a,weights=b) aResult = np.unique(a) bResult = x[aResult]
Maybe someone finds a smarter solution to this problem than me :)
解决方案If
a
is composed of ints < 2**31-1 (that is, ifa
has values that can fit in dtypeint32
), then you could usenp.bincount
with weights:import numpy as np a = [7,3,5,7,5,7] b = [0.2,0.1,0.3,0.1,0.1,0.2] x=np.bincount(a,weights=b) print(x) # [ 0. 0. 0. 0.1 0. 0.4 0. 0.5] print(x[[7,3,5]]) # [ 0.5 0.1 0.4]
np.unique(a)
returns[3 5 7]
, so the result appears in a different order:print(x[np.unique(a)]) # [ 0.1 0.4 0.5]
One potential problem with using
np.bincount
is that it returns an array whose length is equal to the maximum value ina
. Ifa
contains even one element with value near 2**31-1, thenbincount
would have to allocate an array of size8*(2**31-1)
bytes (or 16GiB).So
np.bincount
might be the fastest solution for arraysa
which have big length, but not big values. For arraysa
which have small length (and big or small values), using acollections.defaultdict
would probably be faster.Edit: See J.F. Sebastian's solution for a way around the integer-values-only restriction and big-values problem.
这篇关于按数组a排序Python,并汇总数组b - 性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!