Python group by array a, and summarize array b - Performance


Problem Description


Given two unordered arrays of same lengths a and b:

a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]

I'd like to group by the elements in a:

aResult = [7,3,5]

summing over the elements in b (Example used to summarize a probability density function):

bResult = [0.2 + 0.1 + 0.2, 0.1, 0.3 + 0.1] = [0.5, 0.1, 0.4]

Alternatively, random a and b in python:

import numpy as np
a = np.random.randint(1,10,10000)
b = np.array([1./len(a)]*len(a))

I have two approaches, both of which are surely far from the lower performance bound. Approach 1 (at least nice and short). Time: 0.769315958023

def approach_1(a, b):
    aResult = np.unique(a)
    # For each distinct value n, sum the entries of b where a equals n
    # (b[n == a] selects them with a boolean mask).
    bResult = [sum(b[n == a]) for n in aResult]
    return aResult, bResult

Approach 2 (itertools.groupby over a sorted structured array, horribly slow). Time: 4.65299129486

from itertools import groupby

def approach_2(a, b):
    # Pair the arrays, sort by a, then group consecutive equal keys.
    tmp = [(a[i], b[i]) for i in range(len(a))]
    tmp2 = np.array(tmp, dtype=[('a', float), ('b', float)])
    tmp2 = np.sort(tmp2, order='a')

    bResult = []
    aResult = []
    for key, group in groupby(tmp2, lambda x: x[0]):
        aResult.append(key)
        bResult.append(sum(i[1] for i in group))
    return aResult, bResult

Update: Approach 3, by Pablo. Time: 1.0265750885

from collections import defaultdict

def approach_Pablo(a, b):
    # Accumulate the sums in a dict keyed by the values of a.
    pdf = defaultdict(float)
    for x, y in zip(a, b):
        pdf[x] += y
    return pdf

Update: Approach 4, by Unutbu. Time: 0.184849023819 [WINNER SO FAR, but only for integer a]

def unique_Unutbu(a, b):
    # bincount sums the weights in b at each integer value of a.
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult
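The timings quoted above can be reproduced with a harness along these lines (a sketch; the absolute numbers depend on the machine, only the relative ordering is meaningful):

```python
import timeit

import numpy as np

# Same random input as in the question.
a = np.random.randint(1, 10, 10000)
b = np.array([1.0 / len(a)] * len(a))

def unique_Unutbu(a, b):
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    return aResult, x[aResult]

# Time 100 calls; divide by the repeat count for a per-call figure.
elapsed = timeit.timeit(lambda: unique_Unutbu(a, b), number=100)
print(elapsed / 100)
```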

Maybe someone finds a smarter solution to this problem than me :)

Solution

If a is composed of ints < 2**31-1 (that is, if a has values that can fit in dtype int32), then you could use np.bincount with weights:

import numpy as np
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]

x = np.bincount(a, weights=b)
print(x)
# [ 0.   0.   0.   0.1  0.   0.4  0.   0.5]

print(x[[7,3,5]])
# [ 0.5  0.1  0.4]

np.unique(a) returns [3 5 7], so the result appears in a different order:

print(x[np.unique(a)])
# [ 0.1  0.4  0.5]
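If the first-occurrence order [7, 3, 5] from the question is wanted instead of sorted order, one option (an illustration, not part of the original answer) is np.unique's return_index:

```python
import numpy as np

a = np.array([7, 3, 5, 7, 5, 7])
b = np.array([0.2, 0.1, 0.3, 0.1, 0.1, 0.2])

x = np.bincount(a, weights=b)
# return_index gives the position of each distinct value's first
# occurrence; argsort over those positions restores input order.
vals, first = np.unique(a, return_index=True)
aResult = vals[np.argsort(first)]
print(aResult)       # [7 3 5]
print(x[aResult])    # [0.5 0.1 0.4]
```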

One potential problem with using np.bincount is that it returns an array whose length is equal to the maximum value in a. If a contains even one element with value near 2**31-1, then bincount would have to allocate an array of size 8*(2**31-1) bytes (or 16GiB).

So np.bincount is likely the fastest solution for arrays a that are long but whose values are not large. For arrays a that are short (with large or small values), using a collections.defaultdict would probably be faster.

Edit: See J.F. Sebastian's solution for a way around the integer-values-only restriction and big-values problem.
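For reference, one common way around the big-values problem (a sketch of the general technique, not necessarily Sebastian's exact code) is to bincount over the compact inverse indices returned by np.unique, which also works for non-integer a:

```python
import numpy as np

def group_sum(a, b):
    # Map each element of a to a compact index 0..k-1, so bincount's
    # output length is the number of distinct values, not max(a);
    # this handles large and non-integer values alike.
    aResult, inverse = np.unique(a, return_inverse=True)
    bResult = np.bincount(inverse, weights=b)
    return aResult, bResult

a = [7, 3, 5, 7, 5, 7]
b = [0.2, 0.1, 0.3, 0.1, 0.1, 0.2]
print(group_sum(a, b))   # (array([3, 5, 7]), array([0.1, 0.4, 0.5]))
```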
