Fastest way to sort a large number of arrays in Python


Question

I am trying to sort a large number of arrays in Python. I need to sort over 11 million arrays at once.

It would also be nice if I could directly get the indices that would sort each array.

That is why I am currently using numpy.argsort(), but that's too slow on my machine (it takes over an hour to run).

The same operation in R takes about 15 minutes on the same machine.

Can anyone tell me a faster way to do this in Python?

Thanks

Edit: adding an example.

If I have the following dataframe:

agg:

x      y        w        z  

1      2        2        5                 
1      2        6        7         
3      4        3        3        
5      4        7        8    
3      4        2        5    
5      9        9        9    

I am running the following function and command on it:

import numpy as np

def function(group):
    z = group['z'].values
    w = group['w'].values
    # Order w by descending z and keep the top 7, in case a group has many rows
    func = w[np.argsort(z)[::-1]][:7]
    return np.array_str(func)[1:-1]

output = agg.groupby(['x', 'y']).apply(function).reset_index()

So my output dataframe will look like this:

output:

x   y   w   

1   2   6,2    
3   4   2,3    
5   4   7    
5   9   9

Recommended answer

For cases like this, where you are only interested in partially sorted indices, there's NumPy's argpartition.

The troublesome np.argsort appears in w[np.argsort(z)[::-1]][:7], which is essentially w[idx], where idx = np.argsort(z)[::-1][:7].

So idx could be calculated with np.argpartition, like so:

idx = np.argpartition(-z,np.arange(7))[:7]

The -z is needed because np.argpartition works in ascending order by default, so to get the largest values first we negate the elements. Passing np.arange(7) as the kth argument places each of the first 7 positions in its final sorted position, which is why the first 7 indices come out already ordered.
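As a small illustration of this behaviour, the sketch below uses a made-up array and a cutoff of k = 3 rather than 7; both are assumptions chosen only to keep the example readable.

```python
import numpy as np

# Made-up data for illustration; all values are distinct, so there are no ties.
z = np.array([5, 1, 9, 3, 7, 2, 8])
k = 3  # hypothetical cutoff; the question uses 7

# Negating z turns "k largest" into "k smallest", and passing a sequence
# of kth values puts each of the first k positions in sorted order.
idx = np.argpartition(-z, np.arange(k))[:k]

print(idx)     # [2 6 4]
print(z[idx])  # [9 8 7]
```

The elements after position k - 1 are left in no particular order, which is where the saving over a full argsort comes from.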

Thus, the proposed change to the original code would be:

func = w[np.argpartition(-z,np.arange(7))[:7]]
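One caveat when dropping this line into the grouped function from the question: np.argpartition raises a ValueError if any kth value is out of bounds, and the example groups there have only one or two rows. A minimal sketch of a guard follows; the helper name top_k_by and its k parameter are hypothetical and not part of the original answer.

```python
import numpy as np

def top_k_by(z, w, k=7):
    """Hypothetical helper: return up to k values of w, ordered by descending z."""
    # Clamp k so that every kth index passed to argpartition is in bounds.
    k = min(k, len(z))
    if k == 0:
        return w[:0]  # empty group: nothing to select
    idx = np.argpartition(-z, np.arange(k))[:k]
    return w[idx]

# The (x=1, y=2) group from the question: z = [5, 7], w = [2, 6]
print(top_k_by(np.array([5, 7]), np.array([2, 6])))  # [6 2]
```

Inside the question's function, this would stand in for the line that computes func.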

Runtime test:

In [162]: z = np.random.randint(0,10000000,(1100000)) # Random int array

In [163]: idx1 = np.argsort(z)[::-1][:7]
     ...: idx2 = np.argpartition(-z,np.arange(7))[:7]
     ...: 

In [164]: np.allclose(idx1,idx2) # Verify results
Out[164]: True

In [165]: %timeit np.argsort(z)[::-1][:7]
1 loops, best of 3: 264 ms per loop

In [166]: %timeit np.argpartition(-z,np.arange(7))[:7]
10 loops, best of 3: 36.5 ms per loop
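These timings are for a single large array. In the question the work is spread over millions of small groups, where the per-group Python call made by groupby(...).apply may matter as much as the sort itself. As an alternative that is not part of the answer above, one could sort the whole dataframe once and then keep the first 7 rows of each group. The sketch below rebuilds the example dataframe from the question and returns the result in long format (one row per kept value) rather than as joined strings, which is an assumption about what output is acceptable.

```python
import pandas as pd

# Example dataframe from the question.
agg = pd.DataFrame({
    "x": [1, 1, 3, 5, 3, 5],
    "y": [2, 2, 4, 4, 4, 9],
    "w": [2, 6, 3, 7, 2, 9],
    "z": [5, 7, 3, 8, 5, 9],
})

# Sort once: groups in ascending (x, y) order, z descending within each group.
ordered = agg.sort_values(["x", "y", "z"], ascending=[True, True, False])

# Keep at most the first 7 rows of each group; head() preserves the row order.
top = ordered.groupby(["x", "y"]).head(7)

print(top[["x", "y", "w"]])
```

Whether this is actually faster on 11 million groups would need to be measured on the real data.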
