Fastest way to sort a large number of arrays in python

Views: 856 · Published: 2020/5/18 20:28:10 · Tags: python performance sorting numpy pandas


Question

I am trying to sort a large number of arrays in Python. I need to sort more than 11 million arrays at once.

Also, it would be nice if I could directly get the indices that would sort each array.

That is why I am currently using numpy.argsort(), but that is too slow on my machine (it takes over an hour to run).

The same operation in R takes about 15 minutes on the same machine.

Can anyone tell me a faster way to do this in Python?

Thanks

Edit:

Adding an example.

If I have the following dataframe:

agg:

x      y      w      z
1      2      2      5
1      2      6      7
3      4      3      3
5      4      7      8
3      4      2      5
5      9      9      9

I am running the following function and command on it:

import numpy as np

def function(group):
    z = group['z'].values
    w = group['w'].values
    # Order w by descending z and keep the top 7 (a group may have many rows)
    func = w[np.argsort(z)[::-1]][:7]
    return np.array_str(func)[1:-1]  # strip the surrounding brackets

output = agg.groupby(['x', 'y']).apply(function).reset_index()

So my output dataframe will look like this:

output:

x   y   w
1   2   6,2
3   4   2,3
5   4   7
5   9   9

Recommended answer

For cases like this, where you are only interested in partially sorted indices, there is NumPy's argpartition.

The troublesome np.argsort appears in w[np.argsort(z)[::-1]][:7], which is essentially w[idx], where idx = np.argsort(z)[::-1][:7].

So idx could be calculated with np.argpartition, like so:

idx = np.argpartition(-z,np.arange(7))[:7]

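The -z is needed because np.argpartition works in ascending order by default, so negating the elements reverses the order and the largest values come first. Passing np.arange(7) as kth also places each of the first 7 positions in its final sorted position, which is why slicing with [:7] gives the top 7 in descending order.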
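As a small check of the reasoning above, the following sketch compares the full sort with the partial sort on illustrative data. The seed, array size, and variable names are assumptions made for this example only; the values are distinct so that ties cannot change the index order.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, illustrative data only
z = rng.permutation(1000)        # the values 0..999 in random order, no ties
k = 7

# Full sort: indices of the k largest values, largest first.
idx_sort = np.argsort(z)[::-1][:k]

# Partial sort: kth=np.arange(k) puts positions 0..k-1 of -z into their
# final sorted positions; everything after them is left in arbitrary order.
idx_part = np.argpartition(-z, np.arange(k))[:k]

print(z[idx_sort])   # [999 998 997 996 995 994 993]
print(z[idx_part])   # [999 998 997 996 995 994 993]
```

With tied values the two approaches can return different, equally valid indices, so comparing the selected values is a safer check than comparing the indices.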

Thus, the proposed change in the original code would be:

func = w[np.argpartition(-z,np.arange(7))[:7]]
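One caveat when applying this per group: np.argpartition raises a ValueError when a kth value is not smaller than the array length, and the groups in the question's example hold only one or two rows. A minimal sketch of a guard, assuming a hypothetical helper named top_k_by (not part of the original answer) and a signed integer or float dtype for z:

```python
import numpy as np

def top_k_by(z, w, k=7):
    """Return w at the positions of the k largest z values, largest z first.

    Hypothetical helper for illustration; assumes z has a signed or float
    dtype, since negating an unsigned array would wrap around.
    """
    z = np.asarray(z)
    w = np.asarray(w)
    k = min(k, z.size)           # argpartition rejects kth >= len(z)
    if k == 0:
        return w[:0]             # empty group: nothing to select
    idx = np.argpartition(-z, np.arange(k))[:k]
    return w[idx]

# Group (x=1, y=2) from the question: z = [5, 7], w = [2, 6]
print(top_k_by([5, 7], [2, 6]))  # [6 2]
```

Inside the question's function, this would stand in for the line that computes func, for example func = top_k_by(z, w).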

Runtime test:

In [162]: z = np.random.randint(0,10000000,(1100000)) # Random int array

In [163]: idx1 = np.argsort(z)[::-1][:7]
     ...: idx2 = np.argpartition(-z,np.arange(7))[:7]
     ...: 

In [164]: np.allclose(idx1,idx2) # Verify results
Out[164]: True

In [165]: %timeit np.argsort(z)[::-1][:7]
1 loops, best of 3: 264 ms per loop

In [166]: %timeit np.argpartition(-z,np.arange(7))[:7]
10 loops, best of 3: 36.5 ms per loop
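The timing above covers a single large array. With millions of small groups, the per-group Python call made by groupby().apply() may matter more than the sort itself. The sketch below is a different technique from the answer's argpartition approach: it sorts the whole frame once and then takes the first rows of each group. It has not been timed here, and it rebuilds the question's example frame only for illustration.

```python
import pandas as pd

# The example dataframe from the question.
agg = pd.DataFrame({
    "x": [1, 1, 3, 5, 3, 5],
    "y": [2, 2, 4, 4, 4, 9],
    "w": [2, 6, 3, 7, 2, 9],
    "z": [5, 7, 3, 8, 5, 9],
})

# Sort once so that, within each (x, y) group, rows run from largest z down.
ordered = agg.sort_values(["x", "y", "z"], ascending=[True, True, False])

# Keep at most 7 rows per group.
top = ordered.groupby(["x", "y"]).head(7)

# Join the surviving w values into one comma-separated string per group.
top = top.assign(w=top["w"].astype(str))
output = top.groupby(["x", "y"])["w"].agg(",".join).reset_index()
print(output)
```

On the example data this gives the same groups and w strings as the output table in the question. When z contains ties within a group, the order of the tied rows may differ from the argsort version.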
