Fastest way to sort a large number of arrays in Python
Question
I am trying to sort a large number of arrays in Python. I need to sort over 11 million arrays at once.
Also, it would be nice if I could directly get the indices that would sort each array.
That is why I am currently using numpy.argsort(), but that is too slow on my machine (it takes over an hour to run).
The same operation in R takes about 15 minutes on the same machine.
Can anyone tell me a faster way to do this in Python?
Thanks
Edit: adding an example.
If I have the following dataframe:
agg:
x y w z
1 2 2 5
1 2 6 7
3 4 3 3
5 4 7 8
3 4 2 5
5 9 9 9
I am running the following function and command on it:
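import numpy as np

def function(group):
    z = group['z'].values
    w = group['w'].values
    # Order w by descending z; keep only the top 7 in case there are many
    func = w[np.argsort(z)[::-1]][:7]
    # Strip the surrounding brackets from the array's string form
    return np.array_str(func)[1:-1]

output = agg.groupby(['x', 'y']).apply(function).reset_index()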
So my output dataframe will look like this:
output:
x y w
1 2 6,2
3 4 2,3
5 4 7
5 9 9
Recommended answer
Well, for cases like this, where you are only interested in partially sorted indices, there's NumPy's argpartition.
The troublesome np.argsort is in w[np.argsort(z)[::-1]][:7], which is essentially w[idx], where idx = np.argsort(z)[::-1][:7].
So, idx could be calculated with np.argpartition, like so:
idx = np.argpartition(-z,np.arange(7))[:7]
The -z is needed because np.argpartition by default tries to get sorted indices in ascending order. So, to reverse the order, we negate the elements. Passing np.arange(7) as the kth argument places each of the first 7 positions in its final sorted position, so the slice [:7] comes out already ordered.
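To make the effect of the negation and of the sequence-valued kth concrete, here is a small sketch on a made-up five-element array (the values and k = 3 are illustrative assumptions, not data from the question). With distinct values, it should select the same indices as the full argsort:

```python
import numpy as np

# Made-up example values; all distinct, so the ordering is unambiguous
z = np.array([5, 1, 9, 3, 7])
k = 3

# Negate so that the largest values of z become the smallest of -z;
# kth=np.arange(k) puts positions 0..k-1 into their sorted places
idx = np.argpartition(-z, np.arange(k))[:k]

# Reference result from a full sort, as in the question
idx_full = np.argsort(z)[::-1][:k]

print(idx)       # indices of the 3 largest values, largest first
print(z[idx])    # the 3 largest values in descending order
```

When z contains ties, the two approaches may order equal elements differently, so comparing the selected values is safer than comparing the indices.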
Thus, the proposed change in the original code would be:
func = w[np.argpartition(-z,np.arange(7))[:7]]
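One caveat when applying this per group: as far as I know, np.argpartition raises a ValueError when any kth value is out of bounds, which happens for groups with fewer than 7 rows (every group in the question's example has only one or two). A minimal sketch of a guard follows; the helper name top_k_indices is hypothetical, not part of NumPy or of the original answer:

```python
import numpy as np

def top_k_indices(z, k=7):
    """Indices of the k largest entries of z, largest first.

    Hypothetical helper: falls back to a full argsort for short arrays,
    where np.argpartition(-z, np.arange(k)) would have kth out of bounds.
    """
    z = np.asarray(z)
    if z.size <= k:
        # Short array: a full sort is cheap and always valid
        return np.argsort(z)[::-1]
    # Long array: only the first k positions are put in sorted order
    return np.argpartition(-z, np.arange(k))[:k]

# Usage on one group from the question's example (x=1, y=2)
z = np.array([5, 7])
w = np.array([2, 6])
print(w[top_k_indices(z)])
```

Inside the grouped function, the line computing func would then become w[top_k_indices(z)].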
Runtime test:
In [162]: z = np.random.randint(0,10000000,(1100000)) # Random int array
In [163]: idx1 = np.argsort(z)[::-1][:7]
...: idx2 = np.argpartition(-z,np.arange(7))[:7]
...:
In [164]: np.allclose(idx1,idx2) # Verify results
Out[164]: True
In [165]: %timeit np.argsort(z)[::-1][:7]
1 loops, best of 3: 264 ms per loop
In [166]: %timeit np.argpartition(-z,np.arange(7))[:7]
10 loops, best of 3: 36.5 ms per loop
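The timing above is for a single large array. With millions of small groups, the per-group Python call in groupby(...).apply(...) may dominate instead. The following is an alternative sketch, not part of the answer above, that sorts all rows once with np.lexsort and then keeps the first k rows of each group. It assumes the columns are available as plain NumPy arrays and reuses the question's example data; the variable names are illustrative:

```python
import numpy as np

# The question's example dataframe, as separate column arrays
x = np.array([1, 1, 3, 5, 3, 5])
y = np.array([2, 2, 4, 4, 4, 9])
w = np.array([2, 6, 3, 7, 2, 9])
z = np.array([5, 7, 3, 8, 5, 9])
k = 7

# lexsort uses the LAST key as the primary key: sort by x, then y,
# then by descending z within each (x, y) group
order = np.lexsort((-z, y, x))
xs, ys, ws = x[order], y[order], w[order]

# True at the first row of every (x, y) group
new_group = np.r_[True, (xs[1:] != xs[:-1]) | (ys[1:] != ys[:-1])]
starts = np.flatnonzero(new_group)

# Rank of each row inside its group (0 = largest z)
group_id = np.cumsum(new_group) - 1
rank = np.arange(order.size) - starts[group_id]

# Keep the top k rows per group and split them back into one array per group
keep = rank < k
top_w = np.split(ws[keep], np.flatnonzero(new_group[keep])[1:])
print([a.tolist() for a in top_w])
```

Whether this is actually faster than the argpartition approach depends on the group sizes and would need to be measured on the real data.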