Ranking of numpy array with possible duplicates

Problem description

I have a numpy array of floats/ints and want to map its elements to their ranks.

If the array has no duplicates, the problem can be solved by the following code:

In [49]: a1
Out[49]: array([ 0.1,  5.1,  2.1,  3.1,  4.1,  1.1,  6.1,  8.1,  7.1,  9.1])

In [50]: a1.argsort().argsort()
Out[50]: array([0, 5, 2, 3, 4, 1, 6, 8, 7, 9])

Now I want to extend this method to arrays with possible duplicates, so that duplicates are mapped to the same value. For example, I want the array

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])

to be mapped to either

0 1 4 5 6 1 7 8 8 1

or to

0 3 4 5 6 3 7 9 9 3

or to

0 2 4 5 6 2 7 8.5 8.5 2

In the first and second cases, duplicates are mapped to the minimum and maximum rank, respectively, among the ranks they would receive from a plain a2.argsort().argsort(). The third case is the average of the first two.

Any suggestions?

Edit (efficiency requirements)

In the initial description I forgot to mention the time requirements. I am looking for a solution in terms of numpy/scipy functions that avoids "pure python overhead". To make this clear, consider the solution proposed by Richard, which does solve the problem but is quite slow:

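def argsortdup(a1):
    # Rank each element by its insertion position in the sorted array,
    # so duplicates all receive the minimum rank of their group.
    sorted_a = np.sort(a1)  # avoid shadowing the built-in sorted()
    ranked = []
    for item in a1:  # per-element Python loop: this is the slow part
        ranked.append(sorted_a.searchsorted(item))
    return np.array(ranked)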

In [86]: a2 = np.array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])

In [87]: %timeit a2.argsort().argsort()
1000000 loops, best of 3: 1.55 us per loop

In [88]: %timeit argsortdup(a2)
10000 loops, best of 3: 25.6 us per loop

In [89]: a = np.arange(0.1, 1000.1)

In [90]: %timeit a.argsort().argsort()
10000 loops, best of 3: 24.5 us per loop

In [91]: %timeit argsortdup(a)
1000 loops, best of 3: 1.14 ms per loop

In [92]: a = np.arange(0.1, 10000.1)

In [93]: %timeit a.argsort().argsort()
1000 loops, best of 3: 303 us per loop

In [94]: %timeit argsortdup(a)
100 loops, best of 3: 11.9 ms per loop

The timings above show that argsortdup is roughly 15 to 50 times slower than a.argsort().argsort() (about 17x, 47x and 39x for the three array sizes). The main reason is the use of Python loops and lists.

Recommended answer

After upgrading to the latest version of scipy, as suggested by @WarrenWeckesser in the comments, scipy.stats.rankdata seems to be faster than both scipy.stats.mstats.rankdata and np.searchsorted, making it the fastest way to do this on larger arrays.
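For reference, here is a minimal sketch of how scipy.stats.rankdata could produce the three rankings asked for in the question. It assumes a scipy version in which rankdata accepts the method argument, and it subtracts 1 because rankdata returns 1-based ranks; the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import rankdata

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])

# rankdata returns 1-based ranks; subtract 1 to match the 0-based
# ranks produced by argsort().argsort().
rank_min = rankdata(a2, method='min') - 1      # ties share the lowest rank
rank_max = rankdata(a2, method='max') - 1      # ties share the highest rank
rank_avg = rankdata(a2, method='average') - 1  # ties share the mean rank

print(rank_min)
print(rank_max)
print(rank_avg)
```

The 'average' method is rankdata's default, so the third call would behave the same without the method argument.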

In [1]: import numpy as np

In [2]: from scipy.stats import rankdata as rd
   ...: from scipy.stats.mstats import rankdata as rd2
   ...: 

In [3]: array = np.arange(0.1, 1000000.1)

In [4]: %timeit np.searchsorted(np.sort(array), array)
1 loops, best of 3: 385 ms per loop

In [5]: %timeit rd(array)
10 loops, best of 3: 109 ms per loop

In [6]: %timeit rd2(array)
1 loops, best of 3: 205 ms per loop
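The np.searchsorted expression timed above gives the minimum-rank variant. If scipy is not available, the other two variants can probably be obtained the same way by also searching from the right. The following is only a sketch: rank_with_ties is a hypothetical helper name, not part of numpy, and it has not been benchmarked against the timings above.

```python
import numpy as np

def rank_with_ties(a, method="min"):
    """Hypothetical helper: 0-based ranks, with ties resolved by `method`."""
    a = np.asarray(a)
    s = np.sort(a)
    # Index of the first occurrence of each value in the sorted array.
    lo = np.searchsorted(s, a, side="left")
    # Index of the last occurrence of each value in the sorted array.
    hi = np.searchsorted(s, a, side="right") - 1
    if method == "min":
        return lo
    if method == "max":
        return hi
    if method == "average":
        return (lo + hi) / 2.0
    raise ValueError(f"unknown method: {method!r}")

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
print(rank_with_ties(a2, "min"))
print(rank_with_ties(a2, "max"))
print(rank_with_ties(a2, "average"))
```

This avoids the Python-level loop of argsortdup by passing the whole array to np.searchsorted in one call.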
