Efficient method to calculate the rank vector of a list in Python

Problem description

I'm looking for an efficient way to calculate the rank vector of a list in Python, similar to R's rank function. In a simple list with no ties between the elements, element i of the rank vector of a list l should be x if and only if l[i] is the x-th element in the sorted list. This is simple so far; the following code snippet does the trick:

def rank_simple(vector):
    return sorted(range(len(vector)), key=vector.__getitem__)
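
One caveat worth spelling out: as written, the snippet returns the indices that would sort the list (an argsort), which matches the rank vector only for some inputs. A minimal sketch of turning that ordering into 0-based ranks by inverting the permutation, assuming no ties; the names argsort and rank_no_ties are hypothetical and not part of the original question:

```python
def argsort(vector):
    """Return the indices that would sort `vector` (what the snippet above computes)."""
    return sorted(range(len(vector)), key=vector.__getitem__)


def rank_no_ties(vector):
    """Return the 0-based rank of each element, assuming all values are distinct."""
    ranks = [0] * len(vector)
    # The element at original index `index` lands at `position` in sorted order,
    # so `position` is its rank.
    for position, index in enumerate(argsort(vector)):
        ranks[index] = position
    return ranks


data = [30, 10, 20]
print(argsort(data))       # [1, 2, 0]
print(rank_no_ties(data))  # [2, 0, 1]
```

For [30, 10, 20] the two results differ, which is why the inversion step matters whenever the ranks themselves are needed.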

Things get complicated, however, if the original list has ties (i.e. multiple elements with the same value). In that case, all the elements having the same value should have the same rank, which is the average of their ranks obtained using the naive method above. So, for instance, if I have [1, 2, 3, 3, 3, 4, 5], the naive ranking gives me [0, 1, 2, 3, 4, 5, 6], but what I would like to have is [0, 1, 3, 3, 3, 5, 6]. What would be the most efficient way to do this in Python?

Footnote: I don't know whether NumPy already has a method to achieve this; if it does, please let me know, but I would be interested in a pure Python solution anyway, as I'm developing a tool which should work without NumPy as well.

Recommended answer

Using SciPy, the function you are looking for is scipy.stats.rankdata:

In [13]: import scipy.stats as ss
In [19]: ss.rankdata([3, 1, 4, 15, 92])
Out[19]: array([ 2.,  1.,  3.,  4.,  5.])

In [20]: ss.rankdata([1, 2, 3, 3, 3, 4, 5])
Out[20]: array([ 1.,  2.,  4.,  4.,  4.,  6.,  7.])

The ranks start at 1, rather than 0 (as in your example), but then again, that's the way R's rank function works as well.

Here is a pure-Python equivalent of SciPy's rankdata function:

def rank_simple(vector):
    # Indices that would sort the vector (an argsort).
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a):
    n = len(a)
    ivec = rank_simple(a)
    svec = [a[rank] for rank in ivec]  # values in sorted order
    sumranks = 0
    dupcount = 0
    newarray = [0] * n
    for i in range(n):
        sumranks += i
        dupcount += 1
        # At the end of a run of equal values, assign the averaged 1-based rank.
        if i == n - 1 or svec[i] != svec[i + 1]:
            averank = sumranks / float(dupcount) + 1
            for j in range(i - dupcount + 1, i + 1):
                newarray[ivec[j]] = averank
            sumranks = 0
            dupcount = 0
    return newarray

print(rankdata([3, 1, 4, 15, 92]))
# [2.0, 1.0, 3.0, 4.0, 5.0]
print(rankdata([1, 2, 3, 3, 3, 4, 5]))
# [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]
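
If the 0-based ranks from the question's example are wanted instead, the same averaging idea can also be sketched with itertools.groupby from the standard library. This is an illustrative variant rather than part of the original answer; the function name rank_average is hypothetical, and Python 3 division is assumed:

```python
from itertools import groupby


def rank_average(vector):
    """Return 0-based ranks, giving tied values the average of their positions."""
    # Indices of `vector` in sorted order; equal values end up adjacent.
    order = sorted(range(len(vector)), key=vector.__getitem__)
    ranks = [0.0] * len(vector)
    position = 0  # sorted position of the first element in the current run
    for _, group in groupby(order, key=vector.__getitem__):
        indices = list(group)
        # A run occupies positions position .. position + len(indices) - 1,
        # so its average position is the midpoint of that interval.
        average = position + (len(indices) - 1) / 2
        for index in indices:
            ranks[index] = average
        position += len(indices)
    return ranks


print(rank_average([1, 2, 3, 3, 3, 4, 5]))
# [0.0, 1.0, 3.0, 3.0, 3.0, 5.0, 6.0]
```

Adding 1 to each result would give the 1-based ranks that rankdata and R's rank produce.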
