Efficient method to calculate the rank vector of a list in Python

Question

I'm looking for an efficient way to calculate the rank vector of a list in Python, similar to R's rank function. In a simple list with no ties between the elements, element i of the rank vector of a list l should be x if and only if l[i] is the x-th element in the sorted list. This is simple so far; the following code snippet does the trick:

def rank_simple(vector):
    return sorted(range(len(vector)), key=vector.__getitem__)
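One caveat worth checking: as written, the snippet returns the indices that would sort the list (an argsort), which coincides with the rank vector only for some inputs. The following is a minimal sketch of inverting that ordering to obtain 0-based ranks for a list without ties; `rank_from_order` is a hypothetical helper name that does not appear in the original post.

```python
def rank_simple(vector):
    # Indices of vector in sorted order (an argsort).
    return sorted(range(len(vector)), key=vector.__getitem__)

def rank_from_order(vector):
    # Hypothetical helper: invert the argsort so that ranks[i] is the
    # 0-based position of vector[i] in the sorted list (no ties assumed).
    order = rank_simple(vector)
    ranks = [0] * len(vector)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks

print(rank_simple([30, 10, 20]))      # [1, 2, 0]
print(rank_from_order([30, 10, 20]))  # [2, 0, 1]
```

For inputs such as [3, 1, 4, 15, 92] the two results happen to be identical, which makes the difference easy to miss.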

Things get complicated, however, if the original list has ties (i.e. multiple elements with the same value). In that case, all the elements having the same value should have the same rank, which is the average of their ranks obtained using the naive method above. So, for instance, if I have [1, 2, 3, 3, 3, 4, 5], the naive ranking gives me [0, 1, 2, 3, 4, 5, 6], but what I would like to have is [0, 1, 3, 3, 3, 5, 6]. What would be the most efficient way to do this in Python?

Footnote: I don't know if NumPy already has a method to achieve this or not; if it does, please let me know, but I would be interested in a pure Python solution anyway as I'm developing a tool which should work without NumPy as well.

Recommended answer

Using SciPy, the function you are looking for is scipy.stats.rankdata:

In [13]: import scipy.stats as ss
In [19]: ss.rankdata([3, 1, 4, 15, 92])
Out[19]: array([ 2.,  1.,  3.,  4.,  5.])

In [20]: ss.rankdata([1, 2, 3, 3, 3, 4, 5])
Out[20]: array([ 1.,  2.,  4.,  4.,  4.,  6.,  7.])

The ranks start at 1, rather than 0 (as in your example), but then again, that's the way R's rank function works as well.

Here is a pure-Python equivalent of SciPy's rankdata function:

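def rank_simple(vector):
    # Indices of vector in sorted order (an argsort).
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a):
    n = len(a)
    ivec = rank_simple(a)              # indices of a in sorted order
    svec = [a[rank] for rank in ivec]  # values of a in sorted order
    sumranks = 0
    dupcount = 0
    newarray = [0] * n
    for i in range(n):  # range instead of xrange, so this runs on Python 3
        sumranks += i
        dupcount += 1
        # End of a run of equal values: give the whole run its average rank.
        if i == n - 1 or svec[i] != svec[i + 1]:
            averank = sumranks / float(dupcount) + 1  # +1 makes ranks 1-based
            for j in range(i - dupcount + 1, i + 1):
                newarray[ivec[j]] = averank
            sumranks = 0
            dupcount = 0
    return newarray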

print(rankdata([3, 1, 4, 15, 92]))
# [2.0, 1.0, 3.0, 4.0, 5.0]
print(rankdata([1, 2, 3, 3, 3, 4, 5]))
# [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]
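If 0-based ranks are wanted, as in the question's example, the same idea can be sketched with itertools.groupby over the sorted indices. This is an alternative sketch rather than the answer's own code, and `rank_average_zero_based` is a hypothetical name chosen for illustration.

```python
from itertools import groupby

def rank_average_zero_based(values):
    # Hypothetical helper: 0-based ranks, with tied values sharing the
    # average of the positions they occupy in the sorted order.
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    position = 0  # 0-based position of the first element of the current run
    for _, group in groupby(order, key=values.__getitem__):
        indices = list(group)
        # A run occupies positions position .. position + len(indices) - 1,
        # so its average rank is the midpoint of that range.
        average = position + (len(indices) - 1) / 2.0
        for index in indices:
            ranks[index] = average
        position += len(indices)
    return ranks

print(rank_average_zero_based([1, 2, 3, 3, 3, 4, 5]))
# [0.0, 1.0, 3.0, 3.0, 3.0, 5.0, 6.0]
```

The cost is dominated by the sort, so this sketch runs in O(n log n) time, the same as the loop-based version above.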
