Map each list value to its corresponding percentile

Question

I'd like to create a function that takes a (sorted) list as its argument and outputs a list containing each element's corresponding percentile.

For example, fn([1,2,3,4,17]) returns [0.0, 0.25, 0.50, 0.75, 1.00].

Can anyone please either:

  1. Help me correct my code below? OR
  2. Offer a better alternative than my code for mapping values in a list to their corresponding percentiles?

My current code:

def median(mylist):
    length = len(mylist)
    if not length % 2:
        return (mylist[length / 2] + mylist[length / 2 - 1]) / 2.0
    return mylist[length / 2]

###############################################################################
# PERCENTILE FUNCTION
###############################################################################

def percentile(x):
    """
    Find the corresponding percentile of each value relative to a list of values.
    where x is the list of values
    Input list should already be sorted!
    """

    # sort the input list
    # list_sorted = x.sort()

    # count the number of elements in the list
    list_elementCount = len(x)

    #obtain set of values from list

    listFromSetFromList = list(set(x))

    # count the number of unique elements in the list
    list_uniqueElementCount = len(set(x))

    # define extreme quantiles
    percentileZero    = min(x)
    percentileHundred = max(x)

    # define median quantile
    mdn = median(x) 

    # create empty list to hold percentiles
    x_percentile = [0.00] * list_elementCount 

    # initialize unique count
    uCount = 0

    for i in range(list_elementCount):
        if x[i] == percentileZero:
            x_percentile[i] = 0.00
        elif x[i] == percentileHundred:
            x_percentile[i] = 1.00
        elif x[i] == mdn:
            x_percentile[i] = 0.50 
        else:
            subList_elementCount = 0
            for j in range(i):
                if x[j] < x[i]:
                    subList_elementCount = subList_elementCount + 1 
            x_percentile[i] = float(subList_elementCount / list_elementCount)
            #x_percentile[i] = float(len(x[x > listFromSetFromList[uCount]]) / list_elementCount)
            if i == 0:
                continue
            else:
                if x[i] == x[i-1]:
                    continue
                else:
                    uCount = uCount + 1
    return x_percentile

Currently, if I submit percentile([1,2,3,4,17]), the list [0.0, 0.0, 0.5, 0.0, 1.0] is returned.

Solution

I think your example input/output does not correspond to typical ways of calculating percentile. If you calculate the percentile as "proportion of data points strictly less than this value", then the top value should be 0.8 (since 4 of 5 values are less than the largest one). If you calculate it as "percent of data points less than or equal to this value", then the bottom value should be 0.2 (since 1 of 5 values equals the smallest one). Thus the percentiles would be [0, 0.2, 0.4, 0.6, 0.8] or [0.2, 0.4, 0.6, 0.8, 1]. Your definition seems to be "the number of data points strictly less than this value, considered as a proportion of the number of data points not equal to this value", but in my experience this is not a common definition (see for instance Wikipedia).
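
To make those two definitions concrete, here is a minimal pure-Python sketch (my own illustration, not code from the question or from scipy; it is quadratic and only meant to show the definitions):

def strict_percentiles(values):
    # proportion of data points strictly less than each value
    n = float(len(values))
    return [sum(v < x for v in values) / n for x in values]

def weak_percentiles(values):
    # proportion of data points less than or equal to each value
    n = float(len(values))
    return [sum(v <= x for v in values) / n for x in values]

print(strict_percentiles([1, 2, 3, 4, 17]))   # [0.0, 0.2, 0.4, 0.6, 0.8]
print(weak_percentiles([1, 2, 3, 4, 17]))     # [0.2, 0.4, 0.6, 0.8, 1.0]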

With the typical percentile definitions, the percentile of a data point is equal to its rank divided by the number of data points. (See for instance this question on Stats SE asking how to do the same thing in R.) Differences in how to compute the percentile amount to differences in how to compute the rank (for instance, how to rank tied values). The scipy.stats.percentileofscore function provides four ways of computing percentiles:

>>> from scipy import stats
>>> x = [1, 1, 2, 2, 17]
>>> [stats.percentileofscore(x, a, 'rank') for a in x]
[30.0, 30.0, 70.0, 70.0, 100.0]
>>> [stats.percentileofscore(x, a, 'weak') for a in x]
[40.0, 40.0, 80.0, 80.0, 100.0]
>>> [stats.percentileofscore(x, a, 'strict') for a in x]
[0.0, 0.0, 40.0, 40.0, 80.0]
>>> [stats.percentileofscore(x, a, 'mean') for a in x]
[20.0, 20.0, 60.0, 60.0, 90.0]

(I used a dataset containing ties to illustrate what happens in such cases.)

The "rank" method assigns tied groups a rank equal to the average of the ranks they would cover (i.e., a three-way tie for 2nd place gets a rank of 3 because it "takes up" ranks 2, 3 and 4). The "weak" method assigns a percentile based on the proportion of data points less than or equal to a given point; "strict" is the same but counts proportion of points strictly less than the given point. The "mean" method is the average of the latter two.

As Kevin H. Lin noted, calling percentileofscore in a loop is inefficient since it has to recompute the ranks on every pass. However, these percentile calculations can be easily replicated using different ranking methods provided by scipy.stats.rankdata, letting you calculate all the percentiles at once:

>>> from scipy import stats
>>> stats.rankdata(x, "average")/len(x)
array([ 0.3,  0.3,  0.7,  0.7,  1. ])
>>> stats.rankdata(x, 'max')/len(x)
array([ 0.4,  0.4,  0.8,  0.8,  1. ])
>>> (stats.rankdata(x, 'min')-1)/len(x)
array([ 0. ,  0. ,  0.4,  0.4,  0.8])

In the last case the ranks are adjusted down by one to make them start from 0 instead of 1. (I've omitted "mean", but it could easily be obtained by averaging the results of the latter two methods.)
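
To spell that last remark out (this line is my addition, reusing the same x), averaging the 'max' result and the adjusted 'min' result reproduces the "mean" percentiles from the earlier session:

>>> (stats.rankdata(x, 'max')/len(x) + (stats.rankdata(x, 'min')-1)/len(x)) / 2
array([ 0.2,  0.2,  0.6,  0.6,  0.9])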

I did some timings. With small data such as that in your example, using rankdata is somewhat slower than Kevin H. Lin's solution (presumably due to the overhead scipy incurs in converting things to numpy arrays under the hood) but faster than calling percentileofscore in a loop as in reptilicus's answer:

In [11]: %timeit [stats.percentileofscore(x, i) for i in x]
1000 loops, best of 3: 414 µs per loop

In [12]: %timeit list_to_percentiles(x)
100000 loops, best of 3: 11.1 µs per loop

In [13]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 39.3 µs per loop
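
Kevin H. Lin's list_to_percentiles is not reproduced in this excerpt; to make the timings easier to follow, here is a rough sketch of a single-sort, pure-Python function in that spirit (my reconstruction under that assumption; it may differ in detail from his actual answer):

def list_to_percentiles(numbers):
    # rank each element via one sort of the indices, then scale ranks to [0, 1]
    # (ties keep their sort order here, unlike rankdata's 'average' method)
    order = sorted(range(len(numbers)), key=lambda i: numbers[i])
    result = [0.0] * len(numbers)
    for rank, original_index in enumerate(order):
        result[original_index] = rank / float(len(numbers) - 1)
    return result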

With a large dataset, however, the performance advantage of numpy takes effect and using rankdata is 10 times faster than Kevin's list_to_percentiles:

In [18]: x = np.random.randint(0, 10000, 1000)

In [19]: %timeit [stats.percentileofscore(x, i) for i in x]
1 loops, best of 3: 437 ms per loop

In [20]: %timeit list_to_percentiles(x)
100 loops, best of 3: 1.08 ms per loop

In [21]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 102 µs per loop

This advantage will only become more pronounced on larger and larger datasets.
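
Finally, if you specifically want the output shape from the question, where the smallest value maps to 0.0 and the largest to 1.0, one option (my addition, not part of the original answer) is to shift the ranks to start at zero and divide by len(x) - 1 instead:

>>> x = [1, 2, 3, 4, 17]
>>> (stats.rankdata(x, 'average') - 1) / (len(x) - 1)
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])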
