How do I get the percentile for a row in a pandas dataframe?

Question

Example DataFrame Values -  

0     78
1     38
2     42
3     48
4     31
5     89
6     94
7    102
8    122
9    122  

stats.percentileofscore(temp['INCOME'].values, 38, kind='mean')
15.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='strict')
10.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='weak')
20.0

stats.percentileofscore(temp['INCOME'].values, 38, kind='rank')
20.0

temp['INCOME'].rank(pct=True)
1    0.20 (Only showing the 38 value index)

temp['INCOME'].quantile(0.11)
37.93

temp['INCOME'].quantile(0.12)
38.31999999999999

Based on the results above, you can see that none of these methods is consistent
with the Series.quantile() method.

I need to get the percentile of one column for each row in a dataframe (255M rows), but I can't find any function or method that uses the 'linear interpolation' method that pd.quantile and np.percentile use.

I've tried the following methods/functions:

.rank(pct=True)

This method only returns the values ranked in order; it does not use the percentile method I'm looking for. It is inconsistent with Series.quantile().

scipy.stats.percentileofscore  

This method is closer to what I'm looking for, but for some reason it is still not 100% consistent with the 'linear interpolation' method. There is a related question about this problem, but it has no real answer.

I've looked through every SO answer related to this question, but none of them uses the interpolation method I need, so please do not mark this as a duplicate unless you can verify that they use the same method.

At this point my last option is to find the bin cut-offs for all 100 percentiles and apply them that way, or to calculate the linear interpolation myself, but this seems very inefficient and would take forever to apply to 255M records.

Any other suggestions?

Thanks!

Recommended answer

TL;DR

Use:

sz = temp['INCOME'].size-1
temp['PCNT_LIN'] = temp['INCOME'].rank(method='max').apply(lambda x: 100.0*(x-1)/sz)

   INCOME    PCNT_LIN
0      78   44.444444
1      38   11.111111
2      42   22.222222
3      48   33.333333
4      31    0.000000
5      89   55.555556
6      94   66.666667
7     102   77.777778
8     122  100.000000
9     122  100.000000
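
The same column can also be computed without a per-row `apply`, which matters at 255M rows. The following is a minimal sketch, assuming the column is named `INCOME` as above and that the frame has more than one row (otherwise the denominator is zero); it is a vectorized rewrite of the snippet above, not a different method.

```python
import pandas as pd

temp = pd.DataFrame({"INCOME": [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})

n = len(temp)  # assumed > 1, otherwise (n - 1) is zero
# Ties take the highest rank in their group, as in the snippet above.
rank_max = temp["INCOME"].rank(method="max")
# Whole-column arithmetic instead of .apply(lambda ...): (rank - 1) / (n - 1), in percent.
temp["PCNT_LIN"] = 100.0 * (rank_max - 1) / (n - 1)

print(temp)
```

Arithmetic on a whole Series runs in compiled code, so it avoids calling a Python function once per row.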

Answer

It is actually very simple once you understand the mechanics. When you are looking for the percentile of a score, you already have the scores in each row. The only step left is to recognize that you need the percentage of values that are less than or equal to the selected value. This is exactly what kind='weak' in scipy.stats.percentileofscore() and method='max' in Series.rank() (with pct=True) compute. To invert it, run Series.quantile() with interpolation='lower'.

So the behavior of scipy.stats.percentileofscore(), Series.rank() and Series.quantile() is consistent, see below:

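In[]:
import pandas as pd
import scipy.stats

temp = pd.DataFrame([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], columns=['INCOME'])
# Share of values <= each value (ties take the highest rank)
temp['PCNT_RANK'] = temp['INCOME'].rank(method='max', pct=True)
temp['POF'] = temp['INCOME'].apply(lambda x: scipy.stats.percentileofscore(temp['INCOME'], x, kind='weak'))
# Inverse lookup: quantile with interpolation='lower' returns the original value
temp['QUANTILE_VALUE'] = temp['PCNT_RANK'].apply(lambda x: temp['INCOME'].quantile(x, interpolation='lower'))
temp['RANK'] = temp['INCOME'].rank(method='max')
sz = temp['RANK'].size - 1
# Percentile on the scale used by linear interpolation: (rank - 1) / (n - 1)
temp['PCNT_LIN'] = temp['RANK'].apply(lambda x: (x - 1) / sz)
# Check: the default (linear) quantile maps PCNT_LIN back to INCOME
temp['CHK'] = temp['PCNT_LIN'].apply(lambda x: temp['INCOME'].quantile(x))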

temp

Out[]:
   INCOME  PCNT_RANK    POF  QUANTILE_VALUE  RANK  PCNT_LIN    CHK
0      78        0.5   50.0              78   5.0  0.444444   78.0
1      38        0.2   20.0              38   2.0  0.111111   38.0
2      42        0.3   30.0              42   3.0  0.222222   42.0
3      48        0.4   40.0              48   4.0  0.333333   48.0
4      31        0.1   10.0              31   1.0  0.000000   31.0
5      89        0.6   60.0              89   6.0  0.555556   89.0
6      94        0.7   70.0              94   7.0  0.666667   94.0
7     102        0.8   80.0             102   8.0  0.777778  102.0
8     122        1.0  100.0             122  10.0  1.000000  122.0
9     122        1.0  100.0             122  10.0  1.000000  122.0

Column PCNT_RANK now holds the fraction of values that are less than or equal to the value in column INCOME. If you want the "interpolated" ratio instead, it is in column PCNT_LIN. And because the calculation relies on Series.rank(), it is pretty fast and will crunch your 255M numbers in seconds.
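
The "less than or equal" reading of PCNT_RANK can be cross-checked without SciPy. This is a small sketch, assuming the same ten INCOME values; the variable names are illustrative only. It counts the values `<= x` with `numpy.searchsorted` and compares the result with `rank(method='max', pct=True)`, then shows that `method='average'` gives a different number for the tied value 122.

```python
import numpy as np
import pandas as pd

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122])
ordered = np.sort(income.to_numpy())

# side="right" returns, for each x, how many sorted values are <= x.
weak = np.searchsorted(ordered, income.to_numpy(), side="right") / len(ordered)

rank_max = income.rank(method="max", pct=True)
rank_avg = income.rank(method="average", pct=True)

# method="max" agrees with the "<= x" definition for every row.
print(np.allclose(weak, rank_max))
# The two tied 122s share the average of ranks 9 and 10 under method="average".
print(rank_avg.iloc[-1], rank_max.iloc[-1])
```

The two ranking methods only differ where values are tied, which is why the choice goes unnoticed on data without duplicates.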

Here I will explain how quantile() arrives at its value with linear interpolation:

temp['INCOME'].quantile(0.11)
37.93

Our data temp['INCOME'] has only ten values. According to the formula from the Wikipedia article you linked, the rank of the 11th percentile is

rank = 11*(10-1)/100 + 1 = 1.99

The integer part of the rank is 1, which corresponds to the value 31, and the value with rank 2 (i.e. the next bin) is 38. The fraction is the fractional part of the rank, 0.99. This leads to the result:

 31 + (38-31)*(0.99) = 37.93

For the values themselves, the fractional part has to be zero, so it is easy to invert the calculation and get the percentile:

p = (rank - 1)*100/(10 - 1)
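
The arithmetic above can be written out as a short sketch. The helper names `linear_quantile` and `percentile_of_rank` are hypothetical, introduced only for illustration; they follow the rank formula described above and are not pandas internals.

```python
import math
import pandas as pd

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122])
ordered = income.sort_values().to_numpy()
n = len(ordered)

def linear_quantile(q):
    """Value at quantile q (0..1) using the rank formula above."""
    rank = q * (n - 1) + 1            # 1-based fractional rank, e.g. 1.99 for q=0.11
    lower = math.floor(rank)          # integer part of the rank
    frac = rank - lower               # fractional part of the rank
    upper = min(lower + 1, n)         # next bin, clamped at the last value
    return ordered[lower - 1] + (ordered[upper - 1] - ordered[lower - 1]) * frac

def percentile_of_rank(rank):
    """Inverse step: percentile (0..100) of a value with the given 1-based rank."""
    return (rank - 1) * 100 / (n - 1)

print(linear_quantile(0.11), income.quantile(0.11))  # both follow 31 + (38 - 31) * 0.99
print(percentile_of_rank(2))                         # percentile of the value 38
```

Feeding `percentile_of_rank(2) / 100` back into `linear_quantile` returns 38 up to floating-point rounding, which is the round trip the CHK column demonstrates.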

I hope that makes it clearer.
