Vectorize percentile value of column B of column A (for groups)

Posted: 2020/9/6 5:45:39 · Tags: python, pandas, scipy, apply, percentile

Question

For every pair of src and dest airport cities I want to return a percentile of column a given a value of column b.

I can do this manually, as shown further down. Here is an example df with only 2 pairs of src/dest (my actual df has thousands):

           dt  src  dest       a       b
0  2016-01-01  YYZ   SFO  548.12  279.28
1  2016-01-01  DFW   PDX  111.35  -65.50
2  2016-02-01  YYZ   SFO   64.84  342.35
3  2016-02-01  DFW   PDX   63.81   61.64
4  2016-03-01  YYZ   SFO  614.29  262.83

{'a': {0: 548.12,
  1: 111.34999999999999,
  2: 64.840000000000003,
  3: 63.810000000000002,
  4: 614.28999999999996,
  5: -207.49000000000001,
  6: 151.31999999999999,
  7: -56.43,
  8: 611.37,
  9: -296.62,
  10: 6417.5699999999997,
  11: -376.25999999999999,
  12: 465.12,
  13: -821.73000000000002,
  14: 1270.6700000000001,
  15: -1410.0899999999999,
  16: 1312.6600000000001,
  17: -326.25999999999999,
  18: 1683.3699999999999,
  19: -24.440000000000001,
  20: 583.60000000000002,
  21: -5.2400000000000002,
  22: 1122.74,
  23: 195.21000000000001,
  24: 97.040000000000006,
  25: 133.94},
 'b': {0: 279.27999999999997,
  1: -65.5,
  2: 342.35000000000002,
  3: 61.640000000000001,
  4: 262.82999999999998,
  5: 115.89,
  6: 268.63999999999999,
  7: 2.3500000000000001,
  8: 91.849999999999994,
  9: 62.119999999999997,
  10: 778.33000000000004,
  11: -142.78,
  12: 1675.53,
  13: -214.36000000000001,
  14: 983.80999999999995,
  15: -207.62,
  16: 632.13999999999999,
  17: -132.53,
  18: 422.36000000000001,
  19: 13.470000000000001,
  20: 642.73000000000002,
  21: -144.59999999999999,
  22: 213.15000000000001,
  23: -50.200000000000003,
  24: 338.27999999999997,
  25: -129.69},
 'dest': {0: 'SFO',
  1: 'PDX',
  2: 'SFO',
  3: 'PDX',
  4: 'SFO',
  5: 'PDX',
  6: 'SFO',
  7: 'PDX',
  8: 'SFO',
  9: 'PDX',
  10: 'SFO',
  11: 'PDX',
  12: 'SFO',
  13: 'PDX',
  14: 'SFO',
  15: 'PDX',
  16: 'SFO',
  17: 'PDX',
  18: 'SFO',
  19: 'PDX',
  20: 'SFO',
  21: 'PDX',
  22: 'SFO',
  23: 'PDX',
  24: 'SFO',
  25: 'PDX'},
 'dt': {0: Timestamp('2016-01-01 00:00:00'),
  1: Timestamp('2016-01-01 00:00:00'),
  2: Timestamp('2016-02-01 00:00:00'),
  3: Timestamp('2016-02-01 00:00:00'),
  4: Timestamp('2016-03-01 00:00:00'),
  5: Timestamp('2016-03-01 00:00:00'),
  6: Timestamp('2016-04-01 00:00:00'),
  7: Timestamp('2016-04-01 00:00:00'),
  8: Timestamp('2016-05-01 00:00:00'),
  9: Timestamp('2016-05-01 00:00:00'),
  10: Timestamp('2016-06-01 00:00:00'),
  11: Timestamp('2016-06-01 00:00:00'),
  12: Timestamp('2016-07-01 00:00:00'),
  13: Timestamp('2016-07-01 00:00:00'),
  14: Timestamp('2016-08-01 00:00:00'),
  15: Timestamp('2016-08-01 00:00:00'),
  16: Timestamp('2016-09-01 00:00:00'),
  17: Timestamp('2016-09-01 00:00:00'),
  18: Timestamp('2016-10-01 00:00:00'),
  19: Timestamp('2016-10-01 00:00:00'),
  20: Timestamp('2016-11-01 00:00:00'),
  21: Timestamp('2016-11-01 00:00:00'),
  22: Timestamp('2016-12-01 00:00:00'),
  23: Timestamp('2016-12-01 00:00:00'),
  24: Timestamp('2017-01-01 00:00:00'),
  25: Timestamp('2017-01-01 00:00:00')},
 'src': {0: 'YYZ',
  1: 'DFW',
  2: 'YYZ',
  3: 'DFW',
  4: 'YYZ',
  5: 'DFW',
  6: 'YYZ',
  7: 'DFW',
  8: 'YYZ',
  9: 'DFW',
  10: 'YYZ',
  11: 'DFW',
  12: 'YYZ',
  13: 'DFW',
  14: 'YYZ',
  15: 'DFW',
  16: 'YYZ',
  17: 'DFW',
  18: 'YYZ',
  19: 'DFW',
  20: 'YYZ',
  21: 'DFW',
  22: 'YYZ',
  23: 'DFW',
  24: 'YYZ',
  25: 'DFW'}}

I want one percentile per (src, dest) pair, so there should be exactly one percentile value for each pair. For each pair, I take the value of b on date 2017-01-01 and want its percentile within the whole of column a for that pair. Does that make sense?

I can do this manually for a specific pair, e.g. src=YYZ and dest=SFO:

from scipy import stats
import datetime as dt
import pandas as pd

p0 = dt.datetime(2017,1,1)

# slice df for src=YYZ and dest=SFO on date p0 and take its b value
x = df[(df.src == 'YYZ') &
       (df.dest == 'SFO') &
       (df.dt == p0)].b.values[0]

# given b, which percentile does it fall in within column a?
# (note: df['a'] here is the whole column, not only the YYZ/SFO rows)
stats.percentileofscore(df['a'], x)
# output: 61.53846153846154

In the above case I did this manually for the pair YYZ/SFO. However, I have thousands of pairs in my df.

How do I vectorize this using pandas features rather than looping through every pair?

Is there a way to use groupby together with apply and a function?

My desired df should look something like:

    src dest  percentile
0   YYZ SFO   61.54
1   DFW PDX   23.07
2   XXX YYY   blahblah1
3   AAA BBB   blahblah2
...

Update:

I implemented the following:

def b_percentile_a(df,x,y,b):
    z = df[(df['src'] == x ) & (df['dest'] == y)].a
    r = stats.percentileofscore(z,b)
    return r

# rows on the reference date; .copy() avoids pandas' SettingWithCopyWarning
b_vector_df = df[df.dt == p0].copy()

b_vector_df['p0_a_percentile_b'] = \
    b_vector_df.apply(lambda x: b_percentile_a(df, x.src, x.dest, x.b), axis=1)

It takes 5.16 seconds for 100 pairs. I have 55,000 pairs, so one run will take roughly 50 minutes. I need to run this 36 times, so the total run time is well over a day.

Is there a faster approach?

Recommended answer

This saves a huge amount of time!

Output:
Size of a_list: 49998 Randomized unique values
percentile_1 (Your given df - scipy)
computed percentile 104 times - 104 records in 0:00:07.777022

percentile_9 (class PercentileOfScore(rank_searchsorted_list) using given df)
computed percentile 104 times - 104 records in 0:00:00.000609

     dt          src  dest  a       b        pct                scipy
 0:  2016-01-01  YYZ  SFO   548.12  279.28   74.81299251970079  74.8129925197
 1:  2016-01-01  DFW  PDX   111.35  -65.5    24.66698667946718  24.6669866795
 2:  2016-02-01  YYZ  SFO   64.84   342.35   76.4810592423697   76.4810592424
 3:  2016-02-01  DFW  PDX   63.81   61.64    63.84655386215449  63.8465538622
...
24:  2017-01-01  YYZ  SFO   97.04   338.28   76.3570542821712   76.3570542822
25:  2017-01-01  DFW  PDX   133.94  -129.69  21.4668586743469   21.4668586743

Looking at the implementation of scipy.stats.percentileofscore, I found that the whole list (a) is copied, inserted into, sorted and searched on every call of percentileofscore.

I implemented my own class PercentileOfScore, which sorts the list only once:

import numpy as np
class PercentileOfScore(object):

    def __init__(self, aList):
        self.a = np.array( aList )
        self.a.sort()
        self.n = float(len(self.a))
        self.pct = self.__rank_searchsorted_list
    # end def __init__

    def __rank_searchsorted_list(self, score_list):
        # For each score, count how many values of the sorted array are <= score
        adx = np.searchsorted(self.a, score_list, side='right')
        # Convert the counts to percentages; self.n is a float, so this is
        # true division on Python 2.x as well
        return (adx / self.n * 100.0).tolist()
    # end def __rank_searchsorted_list
# end class PercentileOfScore
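
The idea behind the class can be shown in isolation. The following is only a minimal sketch, not part of the original answer: the helper name `percent_at_or_below` and the sample values are made up for illustration. It sorts the reference values once and then answers any number of lookups with a binary search. With `side='right'` the count includes values equal to the score, which is the "less than or equal" reading (this matches what SciPy documents as `kind='weak'`; the default `kind='rank'` can differ when the score ties with values in the list).

```python
import numpy as np


def percent_at_or_below(sorted_values, scores):
    """Percent of sorted_values that are <= each score (hypothetical helper)."""
    # side="right" returns, for each score, the number of elements <= score
    counts = np.searchsorted(sorted_values, scores, side="right")
    return counts / float(len(sorted_values)) * 100.0


# Sort the reference values once ...
values = np.sort(np.array([5.0, 1.0, 4.0, 2.0, 3.0]))
# ... then look up as many scores as needed without re-sorting
pcts = percent_at_or_below(values, [0.5, 3.0, 10.0])
print(pcts)  # 0 of 5, 3 of 5 and 5 of 5 values are <= the scores
```

Each lookup costs a binary search over the sorted array instead of a fresh copy and sort, which is where the time saving comes from.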

I don't think that def percentile_7 will fit your needs as is, because dt is not taken into account.

PctOS = None
def percentile_7(df_flat):
    global PctOS
    result = {}
    for k in df_flat.pair_dict.keys():
        # df_flat.pair_dict = { 'src.dst': [b,b,...bn] }
        result[k] = PctOS.pct( df_flat.pair_dict[k] )

    return result
# end def percentile_7

In your manual sample you use the whole df.a. In this sample it is df_flat.a_list, but I'm not sure if this is what you want?

from PercentileData import DF_flat
def main():
    # DF_flat.data = {'dt.src.dest':[a,b]}
    df_flat = DF_flat()

    # Instantiate Global PctOS
    global PctOS
    # df_flat.a_list = [a,a,...an]
    PctOS = PercentileOfScore(df_flat.a_list)

    result = percentile_7(df_flat)
    # result = dict{'src.dst':[pct,pct...pctn]}
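
The `PercentileData` module and its `DF_flat` class are not shown in the answer. As a rough, hypothetical stand-in, the sketch below builds the two attributes that the code above relies on (`a_list` and `pair_dict`) from a pandas df with the question's columns. The class name `DFFlatSketch`, the `'src.dest'` key format and the tiny sample df are assumptions taken from the code comments, not the answerer's actual implementation.

```python
import pandas as pd


class DFFlatSketch(object):
    """Hypothetical stand-in for DF_flat, which the answer does not show."""

    def __init__(self, df):
        # a_list = [a, a, ... an]: every value of column a
        self.a_list = df["a"].tolist()
        # pair_dict = {'src.dest': [b, b, ... bn]}: b values grouped per pair
        self.pair_dict = {
            "{}.{}".format(src, dest): group["b"].tolist()
            for (src, dest), group in df.groupby(["src", "dest"], sort=False)
        }


# Tiny sample using the first four rows of the question's df
df = pd.DataFrame({
    "src": ["YYZ", "DFW", "YYZ", "DFW"],
    "dest": ["SFO", "PDX", "SFO", "PDX"],
    "a": [548.12, 111.35, 64.84, 63.81],
    "b": [279.28, -65.50, 342.35, 61.64],
})
flat = DFFlatSketch(df)
print(flat.pair_dict)
```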

Tested with Python 3.4.2 and 2.7.9, numpy 1.8.2.
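
If the per-pair reading from the question is what is wanted (the b value on the reference date compared only against that pair's own a values), one possible way to avoid the row-wise apply is a merge followed by a groupby. This is a sketch under assumptions, not the answerer's method: the function name `pair_percentiles` is made up, it assumes exactly one row per pair on the reference date, and it uses the "less than or equal" definition, which agrees with SciPy's default only when b does not tie with any a value.

```python
import pandas as pd


def pair_percentiles(df, p0):
    """Percent of each pair's a values that are <= that pair's b on date p0."""
    # b value of each (src, dest) pair on the reference date
    ref = df.loc[df["dt"] == p0, ["src", "dest", "b"]].rename(columns={"b": "b_ref"})
    # attach the reference b to every row of the same pair
    merged = df.merge(ref, on=["src", "dest"], how="inner")
    # 1.0 where a <= reference b, else 0.0; the group mean is the fraction
    merged["at_or_below"] = (merged["a"] <= merged["b_ref"]).astype(float)
    return (merged.groupby(["src", "dest"], sort=False)["at_or_below"]
                  .mean().mul(100.0).rename("percentile").reset_index())


# Small sample: three dates for each of the two pairs from the question
df = pd.DataFrame({
    "dt": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-02-01",
                          "2016-02-01", "2017-01-01", "2017-01-01"]),
    "src": ["YYZ", "DFW"] * 3,
    "dest": ["SFO", "PDX"] * 3,
    "a": [548.12, 111.35, 64.84, 63.81, 97.04, 133.94],
    "b": [279.28, -65.50, 342.35, 61.64, 338.28, -129.69],
})
result = pair_percentiles(df, pd.Timestamp("2017-01-01"))
print(result)
```

In this sample two of the three YYZ/SFO values of a are at or below 338.28, and none of the DFW/PDX values are at or below -129.69. The comparison and the group mean run as whole-column operations, so the cost no longer grows with one Python-level call per pair.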
