Vectorize the percentile of a column B value within column A (for groups)
Problem description
For every pair of src and dest airport cities, I want to return the percentile at which a given value of column b falls within column a.
Example df with only 2 src/dest pairs (my actual df has thousands):
dt src dest a b
0 2016-01-01 YYZ SFO 548.12 279.28
1 2016-01-01 DFW PDX 111.35 -65.50
2 2016-02-01 YYZ SFO 64.84 342.35
3 2016-02-01 DFW PDX 63.81 61.64
4 2016-03-01 YYZ SFO 614.29 262.83
{'a': {0: 548.12,
1: 111.34999999999999,
2: 64.840000000000003,
3: 63.810000000000002,
4: 614.28999999999996,
5: -207.49000000000001,
6: 151.31999999999999,
7: -56.43,
8: 611.37,
9: -296.62,
10: 6417.5699999999997,
11: -376.25999999999999,
12: 465.12,
13: -821.73000000000002,
14: 1270.6700000000001,
15: -1410.0899999999999,
16: 1312.6600000000001,
17: -326.25999999999999,
18: 1683.3699999999999,
19: -24.440000000000001,
20: 583.60000000000002,
21: -5.2400000000000002,
22: 1122.74,
23: 195.21000000000001,
24: 97.040000000000006,
25: 133.94},
'b': {0: 279.27999999999997,
1: -65.5,
2: 342.35000000000002,
3: 61.640000000000001,
4: 262.82999999999998,
5: 115.89,
6: 268.63999999999999,
7: 2.3500000000000001,
8: 91.849999999999994,
9: 62.119999999999997,
10: 778.33000000000004,
11: -142.78,
12: 1675.53,
13: -214.36000000000001,
14: 983.80999999999995,
15: -207.62,
16: 632.13999999999999,
17: -132.53,
18: 422.36000000000001,
19: 13.470000000000001,
20: 642.73000000000002,
21: -144.59999999999999,
22: 213.15000000000001,
23: -50.200000000000003,
24: 338.27999999999997,
25: -129.69},
'dest': {0: 'SFO',
1: 'PDX',
2: 'SFO',
3: 'PDX',
4: 'SFO',
5: 'PDX',
6: 'SFO',
7: 'PDX',
8: 'SFO',
9: 'PDX',
10: 'SFO',
11: 'PDX',
12: 'SFO',
13: 'PDX',
14: 'SFO',
15: 'PDX',
16: 'SFO',
17: 'PDX',
18: 'SFO',
19: 'PDX',
20: 'SFO',
21: 'PDX',
22: 'SFO',
23: 'PDX',
24: 'SFO',
25: 'PDX'},
'dt': {0: Timestamp('2016-01-01 00:00:00'),
1: Timestamp('2016-01-01 00:00:00'),
2: Timestamp('2016-02-01 00:00:00'),
3: Timestamp('2016-02-01 00:00:00'),
4: Timestamp('2016-03-01 00:00:00'),
5: Timestamp('2016-03-01 00:00:00'),
6: Timestamp('2016-04-01 00:00:00'),
7: Timestamp('2016-04-01 00:00:00'),
8: Timestamp('2016-05-01 00:00:00'),
9: Timestamp('2016-05-01 00:00:00'),
10: Timestamp('2016-06-01 00:00:00'),
11: Timestamp('2016-06-01 00:00:00'),
12: Timestamp('2016-07-01 00:00:00'),
13: Timestamp('2016-07-01 00:00:00'),
14: Timestamp('2016-08-01 00:00:00'),
15: Timestamp('2016-08-01 00:00:00'),
16: Timestamp('2016-09-01 00:00:00'),
17: Timestamp('2016-09-01 00:00:00'),
18: Timestamp('2016-10-01 00:00:00'),
19: Timestamp('2016-10-01 00:00:00'),
20: Timestamp('2016-11-01 00:00:00'),
21: Timestamp('2016-11-01 00:00:00'),
22: Timestamp('2016-12-01 00:00:00'),
23: Timestamp('2016-12-01 00:00:00'),
24: Timestamp('2017-01-01 00:00:00'),
25: Timestamp('2017-01-01 00:00:00')},
'src': {0: 'YYZ',
1: 'DFW',
2: 'YYZ',
3: 'DFW',
4: 'YYZ',
5: 'DFW',
6: 'YYZ',
7: 'DFW',
8: 'YYZ',
9: 'DFW',
10: 'YYZ',
11: 'DFW',
12: 'YYZ',
13: 'DFW',
14: 'YYZ',
15: 'DFW',
16: 'YYZ',
17: 'DFW',
18: 'YYZ',
19: 'DFW',
20: 'YYZ',
21: 'DFW',
22: 'YYZ',
23: 'DFW',
24: 'YYZ',
25: 'DFW'}}
I want the percentile per group of src and dest pairs, so there should be only 1 percentile value for each pair. For each pair, I take the value of b where dt = 2017-01-01 and compute its percentile over the entire column a for that pair. Make sense?
I can do this manually for a specific pair, e.g. src=YYZ and dest=SFO:
from scipy import stats
import datetime as dt
import pandas as pd
p0 = dt.datetime(2017,1,1)
# lets slice df for src=YYZ and dest = SFO
x = df[(df.src =='YYZ') &
(df.dest =='SFO') &
(df.dt ==p0)].b.values[0]
# given B, what percentile does it fall in for the entire column A for YYZ, SFO
stats.percentileofscore(df['a'],x)
61.53846153846154
In the above case, I did this manually for the pair YYZ and SFO. However, I have thousands of pairs in my df.
How do I vectorize this using pandas features rather than looping through every pair?
Surely there is a way to use groupby and apply with a function?
My desired df should look something like:
src dest percentile
0 YYZ SFO 61.54
1 DFW PDX 23.07
2 XXX YYY blahblah1
3 AAA BBB blahblah2
...
Update:
I implemented the following:
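def b_percentile_a(df, x, y, b):
    # All a values for the (src, dest) pair x, y
    z = df[(df['src'] == x) & (df['dest'] == y)].a
    r = stats.percentileofscore(z, b)
    return r

# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
b_vector_df = df[df.dt == p0].copy()
b_vector_df['p0_a_percentile_b'] = \
    b_vector_df.apply(lambda x: b_percentile_a(df, x.src, x.dest, x.b), axis=1)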
It takes 5.16 seconds for 100 pairs. I have 55,000 pairs, so a single run will take ~50 minutes. I need to run this 36 times, so the total comes to more than a day of run time.
Is there a faster approach?
Recommended answer
Huge time savings!
Output:
Size of a_list: 49998 randomized unique values
percentile_1 (your given df, scipy)
computed percentile 104 times - 104 records in 0:00:07.777022
percentile_9 (class PercentileOfScore(rank_searchsorted_list) using given df)
computed percentile 104 times - 104 records in 0:00:00.000609
_ dt src dest a b pct scipy _
0: 2016-01-01 YYZ SFO 54812 279.28 74.81299251970079 74.8129925197
1: 2016-01-01 DFW PDX 111.35 -65.5 24.66698667946718 24.6669866795
2: 2016-02-01 YYZ SFO 64.84 342.35 76.4810592423697 76.4810592424
3: 2016-02-01 DFW PDX 63.81 61.64 63.84655386215449 63.8465538622
...
24: 2017-01-01 YYZ SFO 97.04 338.28 76.3570542821712 76.3570542822
25: 2017-01-01 DFW PDX 133.94 -129.69 21.4668586743469 21.4668586743
Looking at the implementation of scipy.stats.percentileofscore, I found that the whole list a is copied, inserted into, sorted, and searched on every call of percentileofscore.
I implemented my own class PercentileOfScore, which sorts the list only once:
import numpy as np

class PercentileOfScore(object):
    def __init__(self, aList):
        # Sort once up front so every later lookup is a binary search
        self.a = np.array(aList)
        self.a.sort()
        self.n = float(len(self.a))
        self.pct = self.__rank_searchsorted_list
    # end def __init__

    def __rank_searchsorted_list(self, score_list):
        # side='right' gives, for each score, the number of elements <= score
        adx = np.searchsorted(self.a, score_list, side='right')
        pct = []
        for idx in adx:
            # Python 2.x needs explicit type casting float(int)
            pct.append((float(idx) / self.n) * 100.0)
        return pct
    # end def __rank_searchsorted_list
# end class PercentileOfScore
I don't think that def percentile_7 will fit your needs, because dt is not considered.
PctOS = None

def percentile_7(df_flat):
    global PctOS
    result = {}
    for k in df_flat.pair_dict.keys():
        # df_flat.pair_dict = { 'src.dst': [b,b,...bn] }
        result[k] = PctOS.pct( df_flat.pair_dict[k] )
    return result
# end def percentile_7
In your manual sample you use the whole df.a. In this sample it is df_flat.a_list, but I'm not sure whether this is what you want.
from PercentileData import DF_flat

def main():
    # DF_flat.data = {'dt.src.dest':[a,b]}
    df_flat = DF_flat()
    # Instantiate Global PctOS
    global PctOS
    # df_flat.a_list = [a,a,...an]
    PctOS = PercentileOfScore(df_flat.a_list)
    result = percentile_7(df_flat)
    # result = dict{'src.dst':[pct,pct...pctn]}
Tested with Python 3.4.2 and 2.7.9, numpy 1.8.2.
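If dt does need to be taken into account, one possible pandas-only sketch is below. It has not been run against the timings above. It assumes each src/dest pair has at most one row on the reference date, that the percentile is wanted within that pair's own a values (as in the question's update, not over the whole df.a as in the manual example), and that "percentile" means the share of a values less than or equal to b. The function name pair_percentiles, the column names b_ref and a_le_b, and the demo data are made up for illustration.

```python
import pandas as pd

def pair_percentiles(df, p0):
    """Per (src, dest): percent of that pair's a values <= its b on date p0."""
    # Reference b value per pair, taken from the rows dated p0.
    ref = (df.loc[df['dt'] == p0, ['src', 'dest', 'b']]
             .rename(columns={'b': 'b_ref'}))
    # Attach the reference b to every row of the same pair.
    merged = df.merge(ref, on=['src', 'dest'])
    # The mean of a boolean column is the fraction of True values.
    merged['a_le_b'] = merged['a'] <= merged['b_ref']
    return (merged.groupby(['src', 'dest'], sort=False)['a_le_b']
                  .mean()
                  .mul(100.0)
                  .reset_index(name='percentile'))

# Made-up demo data with the same columns as the question's df.
demo = pd.DataFrame({
    'dt': pd.to_datetime(['2016-12-01', '2016-12-01',
                          '2017-01-01', '2017-01-01']),
    'src': ['YYZ', 'DFW', 'YYZ', 'DFW'],
    'dest': ['SFO', 'PDX', 'SFO', 'PDX'],
    'a': [100.0, 10.0, 300.0, 30.0],
    'b': [0.0, 0.0, 200.0, 30.0],
})
out = pair_percentiles(demo, pd.Timestamp('2017-01-01'))
print(out)
```

The merge and the groupby replace the row-wise apply, so the work per pair stays inside pandas rather than in a Python loop. Pairs without a row on p0 are dropped by the inner merge.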