Fill missing years grouped by multiple columns and display multiple columns horizontally in order in Pandas
Problem description
For the following dataframe, I want to fill in the missing years (from 2015 to 2017) within each group of city and district, then calculate pct by grouping by the columns city, district and year, and, as the last step, display the value and pct columns horizontally.
city district value year
0 sh a 2 2015
1 sh a 3 2016
2 sh b 5 2015
3 sh b 3 2016
4 bj c 4 2015
5 bj c 3 2017
What I have done so far:
1. Filling in the missing years, which does not work yet:
rng = pd.date_range('2015', '2017', freq='YS').dt.year
df = df.apply(lambda x: x.reindex(rng, fill_value = 0))
2. Calculating pct by grouping by city and district:
df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()
3. Displaying the value and pct columns horizontally, but the column order is not the one I want:
df.pivot_table(columns='year', index=['city','district'], values=['value', 'pct'], fill_value='NaN').reset_index()
The output I get so far:
city district pct value
year 2015 2016 2017 2015 2016 2017
0 bj c NaN NaN -0.25 4.0 NaN 3
1 sh a NaN 0.5 NaN 2.0 3 NaN
2 sh b NaN -0.4 NaN 5.0 3 NaN
How can I get the expected result, which should look like this?
city district 2015 2016 2017
value pct value pct value pct
bj c 4 3
sh a 2 3 0.5
sh b 5 3 -0.4
Thanks.
Recommended answer
Use DataFrame.swaplevel with DataFrame.sort_index to reorder the columns; a reindex-based solution for filling in the missing years is included as well:
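import pandas as pd

# df is the dataframe from the question
# all years that every group should contain: 2015, 2016, 2017
rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
# every city/district/year combination (this also creates pairs absent from the data, e.g. sh/c)
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])
# missing years are added as rows with NaN in value
df = df.set_index(['city','district','year']).reindex(mux)
df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()
df = df.pivot_table(columns='year',
                    index=['city','district'],
                    values=['value', 'pct'],
                    fill_value='NaN')
# move year to the top column level, then sort the columns by year
df = df.swaplevel(0,1, axis=1).sort_index(axis=1, level=0)
print (df)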
year 2015 2016 2017
pct value pct value pct value
city district
bj c NaN 4.0 0.0 NaN -0.25 3
sh a NaN 2.0 0.5 3 0.00 NaN
b NaN 5.0 -0.4 3 0.00 NaN
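Note that sort_index orders the inner column level alphabetically, so pct comes before value, while the question asks for value first. The following is a minimal sketch of one way to get that order, not part of the original answer. It assumes the sample data from the question, works on a wide table (one column per year) instead of pivot_table, and writes the percentage change out by hand as value / previous - 1 so that gaps are not forward-filled; the names years, value, pct, cols and out are illustrative only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'city': ['sh', 'sh', 'sh', 'sh', 'bj', 'bj'],
    'district': ['a', 'a', 'b', 'b', 'c', 'c'],
    'value': [2, 3, 5, 3, 4, 3],
    'year': [2015, 2016, 2015, 2016, 2015, 2017],
})

years = [2015, 2016, 2017]

# One row per existing (city, district) pair, one column per year;
# years missing for a pair become NaN
value = (df.set_index(['city', 'district', 'year'])['value']
           .unstack('year')
           .reindex(columns=years))

# Year-over-year change along the columns; a NaN neighbour yields NaN
# (assumption: gaps should not be forward-filled)
pct = value / value.shift(1, axis=1) - 1

# Stack both tables side by side, put year on top, then force the
# column order year -> (value, pct) explicitly
cols = pd.MultiIndex.from_product([years, ['value', 'pct']], names=['year', None])
out = (pd.concat({'value': value, 'pct': pct}, axis=1)
         .swaplevel(0, 1, axis=1)
         .reindex(columns=cols))
print(out)
```

Reindexing the columns with an explicit MultiIndex makes the order independent of alphabetical sorting. Because gaps are left as NaN here, the pct values differ from the output above wherever a neighbouring year is missing.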
The error:
ValueError: cannot handle a non-unique multi-index!
means there are duplicates in the combinations of the columns passed to groupby, here ['city','district','year']. The solution is to make those combinations unique first, e.g. by aggregating with mean:
print (df)
# city district value year
#0 sh a 2 2015
#0 sh a 20 2015
#1 sh a 3 2016
#2 sh b 5 2015
#3 sh b 3 2016
#4 bj c 4 2015
#5 bj c 3 2017
rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])
print (df.groupby(['city','district','year'])['value'].mean())
city district year
bj c 2015 4
2017 3
sh a 2015 11
2016 3
b 2015 5
2016 3
Name: value, dtype: int64
df = df.groupby(['city','district','year'])['value'].mean().reindex(mux)
print (df)
#city district year
#sh a 2015 11.0
# 2016 3.0
# 2017 NaN
# b 2015 5.0
# 2016 3.0
# 2017 NaN
# c 2015 NaN
# 2016 NaN
# 2017 NaN
#bj a 2015 NaN
# 2016 NaN
# 2017 NaN
# b 2015 NaN
# 2016 NaN
# 2017 NaN
# c 2015 4.0
# 2016 NaN
# 2017 3.0
#Name: value, dtype: float64
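The output above also contains city/district pairs that never occur in the data (for example sh/c and bj/a), because from_product combines every city with every district. The sketch below is one possible variation, not part of the original answer: it first lists the duplicated key rows, then aggregates them, and finally reindexes only against pairs that exist in the data. The names keys, dupes, agg, pairs and filled are illustrative only.

```python
import pandas as pd

# Sample data with a duplicated (sh, a, 2015) row, as in the answer
df = pd.DataFrame({
    'city': ['sh', 'sh', 'sh', 'sh', 'sh', 'bj', 'bj'],
    'district': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'value': [2, 20, 3, 5, 3, 4, 3],
    'year': [2015, 2015, 2016, 2015, 2016, 2015, 2017],
})
keys = ['city', 'district', 'year']

# Rows whose key combination occurs more than once; these are what
# make the MultiIndex non-unique
dupes = df[df.duplicated(keys, keep=False)]
print(dupes)

# Collapse duplicates to one value per key combination
agg = df.groupby(keys)['value'].mean()

# Cross only the pairs that actually occur with every year
pairs = df[['city', 'district']].drop_duplicates()
years = [2015, 2016, 2017]
mux = pd.MultiIndex.from_tuples(
    [(c, d, y) for c, d in pairs.itertuples(index=False) for y in years],
    names=keys,
)

# Missing years become NaN; no rows for non-existent pairs are created
filled = agg.reindex(mux)
print(filled)
```

With the sample data this yields 9 rows (3 pairs times 3 years) instead of the 18 produced by the full product.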