Pandas: Crosstab on multiple columns then Groupby
Question
I have a dataframe that looks like this.
df
visit_date sex region status
0 2019-04-01 m as pass
1 2019-04-02 m as pass
2 2019-04-02 f na pass
3 2019-04-03 f na fail
4 2019-04-08 f na pass
5 2019-04-09 f as pass
6 2019-04-09 m na pass
7 2019-04-10 m as fail
8 2019-04-15 f as fail
9 2019-04-15 m na pass
10 2019-04-16 f na pass
11 2019-04-17 f na fail
visit_date is datetime and the rest are categorical (object).
I want to count each value in each column per week, then turn the counted values into columns.
Expected result:
f m as na fail pass
visit_date
2019-04-07 2 2 2 2 1 3
2019-04-14 2 2 2 2 1 3
2019-04-21 3 1 1 3 2 2
I used pd.crosstab and groupby.
import pandas as pd

df.visit_date = pd.to_datetime(df.visit_date)
cols = ['sex', 'region', 'status']
# one daily crosstab per column, joined side by side on visit_date
df2 = pd.crosstab(df['visit_date'], df[cols[0]])
for i in range(1, len(cols)):
    df2 = df2.join(pd.crosstab(df['visit_date'], df[cols[i]]))
# second pass: sum the daily counts into weekly bins
df2.groupby([pd.Grouper(level='visit_date', freq='1W')]).sum()
f m as na fail pass
visit_date
2019-04-07 2 2 2 2 1 3
2019-04-14 2 2 2 2 1 3
2019-04-21 3 1 1 3 2 2
The problem is that I have to group a second time to sum the counts per week, and it is too slow. My actual data has ~100+ columns and ~1 million+ rows.
Is there a faster way to get the same result?
Recommended answer
Use DataFrame.melt with DataFrameGroupBy.size and Series.unstack to reshape:
cols = ['sex', 'region', 'status']
df1 = (df.melt(id_vars='visit_date', value_vars=cols)
         .groupby([pd.Grouper(key='visit_date', freq='1W'), 'value'])
         .size()
         .unstack(fill_value=0))
print(df1)
value as f fail m na pass
visit_date
2019-04-07 2 2 1 2 2 3
2019-04-14 2 2 1 2 2 3
2019-04-21 1 3 2 1 3 2
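To make the steps above reproducible, here is a minimal self-contained sketch that rebuilds the sample data from the question and runs the same melt / groupby / size / unstack chain. The names long_df, weekly and wanted are only illustrative. The final reindex is an addition not in the answer; it assumes you want the column order shown in the question's expected result, since unstack returns the columns sorted alphabetically.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "visit_date": pd.to_datetime([
        "2019-04-01", "2019-04-02", "2019-04-02", "2019-04-03",
        "2019-04-08", "2019-04-09", "2019-04-09", "2019-04-10",
        "2019-04-15", "2019-04-15", "2019-04-16", "2019-04-17",
    ]),
    "sex": list("mmffffmmfmff"),
    "region": ["as", "as", "na", "na", "na", "as",
               "na", "as", "as", "na", "na", "na"],
    "status": ["pass", "pass", "pass", "fail", "pass", "pass",
               "pass", "fail", "fail", "pass", "pass", "fail"],
})
cols = ["sex", "region", "status"]

# melt stacks the three columns into one long 'value' column
# (12 rows x 3 columns -> 36 rows), keeping visit_date on every row.
long_df = df.melt(id_vars="visit_date", value_vars=cols)

# Count rows per (week, value) in a single groupby, then pivot the
# values into columns; fill_value=0 covers values absent in a week.
weekly = (
    long_df
    .groupby([pd.Grouper(key="visit_date", freq="W"), "value"])
    .size()
    .unstack(fill_value=0)
)

# Illustrative extra step: restore the column order used in the question.
wanted = ["f", "m", "as", "na", "fail", "pass"]
weekly = weekly.reindex(columns=wanted)
print(weekly)
```

The weekly binning and the counting happen in one groupby here, so no second aggregation pass over daily counts is needed.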
It is also possible to create a MultiIndex:
cols = ['sex', 'region', 'status']
df2 = (df.melt(id_vars='visit_date', value_vars=cols)
         .groupby([pd.Grouper(key='visit_date', freq='1W'), 'variable', 'value'])
         .size()
         .unstack(level=[1, 2], fill_value=0))
print(df2)
variable region sex status
value as na f m fail pass
visit_date
2019-04-07 2 2 2 2 1 3
2019-04-14 2 2 2 2 1 3
2019-04-21 1 3 3 1 2 2
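One reason to prefer the MultiIndex form: grouping on 'value' alone merges identical labels that come from different source columns. The sketch below uses a small hypothetical frame (the answer column and its data are made up for illustration and are not from the question) in which "na" appears in two columns, and compares the two groupings.

```python
import pandas as pd

# Hypothetical data: "na" occurs in both columns with different meanings.
df = pd.DataFrame({
    "visit_date": pd.to_datetime(["2019-04-01", "2019-04-02", "2019-04-03"]),
    "region": ["na", "as", "na"],
    "answer": ["na", "yes", "yes"],
})
long_df = df.melt(id_vars="visit_date", value_vars=["region", "answer"])
week = pd.Grouper(key="visit_date", freq="W")

# Grouping by 'value' only: both kinds of "na" land in one column.
flat = long_df.groupby([week, "value"]).size().unstack(fill_value=0)

# Grouping by 'variable' and 'value': each source column keeps its own counts.
by_col = (
    long_df
    .groupby([week, "variable", "value"])
    .size()
    .unstack(level=[1, 2], fill_value=0)
)

print(flat["na"].iloc[0])                 # combined count across both columns
print(by_col[("region", "na")].iloc[0])   # count for region only
print(by_col[("answer", "na")].iloc[0])   # count for answer only
```

In the question's sample data the labels happen to be unique across columns, so both forms give the same counts there; with ~100 columns, overlapping labels seem more likely.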
So it is possible to select by the first level:
print(df2['region'])
value as na
visit_date
2019-04-07 2 2
2019-04-14 2 2
2019-04-21 1 3
Or flatten the column labels:
df2.columns = df2.columns.map('_'.join)
print(df2)
region_as region_na sex_f sex_m status_fail status_pass
visit_date
2019-04-07 2 2 2 2 1 3
2019-04-14 2 2 2 2 1 3
2019-04-21 1 3 3 1 2 2
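As a sanity check, the following sketch runs the crosstab loop from the question and the melt-based approach on the same sample data and compares the weekly counts. The names slow and fast are illustrative only. It checks agreement on this small sample and makes no claim about timing; measuring speed would need data closer to the real size.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "visit_date": pd.to_datetime([
        "2019-04-01", "2019-04-02", "2019-04-02", "2019-04-03",
        "2019-04-08", "2019-04-09", "2019-04-09", "2019-04-10",
        "2019-04-15", "2019-04-15", "2019-04-16", "2019-04-17",
    ]),
    "sex": list("mmffffmmfmff"),
    "region": ["as", "as", "na", "na", "na", "as",
               "na", "as", "as", "na", "na", "na"],
    "status": ["pass", "pass", "pass", "fail", "pass", "pass",
               "pass", "fail", "fail", "pass", "pass", "fail"],
})
cols = ["sex", "region", "status"]

# Approach from the question: daily crosstabs joined, then summed per week.
daily = pd.crosstab(df["visit_date"], df[cols[0]])
for col in cols[1:]:
    daily = daily.join(pd.crosstab(df["visit_date"], df[col]))
slow = daily.groupby(pd.Grouper(level="visit_date", freq="W")).sum()

# Approach from the answer: melt, then one groupby with size and unstack.
fast = (
    df.melt(id_vars="visit_date", value_vars=cols)
      .groupby([pd.Grouper(key="visit_date", freq="W"), "value"])
      .size()
      .unstack(fill_value=0)
)

# Align the column order before comparing the counts.
fast = fast.reindex(columns=list(slow.columns))
same = (fast.to_numpy() == slow.to_numpy()).all()
print(same)
```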