在 pandas GroupBy中将缺少的组键作为NaN包括在内 [英] Include missing group keys as NaN in pandas GroupBy output
问题描述
我的熊猫有一个数据框.
I have a dataframe in pandas.
test_df = pd.DataFrame({'date': ['2018-12-28', '2018-12-28', '2018-12-29', '2018-12-29', '2018-12-30', '2018-12-30'],
'transaction': ['aa', 'bb', 'cc', 'aa', 'bb', 'bb'],
'ccy': ['USD', 'EUR', 'EUR', 'USD', 'USD', 'USD'],
'amt': np.random.random(6)})
test_df:
date transaction ccy amt
2018-12-28 aa USD 0.323439
2018-12-28 bb EUR 0.048948
2018-12-29 cc EUR 0.793263
2018-12-29 aa USD 0.013865
2018-12-30 bb USD 0.658571
2018-12-30 bb USD 0.224951
以下代码为我提供了此输出.
The following code is giving me this output.
grouper = test_df.groupby([pd.Grouper('date'), 'transaction', 'ccy'])
grp_transactions = grouper['amt'].sum().unstack()
输出:
ccy EUR USD
date transaction
2018-12-28 aa NaN 0.323439
bb 0.048948 NaN
2018-12-29 aa NaN 0.013865
cc 0.793263 NaN
2018-12-30 bb NaN 0.883523
我相信这是可以预期的,因为groupby函数将根据上述顺序对列中的值进行分组,进行相应的求和,并且不会为不在DF中的交易创建新行.
I believe this is expected as the groupby function will group values in the columns based on the order above, sum accordingly, and not create new rows for transactions that are not in the DF.
如果使用groupby在特定的一天没有完成交易,熊猫中是否可以包含NaN值? IE.如果我的DF没有事务:cc于2018年12月28日,则两个ccy的输出应均为NaN.
Is there a way in pandas to include NaN values if a transaction is not done on a particular day when using groupby? ie. Output should be NaN for both ccy if my DF does not have transaction: cc on 28/12/2018.
预期输出:
ccy EUR USD
date transaction
2018-12-28 aa NaN 0.323439
bb 0.048948 NaN
cc NaN NaN
2018-12-29 aa NaN 0.013865
bb NaN NaN
cc 0.793263 NaN
2018-12-30 aa NaN NaN
bb NaN 0.883523
cc NaN NaN
任何帮助将不胜感激.谢谢!
Any help would be appreciated. Thanks!
推荐答案
如果在分组之前将交易"转换为分类列,这很容易
This is easy if you convert "transaction" to a categorical column before grouping,
df.transaction = pd.Categorical(df.transaction)
df.groupby(['date', 'transaction', 'ccy']).sum().unstack(2)
amt
ccy EUR USD
date transaction
2018-12-28 aa NaN 0.404488
bb 0.459295 NaN
cc NaN NaN
2018-12-29 aa NaN 0.439354
bb NaN NaN
cc 0.429269 NaN
2018-12-30 aa NaN NaN
bb NaN 1.542451
cc NaN NaN
输出中缺少的类别由NaN表示.在执行数字聚合时通常可以做到这一点.
Missing categories in the output are represented by NaNs. This is usually possible when performing numeric aggregation.
如果您不想修改df
,则可以这样做:
If you don't want to modify df
, this will do:
u = pd.Series(pd.Categorical(df.transaction), name='transaction')
df.groupby(['date', u, 'ccy']).sum().unstack(2)
amt
ccy EUR USD
date transaction
2018-12-28 aa NaN 0.429134
bb 0.852355 NaN
cc NaN NaN
2018-12-29 aa NaN 0.541576
bb NaN NaN
cc 0.994095 NaN
2018-12-30 aa NaN NaN
bb NaN 0.744587
cc NaN NaN
这篇关于在 pandas GroupBy中将缺少的组键作为NaN包括在内的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!