Why doesn't pandas allow a categorical column to be used in groupby?

Problem description

I would like to create a custom-sorted DataFrame. To do this I used pandas.Categorical(); however, if I then use the result in a groupby, NaN values are returned.

# import the pandas module
import pandas as pd

# Create an example dataframe
raw_data = {'Date': ['2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13','2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13'],
        'Portfolio': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
        'Duration': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
        'Yield': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1],}

df = pd.DataFrame(raw_data, columns = ['Date', 'Portfolio', 'Duration', 'Yield'])

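# Make Portfolio categorical with the custom order C, B, A, then sort by it
df['Portfolio'] = pd.Categorical(df['Portfolio'], categories=['C', 'B', 'A'])
df = df.sort_values('Portfolio')

# Sum each (Date, Portfolio) group; this is the call that returns NaN
dfs = df.groupby(['Date', 'Portfolio'], as_index=False).sum()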

print(dfs)

                        Date    Portfolio   Duration   Yield
Date        Portfolio               
13/05/2016  C           NaN     NaN         NaN        NaN
            B           NaN     NaN         NaN        NaN
            A           NaN     NaN         NaN        NaN

Why is this, and how can I overcome it?

Also, a SettingWithCopyWarning is raised. Is there a better idiom for Categorical?

Recommended answer

The as_index=False argument is what breaks the result. If I run just:

dfs = df.groupby(['Date','Portfolio']).sum()

I get:

                      Duration  Yield
Date       Portfolio                 
2016-05-13 C                18    6.0
           B                10   10.0
           A                 6    1.8

I don't know why this happens. It may be a bug.

If you really want the result without the index, with 'Date' and 'Portfolio' as ordinary columns, use reset_index():

dfs = df.groupby(['Date','Portfolio']).sum().reset_index()

         Date Portfolio  Duration  Yield
0  2016-05-13         C        18    6.0
1  2016-05-13         B        10   10.0
2  2016-05-13         A         6    1.8
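The answer above does not address the SettingWithCopyWarning part of the question. The following is only a sketch of one possible idiom, not a confirmed fix: it assumes the warning arose because df was a slice of a larger frame (the question does not show this), so it takes an explicit copy and sets the custom order through a categorical dtype. The names source and order, the reduced sample data, and the observed=True argument are additions for illustration and are not part of the original post.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Small stand-in for the question's data (values are illustrative).
source = pd.DataFrame({
    'Date': ['2016-05-13'] * 6,
    'Portfolio': ['A', 'A', 'B', 'C', 'C', 'X'],
    'Duration': [1, 1, 2, 3, 3, 9],
    'Yield': [0.3, 0.3, 2.0, 1.0, 1.0, 5.0],
})

# Assumption: the warning came from assigning into a slice of another frame.
# An explicit copy makes the later column assignment unambiguous.
df = source[source['Portfolio'] != 'X'].copy()

# Express the custom order C, B, A as a categorical dtype.
order = CategoricalDtype(categories=['C', 'B', 'A'], ordered=True)
df['Portfolio'] = df['Portfolio'].astype(order)

# Group with the keys in the index, then move them back into columns.
# observed=True (an addition here) keeps only categories present in the data.
dfs = df.groupby(['Date', 'Portfolio'], observed=True).sum().reset_index()

print(dfs)
```

Because groupby sorts a categorical key by its category order, the rows should come out as C, B, A without a separate sort_values call.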
