pandas pivot table - ordered categories causing unexpected margins

Problem description

Using Python 3.7 and pandas 0.23.4. I'm trying to build pivot tables from ordered categorical data. When I include margins, the subtotals appear in the wrong order.

import pandas as pd
m='male'
f='female'

data = {'num': [0,1,2,3,4,5,6,7,8,9],
        'age': [1,2,2,3,3,3,3,1,2,3],
        'sex': [f,f,f,f,f,f,f,m,m,m]}
df = pd.DataFrame(data=data)

df['age1'] = pd.Categorical(df['age'],categories=[3,2,1],ordered=True)
df['sex1'] = pd.Categorical(df['sex'],categories=[m,f],ordered=True)
pd.pivot_table(df,values='num',index='age1',columns='sex1',aggfunc='count',margins=True)

Output (incorrect margin order: the 'All' sums are not in the right rows or columns):

sex1  male  female  All
age1                   
3        1       4    2
2        1       2    3
1        1       1    5
All      7       3   10

Expected output (correct margin order):

sex1  male  female  All
age1                   
3        1       4    5
2        1       2    3
1        1       1    2
All      3       7   10

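In this example it might be better to create the categories with ordered=False. However, much of my data is ordered automatically (by pd.cut), so I would like to know whether this is intended behavior and, if so, whether there is a way to remove the ordering from a category that was created with one.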

Edit: here is an example using pd.cut. I changed the 'age' column values so that they appear in the reverse of the cut order.

import pandas as pd
m='male'
f='female'
data = {'num': [0,1,2,3,4,5,6,7,8,9],
        'age': [3,3,3,3,2,2,1,1,2,3],
        'sex': [f,f,f,f,f,f,f,m,m,m]}
df = pd.DataFrame(data=data)
df['cut'] = pd.cut(df['age'],[1,2,3,4],labels=['<2','2','>2'],right=False)
pd.pivot_table(df,values='num',index='cut',columns='sex',aggfunc='count',margins=True)

Output, again with incorrect row margins (corresponding to the ordered category from pd.cut):

sex  female  male  All
cut                   
<2        1     1    5
2         2     1    3
>2        4     1    2
All       7     3   10

The expected output has the row margins in the correct order:

sex  female  male  All
cut                   
<2        1     1    2
2         2     1    3
>2        4     1    5
All       7     3   10

Solution

Here is a fix for your original question: remove the ordered argument from df['age1'] and df['sex1'].

My changes:

import pandas as pd
m = 'male'
f = 'female'

data = {'num': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        'age': [1, 2, 2, 3, 3, 3, 3, 1, 2, 3],
        'sex': [f, f, f, f, f, f, f, m, m, m]}
df = pd.DataFrame(data=data)

df['age1'] = pd.Categorical(df['age'], categories=[3, 2, 1])
df['sex1'] = pd.Categorical(df['sex'], categories=[m, f])
pd.pivot_table(df, values='num', index='age1',
               columns='sex1', aggfunc='count', margins=True)

Output:

sex1  male  female  All
age1                   
3        1       4    5
2        1       2    3
1        1       1    2
All      3       7   10

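From the docs: "Whether or not this categorical is treated as a ordered categorical. If True, the resulting categorical will be ordered. An ordered categorical respects, when sorted, the order of its categories attribute."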

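So it looks like when you pass ordered=True into pd.Categorical, the pivot table does not sort the aggregate by category order but by the order in which the values appear in your dataframe. If you look at your dataframe, female comes before male, so the aggregate keeps that order.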
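
One way to check which totals belong to which category, independently of pivot_table, is to count per category with groupby. The following is a minimal sketch that reuses the data from the question; the names row_totals and col_totals are introduced here for illustration only.

```python
import pandas as pd

m, f = 'male', 'female'
df = pd.DataFrame({'num': list(range(10)),
                   'age': [1, 2, 2, 3, 3, 3, 3, 1, 2, 3],
                   'sex': [f] * 7 + [m] * 3})
df['age1'] = pd.Categorical(df['age'], categories=[3, 2, 1], ordered=True)
df['sex1'] = pd.Categorical(df['sex'], categories=[m, f], ordered=True)

# Count rows per category; observed=False keeps every declared category
row_totals = df.groupby('age1', observed=False)['num'].count()
col_totals = df.groupby('sex1', observed=False)['num'].count()

print(row_totals)
print(col_totals)
```

These per-category counts are what the 'All' column and 'All' row of the pivot table should contain, so they can be compared against the margins line by line.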

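So, to answer your question: this appears to be the intended behavior of ordered, and you should be careful with it if you sort your dataframe elsewhere in your script.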
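
The question also asks how to remove the ordering from a categorical that was created ordered, for example by pd.cut. A possible approach, sketched below, is Series.cat.as_unordered(), which keeps the categories and their sequence but clears the ordered flag. The column name cut_unordered is made up for this example, and the sketch assumes the unordered column then behaves like the unordered categories in the fix above; it has not been checked against pandas 0.23.4 specifically.

```python
import pandas as pd

m, f = 'male', 'female'
df = pd.DataFrame({'num': list(range(10)),
                   'age': [3, 3, 3, 3, 2, 2, 1, 1, 2, 3],
                   'sex': [f] * 7 + [m] * 3})

# pd.cut returns an ordered categorical by default
df['cut'] = pd.cut(df['age'], [1, 2, 3, 4], labels=['<2', '2', '>2'], right=False)

# Same categories in the same sequence, but with the ordered flag cleared
df['cut_unordered'] = df['cut'].cat.as_unordered()

table = pd.pivot_table(df, values='num', index='cut_unordered', columns='sex',
                       aggfunc='count', margins=True)

print(df['cut'].cat.ordered, df['cut_unordered'].cat.ordered)
print(table)
```

If the ordering is needed elsewhere in the script, keeping both columns (as here) avoids losing it. Newer pandas releases also accept ordered=False directly in pd.cut when labels are supplied, which would make the extra column unnecessary.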
