Retaining categorical dtype upon dataframe concatenation

Question

I have two dataframes with identical column names and dtypes, similar to the following:

A             object
B             category
C             category

The categories are not identical across the two dataframes.

When I concatenate them normally, pandas outputs:

A             object
B             object
C             object

This is the expected behaviour according to the documentation.
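To make the behaviour described above concrete, here is a minimal sketch. The frames `left` and `right` and their values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Two illustrative frames whose categorical column has different categories.
left = pd.DataFrame({"B": pd.Categorical(["dog", "cat"])})
right = pd.DataFrame({"B": pd.Categorical(["cat", "rat"])})

# Categories differ, so the concatenated column is no longer categorical
# (the question reports object dtype).
mixed = pd.concat([left, right], ignore_index=True)
print(mixed["B"].dtype)

# When both inputs share exactly the same categories, the dtype survives.
same = pd.concat([left, left], ignore_index=True)
print(same["B"].dtype)
```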

However, I want to keep the columns categorical and union their categories, so I tried applying union_categoricals to every column that is categorical in both dataframes. cdf and df are my two dataframes.

for column in df:
    if df[column].dtype.name == "category" and cdf[column].dtype.name == "category":
        print (column)
        union_categoricals([cdf[column], df[column]], ignore_order=True)

cdf = pd.concat([cdf,df])

This still does not give me a categorical output.

Recommended answer

I don't think this is completely obvious from the documentation, but you could do something like the following. Here's some sample data:

import pandas as pd

df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})

Use union_categoricals to get consistent categories across the dataframes. Try df1.x.cat.codes (and likewise for df2) if you need to convince yourself that this works.

from pandas.api.types import union_categoricals

# Build the combined category set, then re-encode each column against it.
uc = union_categoricals([df1.x, df2.x])
df1.x = pd.Categorical(df1.x, categories=uc.categories)
df2.x = pd.Categorical(df2.x, categories=uc.categories)
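If you want to check the integer codes as suggested, something like the following sketch may help. It rebuilds the same sample data so that it runs on its own; the names `codes_before` and `codes_after` are only illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

df1 = pd.DataFrame({"x": pd.Categorical(["dog", "cat"])})
df2 = pd.DataFrame({"x": pd.Categorical(["cat", "rat"])})

# Before alignment each frame has its own category list, so the same
# integer code can stand for different labels in df1 and df2.
codes_before = (df1["x"].cat.codes.tolist(), df2["x"].cat.codes.tolist())

uc = union_categoricals([df1["x"], df2["x"]])
df1["x"] = pd.Categorical(df1["x"], categories=uc.categories)
df2["x"] = pd.Categorical(df2["x"], categories=uc.categories)

# After alignment both columns share one category list, so codes are comparable.
codes_after = (df1["x"].cat.codes.tolist(), df2["x"].cat.codes.tolist())

print(list(uc.categories))
print(codes_before)
print(codes_after)
```

The values themselves are unchanged; only the mapping from labels to codes is unified.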

Concatenate and verify that the dtype is categorical.

df3 = pd.concat([df1,df2])

df3.x.dtypes
category
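Note that the loop in the question discards the return value of union_categoricals, which returns a new Categorical rather than modifying its inputs, so the columns are never updated. For frames with several categorical columns, as in the question, the same idea can be wrapped in a small helper. This is only a sketch: `concat_keep_categories` is a hypothetical name, not a pandas function, and it assumes every frame has the same columns and that the categoricals are unordered.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concat_keep_categories(frames):
    """Hypothetical helper: concatenate frames, keeping columns that are
    categorical in every frame categorical in the result."""
    frames = [f.copy() for f in frames]  # leave the caller's frames untouched
    for col in frames[0].columns:
        if all(isinstance(f[col].dtype, pd.CategoricalDtype) for f in frames):
            # Combined category set for this column across all frames.
            uc = union_categoricals([f[col] for f in frames])
            for f in frames:
                f[col] = pd.Categorical(f[col], categories=uc.categories)
    return pd.concat(frames, ignore_index=True)


# Made-up data shaped like the question's frames (A: object, B and C: category).
df = pd.DataFrame({"A": ["a1", "a2"],
                   "B": pd.Categorical(["dog", "cat"]),
                   "C": pd.Categorical(["x", "y"])})
cdf = pd.DataFrame({"A": ["a3", "a4"],
                    "B": pd.Categorical(["cat", "rat"]),
                    "C": pd.Categorical(["y", "z"])})

combined = concat_keep_categories([cdf, df])
print(combined.dtypes)
```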

As @C8H10N4O2 suggests, you could also just coerce the object columns back to categoricals after concatenating. Honestly, for smaller datasets I think that's the best way to do it, simply because it's simpler. But for larger dataframes, using union_categoricals should be much more memory efficient, since the concatenated columns never pass through object dtype.
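The simpler alternative might look like this. It is a sketch using the same sample data, rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({"x": pd.Categorical(["dog", "cat"])})
df2 = pd.DataFrame({"x": pd.Categorical(["cat", "rat"])})

# Concatenate first (the column loses its categorical dtype here),
# then convert the combined column back to a categorical.
df3 = pd.concat([df1, df2], ignore_index=True)
df3["x"] = df3["x"].astype("category")

print(df3["x"].dtype)
print(list(df3["x"].cat.categories))
```

To compare the two approaches on your own data, Series.memory_usage(deep=True) reports the size of a column before and after conversion.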
