Dummy variables when not all categories are present


Question

I have a set of dataframes in which one column holds a categorical variable. I'd like to convert that column into several dummy variables, for which I'd normally use get_dummies.

get_dummies looks at the data available in each dataframe to work out which categories exist, and creates one dummy variable per category it finds. In the problem I'm working on, however, I know the possible categories in advance, and a given dataframe does not necessarily contain all of them.
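
A minimal sketch of the behaviour described here, assuming two hypothetical frames (df1 and df2 are illustrative names, not from the original data) that each contain only part of the known category set:

```python
import pandas as pd

# Hypothetical frames: each one sees only some of the categories 'a', 'b', 'c'.
df1 = pd.DataFrame({'cat': ['a', 'b', 'a']})
df2 = pd.DataFrame({'cat': ['c', 'a']})

# get_dummies only creates columns for the values it finds in each frame.
d1 = pd.get_dummies(df1)
d2 = pd.get_dummies(df2)

print(list(d1.columns))
print(list(d2.columns))
```

The two results end up with different columns, which is what makes them awkward to combine or feed into the same model.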

My question is: is there a way to pass the category names to get_dummies (or an equivalent function), so that for any category that doesn't appear in a given dataframe it simply creates a column of 0s?

Something that would turn this:

categories = ['a', 'b', 'c']

   cat
1   a
2   b
3   a

into this:

   cat_a  cat_b  cat_c
1      1      0      0
2      0      1      0
3      1      0      0

Recommended answer

Use transpose and reindex:

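import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# dtype=float keeps the dummies numeric (newer pandas defaults to bool)
dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=float)
# Transpose so the categories form the index, reindex to the full list,
# transpose back, and fill the newly created column with 0
dummies = dummies.T.reindex(cats).T.fillna(0)

print(dummies)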

     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
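
Two other routes are sometimes used for the same goal; the following is a sketch rather than part of the original answer. It assumes the cat_ prefix and the 1, 2, 3 index from the question, and the variable names (expected_cols, dummies_reindexed, dummies_categorical) are chosen here for illustration. The first reindexes the columns directly with fill_value=0, which avoids the double transpose; the second declares the full category set with pd.CategoricalDtype so that get_dummies emits a column for every category, observed or not.

```python
import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']}, index=[1, 2, 3])

# Option 1: reindex the columns directly; missing categories are filled with 0.
expected_cols = ['cat_' + c for c in cats]
dummies_reindexed = (
    pd.get_dummies(df['cat'], prefix='cat', dtype=int)
    .reindex(columns=expected_cols, fill_value=0)
)

# Option 2: declare the full category set up front, so get_dummies
# creates a column even for categories that never occur in this frame.
typed = df['cat'].astype(pd.CategoricalDtype(categories=cats))
dummies_categorical = pd.get_dummies(typed, prefix='cat', dtype=int)

print(dummies_reindexed)
print(dummies_categorical)
```

Both variants keep integer columns named cat_a, cat_b and cat_c. The categorical route may be preferable when the same column is reused across many dataframes, since the category list then travels with the dtype.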
