How to make dummy variables from a column of comma-separated values?
Problem description
I am preprocessing data for machine learning and have run into a problem.
Here is what I want to do.
The table is a pandas DataFrame.
My current table is the one on the left (the data input below), and I want to transform it into the one on the right (the expected output below).
The numbers of movies and actors are not fixed.
Data input:
import pandas as pd
df = pd.DataFrame({'name': ['A', 'B', 'C'], 'actors': ['a,b', 'b,d', 'c,m']})
Expected output:
   a  b  c  d  m
A  1  1  0  0  0
B  0  1  0  1  0
C  0  0  1  0  1
Recommended answer
Try the following. (By the way, for the Kaggle movie dataset it is better to use LabelEncoder.)
PS: I did not add the name column; you can simply do out['name'] = df.name, where out is the result. The assignment aligns on the index, so it fits a result that keeps df's default index, as in Option 2.
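Option 1: pd.crosstab
# Split without overwriting df.actors, so the later options can still start from the original strings
df1 = df.set_index('name').actors.str.split(',').apply(pd.Series).stack()
# rename_axis(None, axis=1): the axis must be passed by keyword on current pandas
pd.crosstab(df1.index.get_level_values(0), df1).rename_axis(None).rename_axis(None, axis=1)
Out[246]:
   a  b  c  d  m
A  1  1  0  0  0
B  0  1  0  1  0
C  0  0  1  0  1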
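The crosstab approach above can also be written with explode, which avoids the apply(pd.Series).stack() step. This is a minimal sketch, assuming pandas 0.25 or later (where DataFrame.explode is available); the names long and actor are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'actors': ['a,b', 'b,d', 'c,m']})

# One row per (name, actor) pair; df itself is left unchanged.
# "long" and "actor" are illustrative names, not part of the original answer.
long = (
    df.assign(actor=df['actors'].str.split(','))
      .explode('actor')
      .reset_index(drop=True)
)

# Count each pair, then drop the axis labels that crosstab adds.
out = pd.crosstab(long['name'], long['actor'])
out = out.rename_axis(None).rename_axis(None, axis=1)
print(out)
```

Because the rows are counted rather than flagged, an actor listed twice for the same name would show up as 2 instead of 1.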
Option 2: get_dummies
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
pd.get_dummies(df.actors.str.split(',').apply(pd.Series).stack()).groupby(level=0).sum()
Out[230]:
   a  b  c  d  m
0  1  1  0  0  0
1  0  1  0  1  0
2  0  0  1  0  1
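For a plain delimiter-separated string column like this one, pandas also offers Series.str.get_dummies, which splits and encodes in one call. The following is a sketch under the assumption that the separator is exactly ',' with no surrounding spaces, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'actors': ['a,b', 'b,d', 'c,m']})

# Split each string on ',' and build one 0/1 column per distinct value.
out = df['actors'].str.get_dummies(sep=',')

# Label the rows with the names instead of the default 0..n-1 index.
out.index = df['name'].to_numpy()
print(out)
```

If the real data has spaces after the commas, values such as ' b' and 'b' would become separate columns, so the strings would need to be normalised first.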
Option 3: MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.actors.str.split(',')),columns=mlb.classes_,index=df.name).reset_index()
Out[238]:
  name  a  b  c  d  m
0    A  1  1  0  0  0
1    B  0  1  0  1  0
2    C  0  0  1  0  1
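One reason to consider MultiLabelBinarizer in a machine-learning pipeline is that the fitted object remembers its columns, so later data can be encoded the same way. The sketch below assumes scikit-learn is installed; the row named 'D' is a hypothetical example that does not appear in the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'actors': ['a,b', 'b,d', 'c,m']})

# Fit on the original data; classes_ holds the learned column order.
mlb = MultiLabelBinarizer()
train = pd.DataFrame(mlb.fit_transform(df['actors'].str.split(',')),
                     columns=mlb.classes_, index=df['name'])

# Hypothetical new row, encoded with the columns learned above.
new = pd.DataFrame({'name': ['D'], 'actors': ['a,m']})
encoded = pd.DataFrame(mlb.transform(new['actors'].str.split(',')),
                       columns=mlb.classes_, index=new['name'])
print(encoded)
```

Labels that were not seen during fitting are ignored by transform with a warning, so the column set stays fixed.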