One-hot encoding for words which occur in multiple columns
Problem description
I want to create one-hot encoded data from categorical data, which you can see here:
           Label1      Label2    Label3
0  Street fashion    Clothing   Fashion
1        Clothing   Outerwear     Jeans
2    Architecture    Property  Clothing
3        Clothing       Black  Footwear
4           White  Photograph    Beauty
The problem (for me) is that one specific label (e.g. Clothing) can appear in Label1, Label2 or Label3. I tried pd.get_dummies, but it creates a separate dummy column for each source column, like this:
Label1_Clothing Label2_Clothing Label3_Clothing
0 0 1 0
1 1 0 0
2 0 0 1
Is there a way to get only one dummy variable column per label? Something like this instead:
   Label_Clothing  Label_Street Fashion  Label_Architecture
0               1                     1                   0
1               1                     0                   0
2               1                     0                   1
I am pretty new to programming and would be very glad for your help.
Best, Bernardo
Recommended answer
You can stack your dataframe into a single Series, then get the dummies from that. Stacking produces a two-level index (original row, source column), so taking the maximum over the outer level collapses the data back to one row per original row, with a 1 wherever the label appeared in any of the columns:
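# One-hot encode the stacked labels, then take the max per original row (outer index level).
dummies = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()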
print(dummies)
Architecture Beauty Black Clothing Fashion Footwear Jeans Outerwear Photograph Property Street fashion White
0 0 0 0 1 1 0 0 0 0 0 1 0
1 0 0 0 1 0 0 1 1 0 0 0 0
2 1 0 0 1 0 0 0 0 0 1 0 0
3 0 0 1 1 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0 1 0 0 1
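For reference, here is a self-contained sketch of the same approach that rebuilds the sample frame from the question. It assumes a recent pandas version, where the `level` argument of `max` is no longer available, so it groups by the outer index level instead. The `Label_` prefix is only added to match the column naming asked for in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Label1": ["Street fashion", "Clothing", "Architecture", "Clothing", "White"],
    "Label2": ["Clothing", "Outerwear", "Property", "Black", "Photograph"],
    "Label3": ["Fashion", "Jeans", "Clothing", "Footwear", "Beauty"],
})

# Stack the three label columns into one Series with a
# (row, column) MultiIndex, so every label is a single value.
stacked = df.stack()

# One-hot encode the stacked values, then collapse back to one row
# per original row by taking the max over the outer index level.
dummies = pd.get_dummies(stacked, dtype=int).groupby(level=0).max()

# Optional: prefix the columns as requested in the question.
dummies = dummies.add_prefix("Label_")

print(dummies)
```

Because each row is reduced with `max`, a label that happens to appear twice in the same row still yields a 1 rather than a count. If counts are wanted instead, `sum()` could be used in place of `max()`.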