One-hot encoding for words which occur in multiple columns


Problem description

I want to create one-hot encoded data from categorical data, which you can see here.

        Label1          Label2        Label3  
0   Street fashion        Clothing       Fashion
1         Clothing       Outerwear         Jeans
2     Architecture        Property      Clothing
3         Clothing           Black      Footwear
4            White      Photograph        Beauty

The problem (for me) is that one specific label (e.g. Clothing) can appear in Label1, Label2 or Label3. I tried pd.get_dummies, but it created a separate dummy column per source column, like this:

   Label1_Clothing  Label2_Clothing  Label3_Clothing
0                0                1                0
1                1                0                0
2                0                0                1

Is there a way to have only one dummy variable column for each label, regardless of which column it appears in? Something like this instead:

   Label_Clothing  Label_Street Fashion  Label_Architecture
0               1                     1                   0
1               1                     0                   0
2               1                     0                   1

I am pretty new to programming and would be very glad for your help.

Best, Bernardo

Recommended answer

You can stack your dataframe into a single Series and then get the dummies from that. Stacking produces a two-level index (original row, source column), so each label becomes its own row. From there, take the maximum over the outer index level to collapse the data back to one row per original row, with a 1 wherever that row contained the label in any column:

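import pandas as pd

# df is the dataframe from the question (columns Label1, Label2, Label3).
# groupby(level=0).max() is used because .max(level=0) was removed in pandas 2.0;
# dtype=int keeps the 0/1 output shown below (newer pandas defaults to booleans).
dummies = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()

print(dummies)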
   Architecture  Beauty  Black  Clothing  Fashion  Footwear  Jeans  Outerwear  Photograph  Property  Street fashion  White
0             0       0      0         1        1         0      0          0           0         0               1      0
1             0       0      0         1        0         0      1          1           0         0               0      0
2             1       0      0         1        0         0      0          0           0         1               0      0
3             0       0      1         1        0         1      0          0           0         0               0      0
4             0       1      0         0        0         0      0          0           1         0               0      1
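
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and applies the stack-then-dummies approach. The per-row maximum is written as groupby(level=0).max(), which is the form current pandas releases accept. The add_prefix("Label_") step and the variable names stacked and labelled are my own additions, included only to match the column naming asked for in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Label1": ["Street fashion", "Clothing", "Architecture", "Clothing", "White"],
    "Label2": ["Clothing", "Outerwear", "Property", "Black", "Photograph"],
    "Label3": ["Fashion", "Jeans", "Clothing", "Footwear", "Beauty"],
})

# One Series indexed by (original row, source column).
stacked = df.stack()

# One-hot encode every label, then collapse back to one row per original row.
dummies = pd.get_dummies(stacked, dtype=int).groupby(level=0).max()

# Optional (assumption): prefix the columns as in the desired output.
labelled = dummies.add_prefix("Label_")

print(labelled["Label_Clothing"].tolist())  # [1, 1, 1, 1, 0]
```

Taking the maximum rather than the sum keeps each value at 0 or 1 even if a label were to appear twice in the same row.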
