Pandas get_dummies on multiple columns
Question
I have a dataset with multiple columns that I wish to one-hot encode. However, I don't want a separate encoding for each column, because the columns all describe the same set of items. What I want is a single set of dummy variables shared across all the columns. See my code for a better explanation.
Suppose my dataframe looks like this:
In [103]: dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})
In [104]: dum
Out[104]:
  ch1 ch2 ch3
0   A   B   C
1   C   G   D
2   A   F   E
If I run
pd.get_dummies(dum)
the output will be
   ch1_A  ch1_C  ch2_B  ch2_F  ch2_G  ch3_C  ch3_D  ch3_E
0      1      0      1      0      0      1      0      0
1      0      1      0      0      1      0      1      0
2      1      0      0      1      0      0      0      1
However, what I would like to obtain is something like this:
A B C D E F G
1 1 1 0 0 0 0
0 0 1 1 0 0 1
1 0 0 0 1 1 0
Instead of having multiple columns representing the encoding, e.g. ch1_A and ch1_C, I only wish to have one group of columns (A, B, and so on), each with value 1 when that value shows up in any of the columns ch1, ch2, ch3.
To clarify, in my original dataset a single row won't contain the same value (A, B, C, ...) more than once; each value appears in at most one of the columns.
Recommended answer
Use stack and str.get_dummies, then sum over the original row index:
dum.stack().str.get_dummies().groupby(level=0).sum()
Out[938]:
   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0
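To see why this works, it may help to split the one-liner into its three steps. The following is a minimal sketch that rebuilds the question's dataframe; the intermediate names `stacked`, `indicators`, and `result` are only illustrative. It aggregates with `groupby(level=0).sum()`, because the `level` argument of `sum` was removed in pandas 2.0.

```python
import pandas as pd

# The example dataframe from the question.
dum = pd.DataFrame({
    "ch1": ["A", "C", "A"],
    "ch2": ["B", "G", "F"],
    "ch3": ["C", "D", "E"],
})

# Step 1: stack() moves the column labels into an inner index level,
# producing one long Series indexed by (row label, column label).
stacked = dum.stack()

# Step 2: str.get_dummies() creates one indicator column per distinct value,
# regardless of which original column the value came from.
indicators = stacked.str.get_dummies()

# Step 3: group by the outer index level (the original row label) and sum,
# collapsing the stacked entries of each row back into a single row.
result = indicators.groupby(level=0).sum()

print(result)
```

If a value could appear more than once in the same row, `sum()` would return a count greater than 1; aggregating with `max()` instead should keep the result binary. The question rules that case out, so `sum()` is sufficient here.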