Duplicate columns from Pandas get_dummies


Question

Take a data set like the following (output of df.head()):

individual  states
1           Alaska, Hawaii 
2           Hawaii, Alaska
3           Kansas, Iowa, Maryland
4           New Jersey, Newada
5           Newada, New Jersey

If I run

df['states'].str.get_dummies(sep=',')

I get the following:

    Hawaii  Iowa    Maryland    New Jersey  Newada  Alaska  Hawaii  Kansas  New Jersey  Newada
0   1   0   0   0   0   1   0   0   0   0
1   0   0   0   0   0   1   1   0   0   0
2   0   1   1   0   0   0   0   1   0   0
3   0   0   0   0   1   0   0   0   1   0
4   0   0   0   1   0   0   0   0   0   1

Note the duplicate (repeated) columns. The values differ between the occurrences of the same column, so I can't just drop them. Where does the problem come from, and how do I do this correctly? Thanks in advance!

Recommended answer

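The problem is the separator: it needs to be ', ' (a comma followed by a space). With sep=',' the space after each comma stays attached to the value, so some column names start with a space. Those names differ from the ones without the space, and separate columns are created for them: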

df1 = df['states'].str.get_dummies(sep=',')

print (df1.columns)
Index([' Alaska', ' Hawaii', ' Iowa', ' Maryland', ' New Jersey', ' Newada',
       'Alaska', 'Hawaii', 'Kansas', 'New Jersey', 'Newada'],
      dtype='object')


print (df1)
    Alaska   Hawaii   Iowa   Maryland   New Jersey   Newada  Alaska  Hawaii  \
0        0        1      0          0            0        0       1       0   
1        1        0      0          0            0        0       0       1   
2        0        0      1          1            0        0       0       0   
3        0        0      0          0            0        1       0       0   
4        0        0      0          0            1        0       0       0   

   Kansas  New Jersey  Newada  
0       0           0       0  
1       0           0       0  
2       1           0       0  
3       0           1       0  
4       0           0       1  


df2 = df['states'].str.get_dummies(sep=', ')
print (df2)
   Alaska  Hawaii  Iowa  Kansas  Maryland  New Jersey  Newada
0       1       1     0       0         0           0       0
1       1       1     0       0         0           0       0
2       0       0     1       1         1           0       0
3       0       0     0       0         0           1       1
4       0       0     0       0         0           1       1
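Using sep=', ' works when every entry is separated by exactly one comma and one space. If the spacing in the real data is inconsistent, a more defensive variant may help. The following is only a sketch: the irregular spacing in the last two sample rows is an assumption made for illustration and is not part of the original data, and the variable names (cleaned, dummies, raw, merged) are hypothetical. It shows two options: normalizing whitespace before calling get_dummies, and merging the duplicate columns after the fact.

```python
import pandas as pd

# Sample data mirroring the question. The inconsistent spacing in the
# last two rows is an assumption added to illustrate the problem.
df = pd.DataFrame(
    {
        "individual": [1, 2, 3, 4, 5],
        "states": [
            "Alaska, Hawaii",
            "Hawaii, Alaska",
            "Kansas, Iowa, Maryland",
            "New Jersey,Newada",      # no space after the comma
            " Newada ,  New Jersey",  # stray spaces around the values
        ],
    }
)

# Option 1: strip the ends, collapse whitespace around each comma,
# then split on a bare comma.
cleaned = (
    df["states"]
    .str.strip()
    .str.replace(r"\s*,\s*", ",", regex=True)
)
dummies = cleaned.str.get_dummies(sep=",")

# Option 2: if the dummies were already built with sep=',', strip the
# column names and merge columns that now share a name. On 0/1 data,
# max behaves like a logical OR.
raw = df["states"].str.get_dummies(sep=",")
raw.columns = raw.columns.str.strip()
merged = raw.T.groupby(level=0).max().T

print(dummies)
print(merged)
```

Both options should produce one column per state. Option 1 is usually simpler because the duplicate columns are never created; option 2 is only useful when the indicator frame already exists and the original strings are no longer at hand.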
