Pandas - Convert a categorical column to binary encoded form


Question

I have a dataset that looks like this:

     yyyy      month        tmax         tmin
0    1908    January         5.0         -1.4
1    1908   February         7.3          1.9
2    1908      March         6.2          0.3
3    1908      April         7.4          2.1
4    1908        May        16.5          7.7
5    1908       June        17.7          8.7
6    1908       July        20.1         11.0
7    1908     August        17.5          9.7
8    1908  September        16.3          8.4
9    1908    October        14.6          8.0
10   1908   November         9.6          3.4
11   1908   December         5.8         -0.3
12   1909    January         5.0          0.1
13   1909   February         5.5         -0.3
14   1909      March         5.6         -0.3
15   1909      April        12.2          3.3
16   1909        May        14.7          4.8
17   1909       June        15.0          7.5
18   1909       July        17.3         10.8
19   1909     August        18.8         10.7
20   1909  September        14.5          8.1
21   1909    October        12.9          6.9
22   1909   November         7.5          1.7
23   1909   December         5.3          0.4
24   1910    January         5.2         -0.5
...

It has four variables: yyyy, month, tmax (maximum temperature), and tmin.

I want to use the month column as a variable when making predictions, so I want to convert it to a binary encoded form. Essentially, I want to add twelve variables to the dataset, named January through December. If a particular row has month equal to "January", then the January column should be 1 and the other 11 newly added columns should be 0.

I looked into pivot tables, but they don't help here. Any ideas on how to do this in a simple, elegant way?

Recommended answer

I think you need get_dummies:

# Assign to a new name so the original df (with its month column) stays available
dummies = pd.get_dummies(df['month'])
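For readers who want to try this without the full dataset, here is a minimal sketch. The three-row frame is a hypothetical sample shaped like the question's data, and the name `dummies` is only illustrative. It also passes `dtype=int`, since recent pandas versions return True/False columns by default rather than the 0/1 shown in this answer.

```python
import pandas as pd

# Hypothetical three-row sample in the shape of the question's data.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# One indicator column per distinct month value, in sorted (alphabetical) order.
# dtype=int forces 0/1 output instead of booleans.
dummies = pd.get_dummies(df["month"], dtype=int)
print(dummies)
```

Each row has exactly one indicator set, which is the encoding the question asks for.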

To add the new columns to the original DataFrame and remove month, use join with pop:

# pop removes 'month' from df in place and returns it for encoding
df2 = df.join(pd.get_dummies(df.pop('month')))
print(df2.head())
   yyyy  tmax  tmin  April  August  December  February  January  July  June  \
0  1908   5.0  -1.4      0       0         0         0        1     0     0   
1  1908   7.3   1.9      0       0         0         1        0     0     0   
2  1908   6.2   0.3      0       0         0         0        0     0     0   
3  1908   7.4   2.1      1       0         0         0        0     0     0   
4  1908  16.5   7.7      0       0         0         0        0     0     0   

   March  May  November  October  September  
0      0    0         0        0          0  
1      0    0         0        0          0  
2      1    0         0        0          0  
3      0    0         0        0          0  
4      0    1         0        0          0  

If you do not need to remove the month column:

df2 = df.join(pd.get_dummies(df['month']))
print(df2.head())
   yyyy     month  tmax  tmin  April  August  December  February  January  \
0  1908   January   5.0  -1.4      0       0         0         0        1   
1  1908  February   7.3   1.9      0       0         0         1        0   
2  1908     March   6.2   0.3      0       0         0         0        0   
3  1908     April   7.4   2.1      1       0         0         0        0   
4  1908       May  16.5   7.7      0       0         0         0        0   

   July  June  March  May  November  October  September  
0     0     0      0    0         0        0          0  
1     0     0      0    0         0        0          0  
2     0     0      1    0         0        0          0  
3     0     0      0    0         0        0          0  
4     0     0      0    1         0        0          0  
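Both join variants above encode the month Series separately and then attach the result. As an alternative sketch (not part of the original answer), get_dummies can take the whole DataFrame with a `columns` argument. It then replaces the listed column with indicator columns and leaves the source frame untouched. The sample frame and the name `encoded` are hypothetical. Note that the new columns get a `month_` prefix by default.

```python
import pandas as pd

# Hypothetical three-row sample in the shape of the question's data.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# Encode only the 'month' column; the result drops 'month' and appends
# prefixed indicator columns such as 'month_January'. df is not modified.
encoded = pd.get_dummies(df, columns=["month"], dtype=int)
print(encoded.columns.tolist())
```

This avoids the side effect of pop, which removes the column from the original frame.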

If you need the columns in calendar order, there are several possible solutions. One is reindex (reindex_axis, its older equivalent, was deprecated in pandas 0.21 and removed in 1.0):

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

df1 = pd.get_dummies(df['month']).reindex(months, axis=1)
print(df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0  

# Equivalent, using the columns keyword
df1 = pd.get_dummies(df['month']).reindex(columns=months)
print(df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0  

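Or convert the month column to an ordered categorical:

# Current pandas no longer accepts astype('category', categories=...); pass a CategoricalDtype instead
df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))
print(df1.head())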
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0  
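The two ordering approaches behave differently when some months are absent from the data, which the full dataset in the question does not show. The sketch below uses a hypothetical Series containing only two months. With reindex, the missing columns are filled with NaN unless `fill_value=0` is passed. With an ordered categorical, get_dummies emits a zero column for every declared category. The names `s`, `by_reindex`, and `by_category` are illustrative.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Hypothetical sample containing only two of the twelve months.
s = pd.Series(["March", "January", "March"], name="month")

# reindex adds the ten missing month columns; fill_value=0 keeps them
# from being filled with NaN.
by_reindex = pd.get_dummies(s, dtype=int).reindex(columns=months, fill_value=0)

# The categorical dtype declares all twelve categories, so get_dummies
# produces one column per category, in the declared order.
cat = s.astype(pd.CategoricalDtype(categories=months, ordered=True))
by_category = pd.get_dummies(cat, dtype=int)

print(by_reindex.shape, by_category.shape)
```

Either form gives a fixed set of twelve columns, which matters if the encoded frame feeds a model that expects the same features for every subset of the data.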
