Pandas - Convert a categorical column to binary encoded form
Question
I have a dataset that looks like this:
yyyy month tmax tmin
0 1908 January 5.0 -1.4
1 1908 February 7.3 1.9
2 1908 March 6.2 0.3
3 1908 April 7.4 2.1
4 1908 May 16.5 7.7
5 1908 June 17.7 8.7
6 1908 July 20.1 11.0
7 1908 August 17.5 9.7
8 1908 September 16.3 8.4
9 1908 October 14.6 8.0
10 1908 November 9.6 3.4
11 1908 December 5.8 -0.3
12 1909 January 5.0 0.1
13 1909 February 5.5 -0.3
14 1909 March 5.6 -0.3
15 1909 April 12.2 3.3
16 1909 May 14.7 4.8
17 1909 June 15.0 7.5
18 1909 July 17.3 10.8
19 1909 August 18.8 10.7
20 1909 September 14.5 8.1
21 1909 October 12.9 6.9
22 1909 November 7.5 1.7
23 1909 December 5.3 0.4
24 1910 January 5.2 -0.5
...
It has four variables: yyyy, month, tmax (maximum temperature) and tmin.
I want to use the month column as a feature when making predictions, so I want to convert it to its binary encoded form. Essentially, I want to add twelve columns to the dataset, named January through December. If a row has "January" as its month, the January column should be 1 and the other 11 newly added columns should be 0.
I looked into pivot tables, but they don't help here. Any ideas on how to do this in a simple, elegant way?
Recommended answer
I think you need get_dummies:
df = pd.get_dummies(df['month'])
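As a minimal, self-contained sketch of what get_dummies returns, the snippet below rebuilds a few rows of the question's data by hand (the small frame and the name `dummies` are illustrative, not part of the original answer). It passes dtype=int explicitly, on the assumption that you want the 0/1 values shown in this answer: recent pandas versions return boolean columns by default.

```python
import pandas as pd

# A hand-built sample mirroring the first rows of the question's data.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# One indicator column per distinct month; dtype=int requests 0/1 values
# (without it, newer pandas versions produce True/False).
dummies = pd.get_dummies(df["month"], dtype=int)
print(dummies)
```

Each row has exactly one indicator set, and the columns come out in alphabetical order rather than calendar order.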
If you need to add the new columns to the original DataFrame and remove month, use join with pop:
df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
yyyy tmax tmin April August December February January July June \
0 1908 5.0 -1.4 0 0 0 0 1 0 0
1 1908 7.3 1.9 0 0 0 1 0 0 0
2 1908 6.2 0.3 0 0 0 0 0 0 0
3 1908 7.4 2.1 1 0 0 0 0 0 0
4 1908 16.5 7.7 0 0 0 0 0 0 0
March May November October September
0 0 0 0 0 0
1 0 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 1 0 0 0
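One detail worth keeping in mind with this pattern: pop deletes month from df itself, in place, before join runs. The sketch below illustrates that on a hand-built two-row sample (the sample and dtype=int are assumptions added for illustration, the latter to get 0/1 output on newer pandas).

```python
import pandas as pd

df = pd.DataFrame({
    "yyyy": [1908, 1908],
    "month": ["January", "February"],
    "tmax": [5.0, 7.3],
    "tmin": [-1.4, 1.9],
})

# df.pop("month") returns the column and removes it from df in place,
# so the joined result carries the indicators but no 'month' column.
df2 = df.join(pd.get_dummies(df.pop("month"), dtype=int))

print(df2.columns.tolist())
print("month" in df.columns)  # the original frame was modified as well
```

If the original frame must stay untouched, the variant without pop avoids this side effect.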
If you do NOT need to remove the month column:
df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
yyyy month tmax tmin April August December February January \
0 1908 January 5.0 -1.4 0 0 0 0 1
1 1908 February 7.3 1.9 0 0 0 1 0
2 1908 March 6.2 0.3 0 0 0 0 0
3 1908 April 7.4 2.1 1 0 0 0 0
4 1908 May 16.5 7.7 0 0 0 0 0
July June March May November October September
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0
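If you need the columns sorted in calendar order, there are several possible solutions. Use reindex or reindex_axis (note that reindex_axis only exists in older pandas versions; it was deprecated and later removed, so prefer reindex on current releases):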
months = ['January', 'February', 'March','April' ,'May', 'June', 'July', 'August', 'September','October', 'November','December']
df1 = pd.get_dummies(df['month']).reindex_axis(months, 1)
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
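When the data does not contain every month, reindex(columns=months) still creates the missing columns, but fills them with NaN by default. A hedged sketch of one way to handle that, using the fill_value parameter of reindex (the two-row sample and dtype=int are assumptions for illustration):

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Sample with only two of the twelve months present.
df = pd.DataFrame({"month": ["March", "January"]})

# fill_value=0 makes the columns for absent months all zeros instead of NaN.
df1 = pd.get_dummies(df["month"], dtype=int).reindex(columns=months, fill_value=0)
print(df1)
```

This keeps the feature matrix at twelve columns regardless of which months happen to appear, which matters if training and prediction data cover different months.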
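Or convert the month column to an ordered categorical. On current pandas versions, pass a CategoricalDtype; the older astype('category', categories=months, ordered=True) form is no longer accepted:
df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))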
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
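A self-contained sketch of the categorical approach, written with pd.CategoricalDtype, which is the form current pandas versions accept (the two-row sample, the name `month_type`, and dtype=int are illustrative assumptions). A useful side effect is that get_dummies emits a column for every declared category, including months that do not occur in the data.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Sample with only two of the twelve months present.
df = pd.DataFrame({"month": ["March", "January"]})

# Declare the full, ordered set of categories up front.
month_type = pd.CategoricalDtype(categories=months, ordered=True)

# Columns follow the category order, not alphabetical order.
df1 = pd.get_dummies(df["month"].astype(month_type), dtype=int)
print(df1.columns.tolist())
```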