OneHotEncoder - encoding only some of categorical variable columns


Problem description

Let's assume that I have a pandas dataframe with the following column names:

  • 'age' (e.g. 33, 26, 51 etc)
  • 'seniority' (e.g. 'junior', 'senior' etc)
  • 'gender' (e.g. 'male', 'female')
  • 'salary' (e.g. 32000, 40000, 64000 etc)

I want to transform the seniority categorical variable into one-hot encoded values. To do so, I am doing the following:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])

from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(categorical_features=[1])
data = one_hot_encoder.fit_transform(data.values)

But then I get this error:

ValueError: could not convert string to float: 'gender'

at this line:

data = one_hot_encoder.fit_transform(data.values)

However, I have explicitly specified that categorical_features=[1] so only column 1 (seniority) should be considered for this one hot encoding.

How can I fix this error (other than, for example, by dropping the 'gender' column)?

I was using pandas.get_dummies in the past and I did not have this problem.

Solution

I think for this case you should stick to pd.get_dummies:

>>> data
   age seniority  gender  salary
0    1    junior    male       5
1    2    senior  female       6
2    3    junior  female       7

# One hot encode with get_dummies
# axis is passed by keyword; dtype=int keeps the 0/1 output shown below
data = pd.concat((data, pd.get_dummies(data.seniority, dtype=int)), axis=1)

>>> data
   age seniority  gender  salary  junior  senior
0    1    junior    male       5       1       0
1    2    senior  female       6       0       1
2    3    junior  female       7       1       0
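
If the goal is to encode only one column and leave the rest alone, get_dummies can also be pointed at specific columns. A minimal sketch, assuming the same toy frame as above (the variable name `encoded` is made up for illustration); the `columns` argument limits the encoding to the listed columns and replaces them with prefixed dummy columns:

```python
import pandas as pd

# Toy frame mirroring the example above
data = pd.DataFrame({
    "age": [1, 2, 3],
    "seniority": ["junior", "senior", "junior"],
    "gender": ["male", "female", "female"],
    "salary": [5, 6, 7],
})

# Encode only 'seniority'; the other columns, including the string
# column 'gender', are left untouched.
encoded = pd.get_dummies(data, columns=["seniority"])

print(encoded.columns.tolist())
# ['age', 'gender', 'salary', 'seniority_junior', 'seniority_senior']
```

Unlike the concat version, this drops the original 'seniority' column, which may or may not be what you want.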

The problem is that sklearn's OneHotEncoder (in the older versions that accept categorical_features) needs an array of ints as input, but the array data.values still contains the string representation of gender. You could, if you wanted, one-hot encode just the seniority values. That is not very nice if you want to know what the resulting features mean, though, because you have to pass the column names manually (which is unfeasible in a lot of cases):

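from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Map the seniority strings to integer codes (junior -> 0, senior -> 1)
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])

# sparse=False returns a dense array (scikit-learn 1.2+ calls this parameter sparse_output)
one_hot_encoder = OneHotEncoder(sparse=False)
# The encoder expects a 2-D input, hence reshape(-1, 1)
data[['junior','senior']] = one_hot_encoder.fit_transform(data['seniority'].values.reshape(-1,1))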

>>> data
   age  seniority  gender  salary  junior  senior
0    1          0    male       5     1.0     0.0
1    2          1  female       6     0.0     1.0
2    3          0  female       7     1.0     0.0
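
On newer scikit-learn releases (0.20 and later, as far as I know) OneHotEncoder accepts string columns directly and records the categories it found, so the names need not be typed by hand. A rough sketch under that version assumption, starting again from the original string-valued frame; `one_hot_df` and `result` are illustrative names:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring the example above (seniority still holds strings)
data = pd.DataFrame({
    "age": [1, 2, 3],
    "seniority": ["junior", "senior", "junior"],
    "gender": ["male", "female", "female"],
    "salary": [5, 6, 7],
})

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense.
# Only the 'seniority' column is handed to the encoder.
one_hot = encoder.fit_transform(data[["seniority"]]).toarray()

# categories_ holds the categories learned for each input column,
# so the new column names come from the data itself.
names = [str(c) for c in encoder.categories_[0]]
one_hot_df = pd.DataFrame(one_hot, columns=names, index=data.index)

result = pd.concat([data, one_hot_df], axis=1)
print(result)
```

Calling toarray() rather than passing a dense-output flag sidesteps the parameter rename between scikit-learn versions.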

Or, if the feature names don't matter:

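import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Map the seniority strings to integer codes (junior -> 0, senior -> 1)
label_encoder = LabelEncoder()
data['seniority'] = label_encoder.fit_transform(data['seniority'])

# sparse=False returns a dense array (scikit-learn 1.2+ calls this parameter sparse_output)
one_hot_encoder = OneHotEncoder(sparse=False)
# The new columns get the default integer names 0 and 1
data = pd.concat((data, pd.DataFrame(one_hot_encoder.fit_transform(data['seniority'].values.reshape(-1,1)))), axis=1)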

>>> data
   age  seniority  gender  salary    0    1
0    1          0    male       5  1.0  0.0
1    2          1  female       6  0.0  1.0
2    3          0  female       7  1.0  0.0

But in the end, pd.get_dummies does the job in a much nicer way (IMO)
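
For readers who do want to stay inside scikit-learn, the "only some columns" requirement from the question can probably be met with ColumnTransformer, which I believe was added in scikit-learn 0.20. This is a sketch under that assumption, not part of the original answer; the transformer label "seniority_onehot" and the variable names are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring the example above
data = pd.DataFrame({
    "age": [1, 2, 3],
    "seniority": ["junior", "senior", "junior"],
    "gender": ["male", "female", "female"],
    "salary": [5, 6, 7],
})

transformer = ColumnTransformer(
    # Apply the encoder to 'seniority' only
    transformers=[("seniority_onehot", OneHotEncoder(), ["seniority"])],
    # Keep every other column as-is, including the string column 'gender'
    remainder="passthrough",
    # A threshold of 0 means the combined result is always a dense array
    sparse_threshold=0,
)

result = transformer.fit_transform(data)
# Columns: the two one-hot columns first, then age, gender, salary
print(result)
```

The result is a plain array without column names, so the point about get_dummies being nicer for labeled output still stands.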
