One hot encoding of string categorical features


Question

I'm trying to perform a one hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to preprocess this data using Scikit-Learn?

On first instinct, you'd look to Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only discretizes integers.

So then you would use a LabelEncoder, which encodes the strings as integers. But then you have to apply a label encoder to each of the columns and store every one of those label encoders (as well as the columns they were applied to). This feels extremely clunky.
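To make the bookkeeping concrete, here is a minimal sketch of the per-column LabelEncoder approach described above. The `encoders` dictionary and the other variable names are illustrative assumptions, not part of any Scikit-Learn API.

```python
from sklearn.preprocessing import LabelEncoder

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

# Keep one fitted LabelEncoder per column, keyed by column index,
# so the same string-to-integer mapping can be reused on new rows.
encoders = {}
encoded_columns = []
for idx, column in enumerate(zip(*data)):   # zip(*data) walks the columns
    enc = LabelEncoder()
    encoded_columns.append(enc.fit_transform(list(column)))
    encoders[idx] = enc

# Transpose back into rows of plain integer codes.
encoded = [[int(code) for code in row] for row in zip(*encoded_columns)]
print(encoded)
```

Every column needs its own encoder because each LabelEncoder learns a single sorted vocabulary, and the resulting integer codes would still have to be passed through a one hot encoder afterwards.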

So, what's the best way to do it in Scikit-Learn?

Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited in that you can't encode your training and test sets separately.
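The limitation can be illustrated with a small sketch. The `train` and `test` frames below are made-up data, chosen so that the test set contains a category the training set never saw.

```python
import pandas as pd

# Hypothetical train/test split; 'bird' appears only in the test set,
# and 'green' appears only in the training set.
train = pd.DataFrame({'animal': ['dog', 'cat'], 'color': ['red', 'green']})
test = pd.DataFrame({'animal': ['dog', 'bird'], 'color': ['red', 'red']})

# get_dummies derives its columns from whatever values it is given,
# so encoding each frame on its own yields different column sets.
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)
print(train_cols)
print(test_cols)
```

Because the two column lists differ, a model fitted on the encoded training frame cannot be applied directly to the encoded test frame without realigning the columns by hand.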

Recommended answer

If you are on sklearn>0.20.dev0

In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: # Two categorical columns: one of strings, one of integers
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
    ...: 
Out[11]: 
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
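Applied to the dataset from the question, the same encoder can be fitted on the training rows and then reused on new rows. This is a sketch under a few assumptions: the `test` row is made up for illustration, and `handle_unknown='ignore'` is one possible choice for dealing with categories that were not seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['a', 'dog', 'red'],
                  ['b', 'cat', 'green']], dtype=object)
# Hypothetical new row; 'blue' was never seen during fitting.
test = np.array([['a', 'cat', 'blue']], dtype=object)

# handle_unknown='ignore' encodes unseen categories as all zeros
# for that feature instead of raising an error.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

train_encoded = enc.transform(train).toarray()
test_encoded = enc.transform(test).toarray()
print(enc.categories_)
print(train_encoded)
print(test_encoded)
```

A single fitted encoder stores the categories for every column in `categories_`, so there is no need to keep one encoder per column, and the training and test sets always end up with the same output columns.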

If you are on sklearn==0.20.dev0

In [30]: import numpy as np
    ...: from sklearn.preprocessing import CategoricalEncoder  # only in the 0.20 development version
    ...: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Another way to do it is to use category_encoders.

Here is an example:

% pip install category_encoders

import numpy as np
import category_encoders as ce

# The accepted keyword arguments depend on the installed category_encoders version
le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
# array([[1, 0, 1, 0, 1, 0],
#        [0, 1, 0, 1, 0, 1]])
