One hot encoding of string categorical features


Question

I'm trying to perform a one hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to preprocess this data using Scikit-Learn?

On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only discretizes integers.

So then you would use a LabelEncoder, which encodes the strings into integers. But then you have to apply a label encoder to each of the columns and store every one of these label encoders (as well as the columns they were applied to). This feels extremely clunky.
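For illustration only, the per-column bookkeeping described above might look roughly like the sketch below. The variable names (encoders, encoded_columns) are hypothetical and not part of the original question; the point is that one fitted LabelEncoder has to be kept per column.

```python
from sklearn.preprocessing import LabelEncoder

data = [["a", "dog", "red"],
        ["b", "cat", "green"]]

# Transpose the rows into columns so each feature can be encoded separately.
columns = [list(column) for column in zip(*data)]

encoders = []          # one fitted LabelEncoder per column, kept for later reuse
encoded_columns = []
for column in columns:
    le = LabelEncoder()
    encoded_columns.append(le.fit_transform(column))
    encoders.append(le)

# Transpose back to rows of plain Python ints.
encoded = [[int(value) for value in row] for row in zip(*encoded_columns)]
print(encoded)
```

The resulting integer matrix would still need a separate one hot encoding step, and every entry in encoders has to be stored alongside its column index to transform new data consistently.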

So, what's the best way to do it in Scikit-Learn?

Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited by the fact that you can't encode your training and test sets separately.
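A small sketch of that limitation, using made-up training and test frames (the column names and values here are assumptions for illustration): encoding each frame on its own yields different column sets whenever the categories differ.

```python
import pandas as pd

# Hypothetical training and test sets with partly different categories.
train = pd.DataFrame({"animal": ["dog", "cat"], "color": ["red", "green"]})
test = pd.DataFrame({"animal": ["dog", "bird"], "color": ["red", "red"]})

# Each call only sees the categories present in the frame it is given.
train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

print(sorted(train_encoded.columns))
print(sorted(test_encoded.columns))
```

The two encoded frames do not share the same columns, so a model fitted on one cannot be applied directly to the other without extra alignment work.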

Recommended answer

If you are on sklearn>0.20.dev0:

In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
    ...: 
Out[11]: 
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])
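Applied to the dataset from the question, the same encoder can be fitted on training data and then reused on separate test data, which is the behaviour the question asks for. This is a minimal sketch: the test row and the choice of handle_unknown="ignore" are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["a", "dog", "red"],
                  ["b", "cat", "green"]], dtype=object)
# Hypothetical test row containing a category ("blue") never seen in training.
test = np.array([["a", "cat", "blue"]], dtype=object)

# handle_unknown="ignore" encodes unseen categories as all zeros
# for that feature instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_)
print(train_encoded)
print(test_encoded)
```

Because the learned categories live on the fitted encoder, only that one object needs to be stored to encode new data with the same column layout.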

If you are on sklearn==0.20.dev0:

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Another approach

Here is an example:

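% pip install category_encoders
import numpy as np
import category_encoders as ce
# impute_missing and handle_unknown="ignore" are kept as given in the answer;
# newer category_encoders releases may not accept these arguments.
le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])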
