scikit-learn: One hot encoding of string categorical features

Problem description

I'm trying to perform a one hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to preprocess this data using Scikit-Learn?

On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only discretizes integers.

So then you would use a LabelEncoder, which would encode the strings into integers. But then you have to apply a label encoder to each of the columns and store each one of these label encoders (as well as the columns they were applied to). This feels extremely clunky.
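To make the bookkeeping concrete, here is a minimal sketch of the per-column approach described above. The names `encoders` and `codes` are only illustrative, and the snippet assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

X = np.array([['a', 'dog', 'red'],
              ['b', 'cat', 'green']])

# One fitted encoder per column index. This dict has to be kept around
# so that later data can be encoded with the same string-to-integer mapping.
encoders = {i: LabelEncoder().fit(X[:, i]) for i in range(X.shape[1])}

# Re-apply each stored encoder to its own column and stack the results.
codes = np.column_stack(
    [encoders[i].transform(X[:, i]) for i in range(X.shape[1])]
)
```

Each encoder sorts the labels it has seen, so within a column the integer codes follow alphabetical order.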

So, what's the best way to do it in Scikit-Learn?

Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited in that you can't encode your training and test sets separately.
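The requirement here is that the categories are learned once from the training data and then reused unchanged on the test data, so both end up with the same column layout. A library-free sketch of that idea follows. The helpers `fit_categories` and `one_hot` are hypothetical names made up for illustration, not part of scikit-learn or pandas.

```python
def fit_categories(rows):
    """Learn the sorted set of categories seen in each column of the training rows."""
    return [sorted(set(column)) for column in zip(*rows)]


def one_hot(rows, categories):
    """One-hot encode rows using previously learned per-column categories.

    A value that was not seen during fitting becomes all zeros for its column.
    """
    encoded = []
    for row in rows:
        vector = []
        for value, column_categories in zip(row, categories):
            vector.extend(1 if value == c else 0 for c in column_categories)
        encoded.append(vector)
    return encoded


train = [['a', 'dog', 'red'], ['b', 'cat', 'green']]
test = [['b', 'dog', 'red']]

categories = fit_categories(train)          # learned from training data only
train_encoded = one_hot(train, categories)
test_encoded = one_hot(test, categories)    # same column layout as train_encoded
```

Because the test set is encoded with `categories` from the training set, a test set that happens to lack some category still produces vectors of the same width.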

Solution

Very nice question.

However, in some sense, it is a special case of something that comes up (at least for me) rather often: given sklearn stages applicable to subsets of the X matrix, I'd like to apply (possibly several of) them given the entire matrix. Here, for example, you have a stage that knows how to run on a single column, and you'd like to apply it thrice, once per column.

This is a classic case for using the Composite Design Pattern.

Here is a (sketch of a) reusable stage that accepts a dictionary mapping each column index to the transformation to apply to it:

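class ColumnApplier(object):
    """Apply a separate single-column stage to each mapped column of X."""

    def __init__(self, column_stages):
        # Dict mapping a column index to the stage applied to that column.
        self._column_stages = column_stages

    def fit(self, X, y=None):
        for i, k in self._column_stages.items():
            k.fit(X[:, i])

        return self

    def transform(self, X):
        # Copy into an object array so that encoded integers are not
        # silently cast back to strings when X has a string dtype.
        out = X.astype(object)
        for i, k in self._column_stages.items():
            out[:, i] = k.transform(X[:, i])

        return out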

Now, to use it in this context, starting with

import numpy as np
from sklearn import preprocessing

X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
y = np.array([1, 2])
X

you would just use it to map each column index to the transformation you want:

multi_encoder = ColumnApplier(
    {i: preprocessing.LabelEncoder() for i in range(3)})
multi_encoder.fit(X, None).transform(X)

Once you develop such a stage (I can't post the one I use), you can use it over and over for various settings.
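The stage above stops at integer codes. For the one hot step itself, a hedged sketch: assuming scikit-learn 0.20 or later, where to my knowledge OneHotEncoder accepts string-valued columns directly, the encoder can be fitted on the training set and reused on the test set. The train/test arrays below are made-up illustrative data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['a', 'dog', 'red'],
                    ['b', 'cat', 'green']])
X_test = np.array([['b', 'dog', 'blue']])  # 'blue' never appears in training

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train)

# The default output is a sparse matrix; convert to dense for inspection.
train_encoded = encoder.transform(X_train).toarray()
test_encoded = encoder.transform(X_test).toarray()
```

The learned categories are available per column in `encoder.categories_`, which is what keeps the column layout identical between the two sets.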
