Handling unknown values for label encoding

Question

How can I handle unknown values for label encoding in scikit-learn? The label encoder simply raises an exception saying that new labels were detected.
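
As a minimal sketch of the failure described above, the snippet below fits a LabelEncoder on a few made-up city names and then asks it to transform a label it has never seen. The sample values and the variable name error_message are assumptions for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the labels available at training time (illustrative values).
le = LabelEncoder()
le.fit(["Paris", "New York", "Paris"])

# Transforming a label that was not present during fit raises ValueError.
try:
    le.transform(["Paris", "Utila"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```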

What I want is to encode categorical variables with a one-hot encoder. However, scikit-learn does not support strings for that, so I used a label encoder on each column.

My problem is that unknown labels show up in the cross-validation step of my pipeline. The basic one-hot encoder would have the option to ignore such cases. An a priori pandas.get_dummies / cat.codes conversion is not sufficient, because the pipeline should work with real-life, fresh incoming data, which might contain unknown labels as well.
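
To illustrate why an up-front pandas.get_dummies call is not enough, here is a small sketch with made-up data (the Series names and values are assumptions). Because get_dummies derives its columns from whatever data it is given, the training data and the new data end up with different column sets.

```python
import pandas as pd

# Illustrative training data and fresh incoming data.
train_city = pd.Series(["Buenos Aires", "New York", "Paris"], name="city")
new_city = pd.Series(["Paris", "Utila"], name="city")

# get_dummies builds one column per value found in the data it receives.
train_columns = list(pd.get_dummies(train_city).columns)
new_columns = list(pd.get_dummies(new_city).columns)

print(train_columns)  # ['Buenos Aires', 'New York', 'Paris']
print(new_columns)    # ['Paris', 'Utila']
```

A model trained on the first layout cannot consume the second one without extra alignment work.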

Can CountVectorizer be used for this?

Recommended answer

A more recent and simpler way of handling this problem with scikit-learn is to use the class sklearn.preprocessing.OneHotEncoder:

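from sklearn.preprocessing import OneHotEncoder

# `train` is a DataFrame of categorical columns, such as the one built in the old answer below.
# Categories not seen during fit are encoded as all zeros instead of raising an error.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# transform() returns a sparse matrix by default; toarray() makes it dense.
enc.transform(train).toarray()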
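
The following self-contained sketch shows what handle_unknown='ignore' does with a category that was absent during fit. The DataFrames train_df and new_df are illustrative assumptions, not the data from the original post.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative training data and fresh data containing an unseen city.
train_df = pd.DataFrame({"city": ["Buenos Aires", "New York", "Paris"]})
new_df = pd.DataFrame({"city": ["Paris", "Utila"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_df)

# 'Paris' gets its usual one-hot row; the unseen 'Utila' becomes all zeros.
encoded = enc.transform(new_df).toarray()
print(enc.categories_)
print(encoded)
```

The all-zero row means the unknown value is simply dropped from the encoding rather than mapped to a dedicated category.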

Old answer:

Several answers mention pandas.get_dummies as a method for this, but I feel the LabelEncoder approach is cleaner for implementing a model. Other similar answers mention using DictVectorizer, but converting the entire DataFrame to a dict is probably not a great idea either.

Let's assume the following problematic columns:

from sklearn import preprocessing
import numpy as np
import pandas as pd

train = {'city': ['Buenos Aires', 'New York', 'Istambul', 'Buenos Aires', 'Paris', 'Paris'],
        'letters': ['a', 'b', 'c', 'd', 'a', 'b']}
train = pd.DataFrame(train)

test = {'city': ['Buenos Aires', 'New York', 'Istambul', 'Buenos Aires', 'Paris', 'Utila'],
        'letters': ['a', 'b', 'c', 'a', 'b', 'b']}
test = pd.DataFrame(test)

Utila is a rarer city that is not present in the training data but appears in the test set, which we can treat as new data at inference time.

The trick is to convert this value to "other" and include "other" in the LabelEncoder object. We can then reuse the encoder in production.

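import bisect

c = 'city'
le = preprocessing.LabelEncoder()
train[c] = le.fit_transform(train[c])

# Map every city the encoder has not seen to the placeholder 'other'.
test[c] = test[c].map(lambda s: 'other' if s not in le.classes_ else s)

# Add 'other' to the known classes, keeping them sorted. Here 'other' sorts
# after the capitalized city names, so the codes already given to train stay valid.
le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, 'other')
le.classes_ = np.array(le_classes)
test[c] = le.transform(test[c])
test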

   city letters
0     0       a
1     2       b
2     1       c
3     0       a
4     3       b
5     4       b

To apply this to new data, all we need is to save one le object per column, which is easily done with pickle.
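
A minimal sketch of that idea follows. It assumes a dictionary named encoders keyed by column name and, as a variation on the code above, fits the encoder with "other" already included. It also round-trips through pickle.dumps / pickle.loads in memory; writing to a file with pickle.dump and pickle.load works the same way.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

# One encoder per column; 'other' is included so unknown values have a code.
encoders = {}
le = LabelEncoder()
le.fit(["Buenos Aires", "New York", "Paris", "other"])
encoders["city"] = le

# Serialize at training time, restore at inference time.
payload = pickle.dumps(encoders)
restored = pickle.loads(payload)

# Replace unseen values with 'other' before transforming fresh data.
incoming = ["Paris", "Utila"]
known = set(restored["city"].classes_)
cleaned = [s if s in known else "other" for s in incoming]
codes = restored["city"].transform(cleaned).tolist()
print(codes)  # [2, 3]
```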

This answer is based on this question, which wasn't totally clear to me, so I added this example.
